Tag Archives: Intermediate (200)

How to automate rule management for AWS Network Firewall

Post Syndicated from Ajinkya Patil original https://aws.amazon.com/blogs/security/how-to-automate-rule-management-for-aws-network-firewall/

AWS Network Firewall is a stateful managed network firewall and intrusion detection and prevention service designed for the Amazon Virtual Private Cloud (Amazon VPC). This post concentrates on automating rule updates in a central Network Firewall by using distributed firewall configurations. If you’re new to Network Firewall or seeking a technical background on rule management, see AWS Network Firewall – New Managed Firewall Service in VPC.

Network Firewall offers three deployment models: Distributed, centralized, and combined. Many customers opt for a centralized model to reduce costs. In this model, customers allocate the responsibility for managing the rulesets to the owners of the VPC infrastructure (spoke accounts) being protected, thereby shifting accountability and providing flexibility to the spoke accounts. Managing rulesets in a shared firewall policy generated from distributed input configurations of protected VPCs (spoke accounts) is challenging without proper input validation, state-management, and request throttling controls.

In this post, we show you how to automate firewall rule management within the central firewall using distributed firewall configurations spread across multiple AWS accounts. The anfw-automate solution provides input-validation, state-management, and throttling controls, reducing the update time for firewall rule changes from minutes to seconds. Additionally, the solution reduces operational costs, including rule management overhead while integrating seamlessly with the existing continuous integration and continuous delivery (CI/CD) processes.


For this walkthrough, the following prerequisites must be met:

  • Basic knowledge of networking concepts such as routing and Classless Inter-Domain Routing (CIDR) range allocations.
  • Basic knowledge of YAML and JSON configuration formats, definitions, and schema.
  • Basic knowledge of Suricata Rule Format and Network Firewall rule management.
  • Basic knowledge of CDK deployment.
  • AWS Identity and Access Management (IAM) permissions to bootstrap the AWS accounts using AWS Cloud Development Kit (AWS CDK).
  • The firewall VPC in the central account must be reachable from a spoke account (see centralized deployment model). For this solution, you need two AWS accounts from the centralized deployment model:
    • The spoke account is the consumer account the defines firewall rules for the account and uses central firewall endpoints for traffic filtering. At least one spoke account is required to simulate the user workflow in validation phase.
    • The central account is an account that contains the firewall endpoints. This account is used by application and the Network Firewall.
  • StackSets deployment with service-managed permissions must be enabled in AWS Organizations (Activate trusted access with AWS Organizations). A delegated administrator account is required to deploy AWS CloudFormation stacks in any account in an organization. The CloudFormation StackSets in this account deploy the necessary CloudFormation stacks in the spoke accounts. If you don’t have a delegated administrator account, you must manually deploy the resources in the spoke account. Manual deployment isn’t recommended in production environments.
  • A resource account is the CI/CD account used to deploy necessary AWS CodePipeline stacks. The pipelines deploy relevant cross-account cross-AWS Region stacks to the preceding AWS accounts.
    • IAM permissions to deploy CDK stacks in the resource account.

Solution description

In Network Firewall, each firewall endpoint connects to one firewall policy, which defines network traffic monitoring and filtering behavior. The details of the behavior are defined in rule groups — a reusable set of rules — for inspecting and handling network traffic. The rules in the rule groups provide the details for packet inspection and specify the actions to take when a packet matches the inspection criteria. Network Firewall uses a Suricata rules engine to process all stateful rules. Currently, you can create Suricata compatible or basic rules (such as domain list) in Network Firewall. We use Suricata compatible rule strings within this post to maintain maximum compatibility with most use cases.

Figure 1 describes how the anfw-automate solution uses the distributed firewall rule configurations to simplify rule management for multiple teams. The rules are validated, transformed, and stored in the central AWS Network Firewall policy. This solution isolates the rule generation to the spoke AWS accounts, but still uses a shared firewall policy and a central ANFW for traffic filtering. This approach grants the AWS spoke account owners the flexibility to manage their own firewall rules while maintaining the accountability for their rules in the firewall policy. The solution enables the central security team to validate and override user defined firewall rules before pushing them to the production firewall policy. The security team operating the central firewall can also define additional rules that are applied to all spoke accounts, thereby enforcing organization-wide security policies. The firewall rules are then compiled and applied to Network Firewall in seconds, providing near real-time response in scenarios involving critical security incidents.

Figure 1: Workflow launched by uploading a configuration file to the configuration (config) bucket

Figure 1: Workflow launched by uploading a configuration file to the configuration (config) bucket

The Network Firewall firewall endpoints and anfw-automate solution are both deployed in the central account. The spoke accounts use the application for rule automation and the Network Firewall for traffic inspection.

As shown in Figure 1, each spoke account contains the following:

  1. An Amazon Simple Storage Service (Amazon S3) bucket to store multiple configuration files, one per Region. The rules defined in the configuration files are applicable to the VPC traffic in the spoke account. The configuration files must comply with the defined naming convention ($Region-config.yaml) and be validated to make sure that only one configuration file exists per Region per account. The S3 bucket has event notifications enabled that publish all changes to configuration files to a local default bus in Amazon EventBridge.
  2. EventBridge rules to monitor the default bus and forward relevant events to the custom event bus in the central account. The EventBridge rules specifically monitor VPCDelete events published by Amazon CloudTrail and S3 event notifications. When a VPC is deleted from the spoke account, the VPCDelete events lead to the removal of corresponding rules from the firewall policy. Additionally, all create, update, and delete events from Amazon S3 event notifications invoke corresponding actions on the firewall policy.
  3. Two AWS Identity and Access Manager (IAM) roles with keywords xaccount.lmb.rc and xaccount.lmb.re are assumed by RuleCollect and RuleExecute functions in the central account, respectively.
  4. A CloudWatch Logs log group to store event processing logs published by the central AWS Lambda application.

In the central account:

  1. EventBridge rules monitor the custom event bus and invoke a Lambda function called RuleCollect. A dead-letter queue is attached to the EventBridge rules to store events that failed to invoke the Lambda function.
  2. The RuleCollect function retrieves the config file from the spoke account by assuming a cross-account role. This role is deployed by the same stack that created the other spoke account resources. The Lambda function validates the request, transforms the request to the Suricata rule syntax, and publishes the rules to an Amazon Simple Queue Service (Amazon SQS) first-in-first-out (FIFO) queue. Input validation controls are paramount to make sure that users don’t abuse the functionality of the solution and bypass central governance controls. The Lambda function has input validation controls to verify the following:
    • The VPC ID in the configuration file exists in the configured Region and the same AWS account as the S3 bucket.
    • The Amazon S3 object version ID received in the event matches the latest version ID to mitigate race conditions.
    • Users don’t have only top-level domains (for example, .com, .de) in the rules.
    • The custom Suricata rules don’t have any as the destination IP address or domain.
    • The VPC identifier matches the required format, that is, a+(AWS Account ID)+(VPC ID without vpc- prefix) in custom rules. This is important to have unique rule variables in rule groups.
    • The rules don’t use security sensitive keywords such as sid, priority, or metadata. These keywords are reserved for firewall administrators and the Lambda application.
    • The configured VPC is attached to an AWS Transit Gateway.
    • Only pass rules exist in the rule configuration.
    • CIDR ranges for a VPC are mapped appropriately using IP set variables.

    The input validations make sure that rules defined by one spoke account don’t impact the rules from other spoke accounts. The validations applied to the firewall rules can be updated and managed as needed based on your requirements. The rules created must follow a strict format, and deviation from the preceding rules will lead to the rejection of the request.

  3. The Amazon SQS FIFO queue preserves the order of create, update, and delete operations run in the configuration bucket of the spoke account. These state-management controls maintain consistency between the firewall rules in the configuration file within the S3 bucket and the rules in the firewall policy. If the sequence of updates provided by the distributed configurations isn’t honored, the rules in a firewall policy might not match the expected ruleset.

    Rules not processed beyond the maxReceiveCount threshold are moved to a dead-letter SQS queue for troubleshooting.

  4. The Amazon SQS messages are subsequently consumed by another Lambda function called RuleExecute. Multiple changes to one configuration are batched together in one message. The RuleExecute function parses the messages and generates the required rule groups, IP set variables, and rules within the Network Firewall. Additionally, the Lambda function establishes a reserved rule group, which can be administered by the solution’s administrators and used to define global rules. The global rules, applicable to participating AWS accounts, can be managed in the data/defaultdeny.yaml file by the central security team.

    The RuleExecute function also implements throttling controls to make sure that rules are applied to the firewall policy without reaching the ThrottlingException from Network Firewall (see common errors). The function also implements back-off logic to handle this exception. This throttling effect can happen if there are too many requests issued to the Network Firewall API.

    The function makes cross-Region calls to Network Firewall based on the Region provided in the user configuration. There is no need to deploy the RuleExecute and RuleCollect Lambda functions in multiple Regions unless a use case warrants it.


The following section guides you through the deployment of the rules management engine.

  • Deployment: Outlines the steps to deploy the solution into the target AWS accounts.
  • Validation: Describes the steps to validate the deployment and ensure the functionality of the solution.
  • Cleaning up: Provides instructions for cleaning up the deployment.


In this phase, you deploy the application pipeline in the resource account. The pipeline is responsible for deploying multi-Region cross-account CDK stacks in both the central account and the delegated administrator account.

If you don’t have a functioning Network Firewall firewall using the centralized deployment model in the central account, see the README for instructions on deploying Amazon VPC and Network Firewall stacks before proceeding. You need to deploy the Network Firewall in centralized deployment in each Region and Availability Zone used by spoke account VPC infrastructure.

The application pipeline stack deploys three stacks in all configured Regions: LambdaStack and ServerlessStack in the central account and StacksetStack in the delegated administrator account. It’s recommended to deploy these stacks solely in the primary Region, given that the solution can effectively manage firewall policies across all supported Regions.

  • LambdaStack deploys the RuleCollect and RuleExecute Lambda functions, Amazon SQS FIFO queue, and SQS FIFO dead-letter queue.
  • ServerlessStack deploys EventBridge bus, EventBridge rules, and EventBridge Dead-letter queue.
  • StacksetStack deploys a service-managed stack set in the delegated administrator account. The stack set includes the deployment of IAM roles, EventBridge rules, an S3 Bucket, and a CloudWatch log group in the spoke account. If you’re manually deploying the CloudFormation template (templates/spoke-serverless-stack.yaml) in the spoke account, you have the option to disable this stack in the application configuration.
    Figure 2: CloudFormation stacks deployed by the application pipeline

    Figure 2: CloudFormation stacks deployed by the application pipeline

To prepare for bootstrapping

  1. Install and configure profiles for all AWS accounts using Amazon Command Line Interface (AWS CLI)
  2. Install the Cloud Development Kit (CDK)
  3. Install Git and clone the GitHub repo
  4. Install and enable Docker Desktop

To prepare for deployment

  1. Follow the README and cdk bootstrapping guide to bootstrap the resource account. Then, bootstrap the central account and delegated administrator account (optional if StacksetStack is deployed manually in the spoke account) to trust the resource account. The spoke accounts don’t need to be bootstrapped.
  2. Create a folder to be referred to as <STAGE>, where STAGE is the name of your deployment stage — for example, local, dev, int, and so on — in the conf folder of the cloned repository. The deployment stage is set as the STAGE parameter later and used in the AWS resource names.
  3. Create global.json in the <STAGE> folder. Follow the README to update the parameter values. A sample JSON file is provided in conf/sample folder.
  4. Run the following commands to configure the local environment:
    npm install
    export STAGE=<STAGE>
    export AWS_REGION=<AWS_Region_to_deploy_pipeline_stack>

To deploy the application pipeline stack

  1. Create a file named app.json in the <STAGE> folder and populate the parameters in accordance with the README section and defined schema.
  2. If you choose to manage the deployment of spoke account stacks using the delegated administrator account and have set the deploy_stacksets parameter to true, create a file named stackset.json in the <STAGE> folder. Follow the README section to align with the requirements of the defined schema.

    You can also deploy the spoke account stack manually for testing using the AWS CloudFormation template in templates/spoke-serverless-stack.yaml. This will create and configure the needed spoke account resources.

  3. Run the following commands to deploy the application pipeline stack:
    export STACKNAME=app && make deploy

    Figure 3: Example output of application pipeline deployment

    Figure 3: Example output of application pipeline deployment

After deploying the solution, each spoke account is required to configure stateful rules for every VPC in the configuration file and upload it to the S3 bucket. Each spoke account owner must verify the VPC’s connection to the firewall using the centralized deployment model. The configuration, presented in the YAML configuration language, might encompass multiple rule definitions. Each account must furnish one configuration file per VPC to establish accountability and non-repudiation.


Now that you’ve deployed the solution, follow the next steps to verify that it’s completed as expected, and then test the application.

To validate deployment

  1. Sign in to the AWS Management Console using the resource account and go to CodePipeline.
  2. Verify the existence of a pipeline named cpp-app-<aws_ organization_scope>-<project_name>-<module_name>-<STAGE> in the configured Region.
  3. Verify that stages exist in each pipeline for all configured Regions.
  4. Confirm that all pipeline stages exist. The LambdaStack and ServerlessStack stages must exist in the cpp-app-<aws_organization_scope>-<project_name>-<module_name>-<STAGE> stack. The StacksetStack stage must exist if you set the deploy_stacksets parameter to true in global.json.

To validate the application

  1. Sign in and open the Amazon S3 console using the spoke account.
  2. Follow the schema defined in app/RuleCollect/schema.json and create a file with naming convention ${Region}-config.yaml. Note that the Region in the config file is the destination Region for the firewall rules. Verify that the file has valid VPC data and rules.
    Figure 4: Example configuration file for eu-west-1 Region

    Figure 4: Example configuration file for eu-west-1 Region

  3. Upload the newly created config file to the S3 bucket named anfw-allowlist-<AWS_REGION for application stack>-<Spoke Account ID>-<STAGE>.
  4. If the data in the config file is invalid, you will see ERROR and WARN logs in the CloudWatch log group named cw-<aws_organization_scope>-<project_name>-<module_name>-CustomerLog-<STAGE>.
  5. If all the data in the config file is valid, you will see INFO logs in the same CloudWatch log group.
    Figure 5: Example of logs generated by the anfw-automate in a spoke account

    Figure 5: Example of logs generated by the anfw-automate in a spoke account

  6. After the successful processing of the rules, sign in to the Network Firewall console using the central account.
  7. Navigate to the Network Firewall rule groups and search for a rule group with a randomly assigned numeric name. This rule group will contain your Suricata rules after the transformation process.
    Figure 6: Rules created in Network Firewall rule group based on the configuration file in Figure 4

    Figure 6: Rules created in Network Firewall rule group based on the configuration file in Figure 4

  8. Access the Network Firewall rule group identified by the suffix reserved. This rule group is designated for administrators and global rules. Confirm that the rules specified in app/data/defaultdeny.yaml have been transformed into Suricata rules and are correctly placed within this rule group.
  9. Instantiate an EC2 instance in the VPC specified in the configuration file and try to access both the destinations allowed in the file and any destination not listed. Note that requests to destinations not defined in the configuration file are blocked.

Cleaning up

To avoid incurring future charges, remove all stacks and instances used in this walkthrough.

  1. Sign in to both the central account and the delegated admin account. Manually delete the stacks in the Regions configured for the app parameter in global.json. Ensure that the stacks are deleted for all Regions specified for the app parameter. You can filter the stack names using the keyword <aws_organization_scope>-<project_name>-<module_name> as defined in global.json.
  2. After deleting the stacks, remove the pipeline stacks using the same command as during deployment, replacing cdk deploy with cdk destroy.
  3. Terminate or stop the EC2 instance used to test the application.


This solution simplifies network security by combining distributed ANFW firewall configurations in a centralized policy. Automated rule management can help reduce operational overhead, reduces firewall change request completion times from minutes to seconds, offloads security and operational mechanisms such as input validation, state-management, and request throttling, and enables central security teams to enforce global firewall rules without compromising on the flexibility of user-defined rulesets.

In addition to using this application through S3 bucket configuration management, you can integrate this tool with GitHub Actions into your CI/CD pipeline to upload the firewall rule configuration to an S3 bucket. By combining GitHub actions, you can automate configuration file updates with automated release pipeline checks, such as schema validation and manual approvals. This enables your team to maintain and change firewall rule definitions within your existing CI/CD processes and tools. You can go further by allowing access to the S3 bucket only through the CI/CD pipeline.

Finally, you can ingest the AWS Network Firewall logs into one of our partner solutions for security information and event management (SIEM), security monitoring, threat intelligence, and managed detection and response (MDR). You can launch automatic rule updates based on security events detected by these solutions, which can help reduce the response time for security events.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Ajinkya Patil

Ajinkya Patil

Ajinkya is a Security Consultant at Amazon Professional Services, specializing in security consulting for AWS customers within the automotive industry since 2019. He has presented at AWS re:Inforce and contributed articles to the AWS Security blog and AWS Prescriptive Guidance. Beyond his professional commitments, he indulges in travel and photography.

Stephan Traub

Stephan Traub

Stephan is a Security Consultant working for automotive customers at AWS Professional Services. He is a technology enthusiast and passionate about helping customers gain a high security bar in their cloud infrastructure. When Stephan isn’t working, he’s playing volleyball or traveling with his family around the world.

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics: Part 2

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-2-enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/

Monitoring data pipelines in real time is critical for catching issues early and minimizing disruptions. AWS Glue has made this more straightforward with the launch of AWS Glue job observability metrics, which provide valuable insights into your data integration pipelines built on AWS Glue. However, you might need to track key performance indicators across multiple jobs. In this case, a dashboard that can visualize the same metrics with the ability to drill down into individual issues is an effective solution to monitor at scale.

This post, walks through how to integrate AWS Glue job observability metrics with Grafana using Amazon Managed Grafana. We discuss the types of metrics and charts available to surface key insights along with two use cases on monitoring error classes and throughput of your AWS Glue jobs.

Solution overview

Grafana is an open source visualization tool that allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. With Grafana, you can create, explore, and share visually rich, data-driven dashboards. The new AWS Glue job observability metrics can be effortlessly integrated with Grafana for real-time monitoring purpose. Metrics like worker utilization, skewness, I/O rate, and errors are captured and visualized in easy-to-read Grafana dashboards. The integration with Grafana provides a flexible way to build custom views of pipeline health tailored to your needs. Observability metrics open up monitoring capabilities that weren’t possible before for AWS Glue. Companies relying on AWS Glue for critical data integration pipelines can have greater confidence that their pipelines are running efficiently.

AWS Glue job observability metrics are emitted as Amazon CloudWatch metrics. You can provision and manage Amazon Managed Grafana, and configure the CloudWatch plugin for the given metrics. The following diagram illustrates the solution architecture.

Implement the solution

Complete following steps to set up the solution:

  1. Set up an Amazon Managed Grafana workspace.
  2. Sign in to your workspace.
  3. Choose Administration.
  4. Choose Add new data source.
  5. Choose CloudWatch.
  6. For Default Region, select your preferred AWS Region.
  7. For Namespaces of Custom Metrics, enter Glue.
  8. Choose Save & test.

Now the CloudWatch data source has been registered.

  1. Copy the data source ID from the URL https://g-XXXXXXXXXX.grafana-workspace.<region>.amazonaws.com/datasources/edit/<data-source-ID>/.

The next step is to prepare the JSON template file.

  1. Download the Grafana template.
  2. Replace <data-source-id> in the JSON file with your Grafana data source ID.

Lastly, configure the dashboard.

  1. On the Grafana console, choose Dashboards.
  2. Choose Import on the New menu.
  3. Upload your JSON file, and choose Import.

The Grafana dashboard visualizes AWS Glue observability metrics, as shown in the following screenshots.

The sample dashboard has the following charts:

  • [Reliability] Job Run Errors Breakdown
  • [Throughput] Bytes Read & Write
  • [Throughput] Records Read & Write
  • [Resource Utilization] Worker Utilization
  • [Job Performance] Skewness
  • [Resource Utilization] Disk Used (%)
  • [Resource Utilization] Disk Available (GB)
  • [Executor OOM] OOM Error Count
  • [Executor OOM] Heap Memory Used (%)
  • [Driver OOM] OOM Error Count
  • [Driver OOM] Heap Memory Used (%)

Analyze the causes of job failures

Let’s try analyzing the causes of job run failures of the job iot_data_processing.

First, look at the pie chart [Reliability] Job Run Errors Breakdown. This pie chart quickly identifies which errors are most common.

Then filter with the job name iot_data_processing to see the common errors for this job.

We can observe that the majority (75%) of failures were due to glue.error.DISK_NO_SPACE_ERROR.

Next, look at the line chart [Resource Utilization] Disk Used (%) to understand the driver’s used disk space during the job runs. For this job, the green line shows the driver’s disk usage, and the yellow line shows the average of the executors’ disk usage.

We can observe that there were three times when 100% of disk was used in executors.

Next, look at the line chart [Throughput] Records Read & Write to see whether the data volume was changed and whether it impacted disk usage.

The chart shows that around four billion records were read at the beginning of this range; however, around 63 billion records were read at the peak. This means that the incoming data volume has significantly increased, and caused local disk space shortage in the worker nodes. For such cases, you can increase the number of workers, enable auto scaling, or choose larger worker types.

After implementing those suggestions, we can see lower disk usage and a successful job run.

(Optional) Configure cross-account setup

We can optionally configure a cross-account setup. Cross-account metrics depend on CloudWatch cross-account observability. In this setup, we expect the following environment:

  • AWS accounts are not managed in AWS Organizations
  • You have two accounts: one account is used as the monitoring account where Grafana is located, another account is used as the source account where the AWS Glue-based data integration pipeline is located

To configure a cross-account setup for this environment, complete the following steps for each account.

Monitoring account

Complete the following steps to configure your monitoring account:

  1. Sign in to the AWS Management Console using the account you will use for monitoring.
  2. On the CloudWatch console, choose Settings in the navigation pane.
  3. Under Monitoring account configuration, choose Configure.
  4. For Select data, choose Metrics.
  5. For List source accounts, enter the AWS account ID of the source account that this monitoring account will view.
  6. For Define a label to identify your source account, choose Account name.
  7. Choose Configure.

Now the account is successfully configured as a monitoring account.

  1. Under Monitoring account configuration, choose Resources to link accounts.
  2. Choose Any account to get a URL for setting up individual accounts as source accounts.
  3. Choose Copy URL.

You will use the copied URL from the source account in the next steps.

Source account

Complete the following steps to configure your source account:

  1. Sign in to the console using your source account.
  2. Enter the URL that you copied from the monitoring account.

You can see the CloudWatch settings page, with some information filled in.

  1. For Select data, choose Metrics.
  2. Do not change the ARN in Enter monitoring account configuration ARN.
  3. The Define a label to identify your source account section is pre-filled with the label choice from the monitoring account. Optionally, choose Edit to change it.
  4. Choose Link.
  5. Enter Confirm in the box and choose Confirm.

Now your source account has been configured to link to the monitoring account. The metrics emitted in the source account will show on the Grafana dashboard in the monitoring account.

To learn more, see CloudWatch cross-account observability.


The following are some considerations when using this solution:

  • Grafana integration is defined for real-time monitoring. If you have a basic understanding of your jobs, it will be straightforward for you to monitor performance, errors, and more on the Grafana dashboard.
  • Amazon Managed Grafana depends on AWS IAM Identify Center. This means you need to manage single sign-on (SSO) users separately, not just AWS Identity and Access Management (IAM) users and roles. It also requires another sign-in step from the AWS console. The Amazon Managed Grafana pricing model depends on an active user license per workspace. More users can cause more charges.
  • Graph lines are visualized per job. If you want to see the lines across all the jobs, you can choose ALL in the control.


AWS Glue job observability metrics offer a powerful new capability for monitoring data pipeline performance in real time. By streaming key metrics to CloudWatch and visualizing them in Grafana, you gain more fine-grained visibility that wasn’t possible before. This post showed how straightforward it is to enable observability metrics and integrate the data with Grafana using Amazon Managed Grafana. We explored the different metrics available and how to build customized Grafana dashboards to surface actionable insights.

Observability is now an essential part of robust data orchestration on AWS. With the ability to monitor data integration trends in real time, you can optimize costs, performance, and reliability.

About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Xiaoxi Liu is a Software Development Engineer on the AWS Glue team. Her passion is building scalable distributed systems for efficiently managing big data on the cloud, and her concentrations are distributed system, big data, and cloud computing.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Shenoda Guirguis is a Senior Software Development Engineer on the AWS Glue team. His passion is in building scalable and distributed data infrastructure and processing systems. When he gets a chance, Shenoda enjoys reading and playing soccer.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18-year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with interactive and simple to use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, databases and data-warehouses on cloud.

Automate AWS Clean Rooms querying and dashboard publishing using AWS Step Functions and Amazon QuickSight – Part 2

Post Syndicated from Venkata Kampana original https://aws.amazon.com/blogs/big-data/automate-aws-clean-rooms-querying-and-dashboard-publishing-using-aws-step-functions-and-amazon-quicksight-part-2/

Public health organizations need access to data insights that they can quickly act upon, especially in times of health emergencies, when data needs to be updated multiple times daily. For example, during the COVID-19 pandemic, access to timely data insights was critically important for public health agencies worldwide as they coordinated emergency response efforts. Up-to-date information and analysis empowered organizations to monitor the rapidly changing situation and direct resources accordingly.

This is the second post in this series; we recommend that you read this first post before diving deep into this solution. In our first post, Enable data collaboration among public health agencies with AWS Clean Rooms – Part 1 , we showed how public health agencies can create AWS Clean Room collaborations, invite other stakeholders to join the collaboration, and run queries on their collective data without either party having to share or copy underlying data with each other. As mentioned in the previous blog, AWS Clean Rooms enables multiple organizations to analyze their data and unlock insights they can act upon, without having to share sensitive, restricted, or proprietary records.

However, public health organizations leaders and decision-making officials don’t directly access data collaboration outputs from their Amazon Simple Storage Service (Amazon S3) buckets. Instead, they rely on up-to-date dashboards that help them visualize data insights to make informed decisions quickly.

To ensure these dashboards showcase the most updated insights, the organization builders and data architects need to catalog and update AWS Clean Rooms collaboration outputs on an ongoing basis, which often involves repetitive and manual processes that, if not done well, could delay your organization’s access to the latest data insights.

Manually handling repetitive daily tasks at scale poses risks like delayed insights, miscataloged outputs, or broken dashboards. At a large volume, it would require around-the-clock staffing, straining budgets. This manual approach could expose decision-makers to inaccurate or outdated information.

Automating repetitive workflows, validation checks, and programmatic dashboard refreshes removes human bottlenecks and help decrease inaccuracies. Automation helps ensure continuous, reliable processes that deliver the most current data insights to leaders without delays, all while streamlining resources.

In this post, we explain an automated workflow using AWS Step Functions and Amazon QuickSight to help organizations access the most current results and analyses, without delays from manual data handling steps. This workflow implementation will empower decision-makers with real-time visibility into the evolving collaborative analysis outputs, ensuring they have up-to-date, relevant insights that they can act upon quickly

Solution overview

The following reference architecture illustrates some of the foundational components of clean rooms query automation and publishing dashboards using AWS services. We automate running queries using Step Functions with Amazon EventBridge schedules, build an AWS Glue Data Catalog on query outputs, and publish dashboards using QuickSight so they automatically refresh with new data. This allows public health teams to monitor the most recent insights without manual updates.

The architecture consists of the following components, as numbered in the preceding figure:

  1. A scheduled event rule on EventBridge triggers a Step Functions workflow.
  2. The Step Functions workflow initiates the run of a query using the StartProtectedQuery AWS Clean Rooms API. The submitted query runs securely within the AWS Clean Rooms environment, ensuring data privacy and compliance. The results of the query are then stored in a designated S3 bucket, with a unique protected query ID serving as the prefix for the stored data. This unique identifier is generated by AWS Clean Rooms for each query run, maintaining clear segregation of results.
  3. When the AWS Clean Rooms query is successfully complete, the Step Functions workflow calls the AWS Glue API to update the location of the table in the AWS Glue Data Catalog with the Amazon S3 location where the query results were uploaded in Step 2.
  4. Amazon Athena uses the catalog from the Data Catalog to query the information using standard SQL.
  5. QuickSight is used to query, build visualizations, and publish dashboards using the data from the query results.


For this walkthrough, you need the following:

Launch the CloudFormation stack

In this post, we provide a CloudFormation template to create the following resources:

  • An EventBridge rule that triggers the Step Functions state machine on a schedule
  • An AWS Glue database and a catalog table
  • An Athena workgroup
  • Three S3 buckets:
    • For AWS Clean Rooms to upload the results of query runs
    • For Athena to upload the results for the queries
    • For storing access logs of other buckets
  • A Step Functions workflow designed to run the AWS Clean Rooms query, upload the results to an S3 bucket, and update the table location with the S3 path in the AWS Glue Data Catalog
  • An AWS Key Management Service (AWS KMS) customer-managed key to encrypt the data in S3 buckets
  • AWS Identity and Access Management (IAM) roles and policies with the necessary permissions

To create the necessary resources, complete the following steps:

  1. Choose Launch Stack:

Launch Button

  1. Enter cleanrooms-query-automation-blog for Stack name.
  2. Enter the membership ID from the AWS Clean Rooms collaboration you created in Part 1 of this series.
  3. Choose Next.

  1. Choose Next again.
  2. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Create stack.

After you run the CloudFormation template and create the resources, you can find the following information on the stack Outputs tab on the AWS CloudFormation console:

  • AthenaWorkGroup – The Athena workgroup
  • EventBridgeRule – The EventBridge rule triggering the Step Functions state machine
  • GlueDatabase – The AWS Glue database
  • GlueTable – The AWS Glue table storing metadata for AWS Clean Rooms query results
  • S3Bucket – The S3 bucket where AWS Clean Rooms uploads query results
  • StepFunctionsStateMachine – The Step Functions state machine

Test the solution

The EventBridge rule named cleanrooms_query_execution_Stepfunctions_trigger is scheduled to trigger every 1 hour. When this rule is triggered, it initiates the run of the CleanRoomsBlogStateMachine-XXXXXXX Step Functions state machine. Complete the following steps to test the end-to-end flow of this solution:

  1. On the Step Functions console, navigate to the state machine you created.
  2. On the state machine details page, locate the latest query run.

The details page lists the completed steps:

  • The state machine submits a query to AWS Clean Rooms using the startProtectedQuery API. The output of the API includes the query run ID and its status.
  • The state machine waits for 30 seconds before checking the status of the query run.
  • After 30 seconds, the state machine checks the query status using the getProtectedQuery API. When the status changes to SUCCESS, it proceeds to the next step to retrieve the AWS Glue table metadata information. The output of this step contains the S3 location to which the query run results are uploaded.
  • The state machine retrieves the metadata of the AWS Glue table named patientimmunization, which was created via the CloudFormation stack.
  • The state machine updates the S3 location (the location to which AWS Clean Rooms uploaded the results) in the metadata of the AWS Glue table.
  • After a successful update of the AWS Glue table metadata, the state machine is complete.
  1. On the Athena console, switch the workgroup to CustomWorkgroup.
  2. Run the following query:
“SELECT * FROM "cleanrooms_patientdb "."patientimmunization" limit 10;"

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Let’s start by granting QuickSight access to the S3 bucket where your AWS Clean Rooms query results are stored.

Grant QuickSight access to Athena and your S3 bucket

First, grant QuickSight access to the S3 bucket:

  1. Sign in to the QuickSight console.
  2. Choose your user name, then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. For QuickSight access to AWS services, choose Manage.
  5. For Amazon S3, choose Select S3 buckets, and choose the S3 bucket named cleanrooms-query-execution-results -XX-XXXX-XXXXXXXXXXXX (XXXXX represents the AWS Region and account number where the solution is deployed).
  6. Choose Save.

Create your datasets and publish visuals

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena tables.

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Select Athena.
  4. Enter a name for your dataset.
  5. Choose Create data source.
  6. Choose the AWS Glue database cleanrooms_patientdb and select the table PatientImmunization.
  7. Select Directly query your data.
  8. Choose Visualize.

  1. On the Analysis tab, choose the visual type of your choice and add visuals.

Clean up

Complete the following steps to clean up your resources when you no longer need this solution:

  1. Manually delete the S3 buckets and the data stored in the bucket.
  2. Delete the CloudFormation templates.
  3. Delete the QuickSight analysis.
  4. Delete the data source.


In this post, we demonstrated how to automate running AWS Clean Rooms queries using an API call from Step Functions. We also showed how to update the query results information on the existing AWS Glue table, query the information using Athena, and create visuals using QuickSight.

The automated workflow solution delivers real-time insights from AWS Clean Rooms collaborations to decision makers through automated checks for new outputs, processing, and Amazon QuickSight dashboard refreshes. This eliminates manual handling tasks, enabling faster data-driven decisions based on latest analyses. Additionally, automation frees up staff resources to focus on more strategic initiatives rather than repetitive updates.

Contact the public sector team directly to learn more about how to set up this solution, or reach out to your AWS account team to engage on a proof of concept of this solution for your organization.

About AWS Clean Rooms

AWS Clean Rooms helps companies and their partners more easily and securely analyze and collaborate on their collective datasets—without sharing or copying one another’s underlying data. With AWS Clean Rooms, you can create a secure data clean room in minutes, and collaborate with any other company on the AWS Cloud to generate unique insights about advertising campaigns, investment decisions, and research and development.

The AWS Clean Rooms team is continually building new features to help you collaborate. Watch this video to learn more about privacy-enhanced collaboration with AWS Clean Rooms.

Check out more AWS Partners or contact an AWS Representative to know how we can help accelerate your business.

Additional resources

About the Authors

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Jim Daniel is the Public Health lead at Amazon Web Services. Previously, he held positions with the United States Department of Health and Human Services for nearly a decade, including Director of Public Health Innovation and Public Health Coordinator. Before his government service, Jim served as the Chief Information Officer for the Massachusetts Department of Public Health.

Identify Java nested dependencies with Amazon Inspector SBOM Generator

Post Syndicated from Chi Tran original https://aws.amazon.com/blogs/security/identify-java-nested-dependencies-with-amazon-inspector-sbom-generator/

Amazon Inspector is an automated vulnerability management service that continually scans Amazon Web Services (AWS) workloads for software vulnerabilities and unintended network exposure. Amazon Inspector currently supports vulnerability reporting for Amazon Elastic Compute Cloud (Amazon EC2) instances, container images stored in Amazon Elastic Container Registry (Amazon ECR), and AWS Lambda.

Java archive files (JAR, WAR, and EAR) are widely used for packaging Java applications and libraries. These files can contain various dependencies that are required for the proper functioning of the application. In some cases, a JAR file might include other JAR files within its structure, leading to nested dependencies. To help maintain the security and stability of Java applications, you must identify and manage nested dependencies.

In this post, I will show you how to navigate the challenges of uncovering nested Java dependencies, guiding you through the process of analyzing JAR files and uncovering these dependencies. We will focus on the vulnerabilities that Amazon Inspector identifies using the Amazon Inspector SBOM Generator.

The challenge of uncovering nested Java dependencies

Nested dependencies in Java applications can be outdated or contain known vulnerabilities linked to Common Vulnerabilities and Exposures (CVEs). A crucial issue that customers face is the tendency to overlook nested dependencies during analysis and triage. This oversight can lead to the misclassification of vulnerabilities as false positives, posing a security risk.

This challenge arises from several factors:

  • Volume of vulnerabilities — When customers encounter a high volume of vulnerabilities, the sheer number can be overwhelming, making it challenging to dedicate sufficient time and resources to thoroughly analyze each one.
  • Lack of tools or insufficient tooling — There is often a gap in the available tools to effectively identify nested dependencies (for example, mvn dependency:tree, OWASP Dependency-Check). Without the right tools, customers can miss critical dependencies hidden deep within their applications.
  • Understanding the complexity — Understanding the intricate web of nested dependencies requires a specific skill set and knowledge base. Deficits in this area can hinder effective analysis and risk mitigation.

Overview of nested dependencies

Nested dependencies occur when a library or module that is required by your application relies on additional libraries or modules. This is a common scenario in modern software development because developers often use third-party libraries to build upon existing solutions and to benefit from the collective knowledge of the open-source community.

In the context of JAR files, nested dependencies can arise when a JAR file includes other JAR files as part of its structure. These nested files can have their own dependencies, which might further depend on other libraries, creating a chain of dependencies. Nested dependencies help to modularize code and promote code reuse, but they can introduce complexity and increase the potential for security vulnerabilities if they aren’t managed properly.

Why it’s important to know what dependencies are consumed in a JAR file

Consider the following examples, which depict a typical file structure of a Java application to illustrate how nested dependencies are organized:

Example 1 — Log4J dependency

|-- mywebapp-1.0-SNAPSHOT.jar
|   |-- spring-boot-3.0.2.jar
|   |   |-- spring-boot-autoconfigure-3.0.2.jar
|   |   |   |-- ...
|   |   |   |   |-- log4j-to-slf4j.jar

This structure includes the following files and dependencies:

  • mywebapp-1.0-SNAPSHOT.jar is the main application JAR file.
  • Within mywebapp-1.0-SNAPSHOT.jar, there’s spring-boot-3.0.2.jar, which is a dependency of the main application.
  • Nested within spring-boot-3.0.2.jar, there’s spring-boot-autoconfigure-3.0.2.jar, a transitive dependency.
  • Within spring-boot-autoconfigure-3.0.2.jar, there’s log4j-to-slf4j.jar, which is our nested Log4J dependency.

This structure illustrates how a Java application might include nested dependencies, with Log4J nested within other libraries. The actual nesting and dependencies will vary based on the specific libraries and versions that you use in your project.

Example 2 — Jackson dependency

|-- myfinanceapp-2.5.jar
|   |-- jackson-databind-2.9.10.jar
|   |   |-- jackson-core-2.9.10.jar
|   |   |   |-- ...
|   |   |-- jackson-annotations-2.9.10.jar
|   |   |   |-- ...

This structure includes the following files and dependencies:

  • myfinanceapp-2.5.jar is the primary application JAR file.
  • Within myfinanceapp-2.5.jar, there is jackson-databind-, which is a library that the main application relies on for JSON processing.
  • Nested within jackson-databind-, there are other Jackson components such as jackson-core-2.9.10.jar and jackson-annotations-2.9.10.jar. These are dependencies that jackson-databind itself requires to function.

This structure is an example for Java applications that use Jackson for JSON operations. Because Jackson libraries are frequently updated to address various issues, including performance optimizations and security fixes, developers need to be aware of these nested dependencies to keep their applications up-to-date and secure. If you have detailed knowledge of where these components are nested within your application, it will be easier to maintain and upgrade them.

Example 3 — Hibernate dependency

|-- myerpsystem-3.1.jar
|   |-- hibernate-core-5.4.18.Final.jar
|   |   |-- hibernate-validator-6.1.5.Final.jar
|   |   |   |-- ...
|   |   |-- hibernate-entitymanager-5.4.18.Final.jar
|   |   |   |-- ...

This structure includes the following files and dependencies:

  • myerpsystem-3.1.jar as the primary JAR file of the application.
  • Within myerpsystem-3.1.jar, hibernate-core-5.4.18.Final.jar serves as a dependency for object-relational mapping (ORM) capabilities.
  • Nested dependencies such as hibernate-validator-6.1.5.Final.jar and hibernate-entitymanager-5.4.18.Final.jar are crucial for the validation and entity management functionalities that Hibernate provides.

In instances where MyERPSystem encounters operational issues due to a mismatch between the Hibernate versions and another library (that is, a newer version of Spring expecting a different version of Hibernate), developers can use the detailed insights that Amazon Inspector SBOM Generator provides. This tool helps quickly pinpoint the exact versions of Hibernate and its nested dependencies, facilitating a faster resolution to compatibility problems.

Here are some reasons why it’s important to understand the dependencies that are consumed within a JAR file:

  • Security — Nested dependencies can introduce vulnerabilities if they are outdated or have known security issues. A prime example is the Log4J vulnerability discovered in late 2021 (CVE-2021-44228). This vulnerability was critical because Log4J is a widely used logging framework, and threat actors could have exploited the flaw remotely, leading to serious consequences. What exacerbated the issue was the fact that Log4J often existed as a nested dependency in various Java applications (see Example 1), making it difficult for organizations to identify and patch each instance.
  • Compliance — Many organizations must adhere to strict policies about third-party libraries for licensing, regulatory, or security reasons. Not knowing the dependencies, especially nested ones such as in the Log4J case, can lead to non-compliance with these policies.
  • Maintainability — It’s crucial that you stay informed about the dependencies within your project for timely updates or replacements. Consider the Jackson library (Example 2), which is often updated to introduce new features or to patch security vulnerabilities. Managing these updates can be complex, especially when the library is a nested dependency.
  • Troubleshooting — Identifying dependencies plays a critical role in resolving operational issues swiftly. An example of this is addressing compatibility issues between Hibernate and other Java libraries or frameworks within your application due to version mismatches (Example 3). Such problems often manifest as unexpected exceptions or degraded performance, so you need to have a precise understanding of the libraries involved.

These examples underscore that you need to have deep visibility into JAR file contents to help protect against immediate threats and help ensure long-term application health and compliance.

Existing tooling limitations

When analyzing Java applications for nested dependencies, one of the main challenges is that existing tools can’t efficiently narrow down the exact location of these dependencies. This issue is particularly evident with tools such as mvn dependency:tree, OWASP Dependency-Check, and similar dependency analysis solutions.

Although tools are available to analyze Java applications for nested dependencies, they often fall short in several key areas. The following points highlight common limitations of these tools:

  • Inadequate depth in dependency trees — Although other tools provide a hierarchical view of project dependencies, they often fail to delve deep enough to reveal nested dependencies, particularly those that are embedded within other JAR files as nested dependencies. Nested dependencies are repackaged within a library and aren’t immediately visible in the standard dependency tree.
  • Lack of specific location details — These tools typically don’t offer the granularity needed to pinpoint the exact location of a nested dependency within a JAR file. For large and complex Java applications, it may be challenging to identify and address specific dependencies, especially when they are deeply embedded.
  • Complexity in large projects — In projects with a vast and intricate network of dependencies, these tools can struggle to provide clear and actionable insights. The output can be complicated and difficult to navigate, leaving customers without a clear path to identifying critical dependencies.

Address tooling limitations with Amazon Inspector SBOM Generator

The Amazon Inspector SBOM Generator (Sbomgen) introduces a significant advancement in the identification of nested dependencies in Java applications. Although the concept of monitoring dependencies is well-established in software development, AWS has tailored this tool to enhance visibility into the complexities of software compositions. By generating a software bill of materials (SBOM) for a container image, Sbomgen provides a detailed inventory of the software installed on a system, including hidden nested dependencies that traditional tools can overlook. This capability enriches the existing toolkit, offering a more granular and actionable understanding of the dependency structure of your applications.

Sbomgen works by scanning for files that contain information about installed packages. Upon finding such files, it extracts essential data such as package names, versions, and other metadata. Then it transforms this metadata into a CycloneDX SBOM, providing a structured and detailed view of the dependencies.

For information about how to install Sbomgen, see Installing Amazon Inspector SBOM Generator (Sbomgen)

A key feature of Sbomgen is its ability to provide explicit paths to each dependency.

For example, given a compiled jar application MyWebApp-0.0.1-SNAPSHOT.jar, users can run the following CLI command with Sbomgen:

./inspector-sbomgen localhost --path /path/to/MyWebApp-0.0.1-SNAPSHOT.jar --scanners java-jar

The output should look similar to the following:

  "bom-ref": "comp-11",
  "type": "library",
  "name": "org.apache.logging.log4j/log4j-to-slf4j",
  "version": "2.19.0",
  "hashes": [
      "alg": "SHA-1",
      "content": "30f4812e43172ecca5041da2cb6b965cc4777c19"
  "purl": "pkg:maven/org.apache.logging.log4j/[email protected]",
  "properties": [
      "name": "amazon:inspector:sbom_generator:source_path",
      "value": "/tmp/MyWebApp-0.0.1-SNAPSHOT.jar/BOOT-INF/lib/spring-boot-3.0.2.jar/BOOT-INF/lib/spring-boot-autoconfigure-3.0.2.jar/BOOT-INF/lib/logback-classic-1.4.5.jar/BOOT-INF/lib/logback-core-1.4.5.jar/BOOT-INF/lib/log4j-to-slf4j-2.19.0.jar/META-INF/maven/org.apache.logging.log4j/log4j-to-slf4j/pom.properties"

In this output, the amazon:inspector:sbom_collector:path property is particularly significant. It provides a clear and complete path to the location of the specific dependency (in this case, log4j-to-slf4j) within the application’s structure. This level of detail is crucial for several reasons:

  • Precise location identification — It helps you quickly and accurately identify the exact location of each dependency, which is especially useful for nested dependencies that are otherwise hard to locate.
  • Effective risk management — When you know the exact path of dependencies, you can more efficiently assess and address security risks associated with these dependencies.
  • Time and resource efficiency — It reduces the time and resources needed to manually trace and analyze dependencies, streamlining the vulnerability management process.
  • Enhanced visibility and transparency — It provides a clearer understanding of the application’s dependency structure, contributing to better overall management and maintenance.
  • Comprehensive package information — The detailed package information, including name, version, hashes, and package URL, of Sbomgen equips you with a thorough understanding of each dependency’s specifics, aiding in precise vulnerability tracking and software integrity verification.

Mitigate vulnerable dependencies

After you identify the nested dependencies in your Java JAR files, you should verify whether these dependencies are outdated or vulnerable. Amazon Inspector can help you achieve this by doing the following:

  • Comparing the discovered dependencies against a database of known vulnerabilities.
  • Providing a list of potentially vulnerable dependencies, along with detailed information about the associated CVEs.
  • Offering recommendations on how to mitigate the risks, such as updating the dependencies to a newer, more secure version.

By integrating Amazon Inspector into your software development lifecycle, you can continuously monitor your Java applications for vulnerable nested dependencies and take the necessary steps to help ensure that your application remains secure and compliant.


To help secure your Java applications, you must manage nested dependencies. Amazon Inspector provides an automated and efficient way to discover and mitigate potentially vulnerable dependencies in JAR files. By using the capabilities of Amazon Inspector, you can help improve the security posture of your Java applications and help ensure that they adhere to best practices.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Chi Tran

Chi Tran

Chi is a Security Researcher who helps ensure that AWS services, applications, and websites are designed and implemented to the highest security standards. He’s a SME for Amazon Inspector and enthusiastically assists customers with advanced issues and use cases. Chi is passionate about information security — API security, penetration testing (he’s OSCP, OSCE, OSWE, GPEN certified), application security, and cloud security.

How to enforce creation of roles in a specific path: Use IAM role naming in hierarchy models

Post Syndicated from Varun Sharma original https://aws.amazon.com/blogs/security/how-to-enforce-creation-of-roles-in-a-specific-path-use-iam-role-naming-in-hierarchy-models/

An AWS Identity and Access Management (IAM) role is an IAM identity that you create in your AWS account that has specific permissions. An IAM role is similar to an IAM user because it’s an AWS identity with permission policies that determine what the identity can and cannot do on AWS. However, as outlined in security best practices in IAM, AWS recommends that you use IAM roles instead of IAM users. An IAM user is uniquely associated with one person, while a role is intended to be assumable by anyone who needs it. An IAM role doesn’t have standard long-term credentials such as a password or access keys associated with it. Instead, when you assume a role, it provides you with temporary security credentials for your role session that are only valid for certain period of time.

This blog post explores the effective implementation of security controls within IAM roles, placing a specific focus on the IAM role’s path feature. By organizing IAM roles hierarchically using paths, you can address key challenges and achieve practical solutions to enhance IAM role management.

Benefits of using IAM paths

A fundamental benefit of using paths is the establishment of a clear and organized organizational structure. By using paths, you can handle diverse use cases while creating a well-defined framework for organizing roles on AWS. This organizational clarity can help you navigate complex IAM setups and establish a cohesive structure that’s aligned with your organizational needs.

Furthermore, by enforcing a specific structure, you can gain precise control over the scope of permissions assigned to roles, helping to reduce the risk of accidental assignment of overly permissive policies. By assisting in preventing inadvertent policy misconfigurations and assisting in coordinating permissions with the planned organizational structure, this proactive solution improves security. This approach is highly effective when you consistently apply established naming conventions to paths, role names, and policies. Enforcing a uniform approach to role naming enhances the standardization and efficiency of IAM role management. This practice fosters smooth collaboration and reduces the risk of naming conflicts.

Path example

In IAM, a role path is a way to organize and group IAM roles within your AWS account. You specify the role path as part of the role’s Amazon Resource Name (ARN).

As an example, imagine that you have a group of IAM roles related to development teams, and you want to organize them under a path. You might structure it like this:

Role name: Dev App1 admin
Role path: /D1/app1/admin/
Full ARN: arn:aws:iam::123456789012:role/D1/app1/admin/DevApp1admin

Role name: Dev App2 admin
Role path: /D2/app2/admin/
Full ARN: arn:aws:iam::123456789012:role/D2/app2/admin/DevApp2admin

In this example, the IAM roles DevApp1admin and DevApp2admin are organized under two different development team paths: D1/app1/admin and D2/app2/admin, respectively. The role path provides a way to group roles logically, making it simpler to manage and understand their purpose within the context of your organization.

Solution overview

Figure 1: Sample architecture

Figure 1: Sample architecture

The sample architecture in Figure 1 shows how you can separate and categorize the enterprise roles and development team roles into a hierarchy model by using a path in an IAM role. Using this hierarchy model, you can enable several security controls at the level of the service control policy (SCP), IAM policy, permissions boundary, or the pipeline. I recommend that you avoid incorporating business unit names in paths because they could change over time.

Here is what the IAM role path looks like as an ARN:


In this example, in the resource name, /EnT/iam/adm/ is the role path, and IAMAdmin is the role name.

You can now use the role path as part of a policy, such as the following:


In this example, in the resource name, /EnT/iam/adm/ is the role path, and * indicates any IAM role inside this path.

Walkthrough of examples for preventative controls

Now let’s walk through some example use cases and SCPs for a preventative control that you can use based on the path of an IAM role.

PassRole preventative control example

The following SCP denies passing a role for enterprise roles, except for roles that are part of the IAM admin hierarchy within the overall enterprise hierarchy.

	"Version": "2012-10-17",
	"Statement": [
			"Sid": "DenyEnTPassRole",
			"Effect": "Deny",
			"Action": "iam:PassRole",
			"Resource": "arn:aws:iam::*:role/EnT/*",
			"Condition": {
				"ArnNotLike": {
					"aws:PrincipalArn": "arn:aws:iam::*:role/EnT/fed/iam/*"

With just a couple of statements in the SCP, this preventative control helps provide protection to your high-privilege roles for enterprise roles, regardless of the role’s name or current status.

This example uses the following paths:

  • /EnT/ — enterprise roles (roles owned by the central teams, such as cloud center of excellence, central security, and networking teams)
  • /fed/ — federated roles, which have interactive access
  • /iam/ — roles that are allowed to perform IAM actions, such as CreateRole, AttachPolicy, or DeleteRole

IAM actions preventative control example

The following SCP restricts IAM actions, including CreateRole, DeleteRole, AttachRolePolicy, and DetachRolePolicy, on the enterprise path.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "DenyIAMActionsonEnTRoles",
            "Effect": "Deny",
            "Action": [
            "Resource": "arn:aws:iam::*:role/EnT/*",
            "Condition": {
                "ArnNotLike": {
                    "aws:PrincipalArn": "arn:aws:iam::*:role/EnT/fed/iam/*"

This preventative control denies an IAM role that is outside of the enterprise hierarchy from performing the actions CreateRole, DeleteRole, DetachRolePolicy, and AttachRolePolicy in this hierarchy. Every IAM role will be denied those API actions except the one with the path as arn:aws:iam::*:role/EnT/fed/iam/*

The example uses the following paths:

  • /EnT/ — enterprise roles (roles owned by the central teams, such as cloud center of excellence, central security, or network automation teams)
  • /fed/ — federated roles, which have interactive access
  • /iam/ — roles that are allowed to perform IAM actions (in this case, CreateRole, DeteleRole, DetachRolePolicy, and AttachRolePolicy)

IAM policies preventative control example

The following SCP policy denies attaching certain high-privilege AWS managed policies such as AdministratorAccess outside of certain IAM admin roles. This is especially important in an environment where business units have self-service capabilities.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "RolePolicyAttachment",
            "Effect": "Deny",
            "Action": "iam:AttachRolePolicy",
            "Resource": "arn:aws:iam::*:role/EnT/fed/iam/*",
            "Condition": {
                "ArnNotLike": {
                    "iam:PolicyARN": "arn:aws:iam::aws:policy/AdministratorAccess"

AssumeRole preventative control example

The following SCP doesn’t allow non-production roles to assume a role in production accounts. Make sure to replace <Your production OU ID> and <your org ID> with your own information.

	"Version": "2012-10-17",
	"Statement": [
			"Sid": "DenyAssumeRole",
			"Effect": "Deny",
			"Action": "sts:AssumeRole",
			"Resource": "*",
			"Condition": {
				"StringLike": {
					"aws:PrincipalArn": "arn:aws:iam::*:role/np/*"
				"ForAnyValue:StringLike": {
					"aws:ResourceOrgPaths": "<your org ID>/r-xxxx/<Your production OU ID>/*"

This example uses the /np/ path, which specifies non-production roles. The SCP denies non-production IAM roles from assuming a role in the production organizational unit (OU) (in our example, this is represented by <your org ID>/r-xxxx/<Your production OU ID>/*”). Depending on the structure of your organization, the ResourceOrgPaths will have one of the following formats:

  • “o-a1b2c3d4e5/*”
  • “o-a1b2c3d4e5/r-ab12/ou-ab12-11111111/*”
  • “o-a1b2c3d4e5/r-ab12/ou-ab12-11111111/ou-ab12-22222222/”

Walkthrough of examples for monitoring IAM roles (detective control)

Now let’s walk through two examples of detective controls.

AssumeRole in CloudTrail Lake

The following is an example of a detective control to monitor IAM roles in AWS CloudTrail Lake.

    userIdentity.arn as "Username", eventTime, eventSource, eventName, sourceIPAddress, errorCode, errorMessage
    <Event data store ID>
    userIdentity.arn IS NOT NULL
    AND eventName = 'AssumeRole'
    AND userIdentity.arn LIKE '%/np/%'
    AND errorCode = 'AccessDenied'
    AND eventTime > '2023-07-01 14:00:00'
    AND eventTime < '2023-11-08 18:00:00';

This query lists out AssumeRole events for non-production roles in the organization for AccessDenied errors. The output is stored in an Amazon Simple Storage Service (Amazon S3) bucket from CloudTrail Lake, from which the csv file can be downloaded. The following shows some example output:

arn:aws:sts::123456789012:assumed-role/np/test,2023-12-09 10:35:45.000,iam.amazonaws.com,AssumeRole,,AccessDenied,User: arn:aws:sts::123456789012:assumed-role/np/test is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::123456789012:role/hello because no identity-based policy allows the sts:AssumeRole action

You can modify the query to audit production roles as well.

CreateRole in CloudTrail Lake

Another example of a CloudTrail Lake query for a detective control is as follows:

    userIdentity.arn as "Username", eventTime, eventSource, eventName, sourceIPAddress, errorCode, errorMessage
    <Event data store ID>
    userIdentity.arn IS NOT NULL
    AND eventName = 'CreateRole'
    AND userIdentity.arn LIKE '%/EnT/fed/iam/%'
    AND eventTime > '2023-07-01 14:00:00'
    AND eventTime < '2023-11-08 18:00:00';

This query lists out CreateRole events for roles in the /EnT/fed/iam/ hierarchy. The following are some example outputs:


arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test,2023-12-09 16:31:11.000,iam.amazonaws.com,CreateRole,,AccessDenied,User: arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::123456789012:role/EnT/fed/iam/security because no identity-based policy allows the iam:CreateRole action

arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test,2023-12-09 16:33:10.000,iam.amazonaws.com,CreateRole,,AccessDenied,User: arn:aws:sts::123456789012:assumed-role/EnT/fed/iam/security/test is not authorized to perform: iam:CreateRole on resource: arn:aws:iam::123456789012:role/EnT/fed/iam/security because no identity-based policy allows the iam:CreateRole action

Because these roles can create additional enterprise roles, you should audit roles created in this hierarchy.

Important considerations

When you implement specific paths for IAM roles, make sure to consider the following:

  • The path of an IAM role is part of the ARN. After you define the ARN, you can’t change it later. Therefore, just like the name of the role, consider what the path should be during the early discussions of design.
  • IAM roles can’t have the same name, even on different paths.
  • When you switch roles through the console, you need to include the path because it’s part of the role’s ARN.
  • The path of an IAM role can’t exceed 512 characters. For more information, see IAM and AWS STS quotas.
  • The role name can’t exceed 64 characters. If you intend to use a role with the Switch Role feature in the AWS Management Console, then the combined path and role name can’t exceed 64 characters.
  • When you create a role through the console, you can’t set an IAM role path. To set a path for the role, you need to use automation, such as AWS Command Line Interface (AWS CLI) commands or SDKs. For example, you might use an AWS CloudFormation template or a script that interacts with AWS APIs to create the role with the desired path.


By adopting the path strategy, you can structure IAM roles within a hierarchical model, facilitating the implementation of security controls on a scalable level. You can make these controls effective for IAM roles by applying them to a path rather than specific roles, which sets this approach apart.

This strategy can help you elevate your overall security posture within IAM, offering a forward-looking solution for enterprises. By establishing a scalable IAM hierarchy, you can help your organization navigate dynamic changes through a robust identity management structure. A well-crafted hierarchy reduces operational overhead by providing a versatile framework that makes it simpler to add or modify roles and policies. This scalability can help streamline the administration of IAM and help your organization manage access control in evolving environments.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Varun Sharma

Varun Sharma

Varun is an AWS Cloud Security Engineer who wears his security cape proudly. With a knack for unravelling the mysteries of Amazon Cognito and IAM, Varun is a go-to subject matter expert for these services. When he’s not busy securing the cloud, you’ll find him in the world of security penetration testing. And when the pixels are at rest, Varun switches gears to capture the beauty of nature through the lens of his camera.

Use multiple bookmark keys in AWS Glue JDBC jobs

Post Syndicated from Durga Prasad original https://aws.amazon.com/blogs/big-data/use-multiple-bookmark-keys-in-aws-glue-jdbc-jobs/

AWS Glue is a serverless data integrating service that you can use to catalog data and prepare for analytics. With AWS Glue, you can discover your data, develop scripts to transform sources into targets, and schedule and run extract, transform, and load (ETL) jobs in a serverless environment. AWS Glue jobs are responsible for running the data processing logic.

One important feature of AWS Glue jobs is the ability to use bookmark keys to process data incrementally. When an AWS Glue job is run, it reads data from a data source and processes it. One or more columns from the source table can be specified as bookmark keys. The column should have sequentially increasing or decreasing values without gaps. These values are used to mark the last processed record in a batch. The next run of the job resumes from that point. This allows you to process large amounts of data incrementally. Without job bookmark keys, AWS Glue jobs would have to reprocess all the data during every run. This can be time-consuming and costly. By using bookmark keys, AWS Glue jobs can resume processing from where they left off, saving time and reducing costs.

This post explains how to use multiple columns as job bookmark keys in an AWS Glue job with a JDBC connection to the source data store. It also demonstrates how to parameterize the bookmark key columns and table names in the AWS Glue job connection options.

This post is focused towards architects and data engineers who design and build ETL pipelines on AWS. You are expected to have a basic understanding of the AWS Management Console, AWS Glue, Amazon Relational Database Service (Amazon RDS), and Amazon CloudWatch logs.

Solution overview

To implement this solution, we complete the following steps:

  1. Create an Amazon RDS for PostgreSQL instance.
  2. Create two tables and insert sample data.
  3. Create and run an AWS Glue job to extract data from the RDS for PostgreSQL DB instance using multiple job bookmark keys.
  4. Create and run a parameterized AWS Glue job to extract data from different tables with separate bookmark keys

The following diagram illustrates the components of this solution.

Deploy the solution

For this solution, we provide an AWS CloudFormation template that sets up the services included in the architecture, to enable repeatable deployments. This template creates the following resources:

  • An RDS for PostgreSQL instance
  • An Amazon Simple Storage Service (Amazon S3) bucket to store the data extracted from the RDS for PostgreSQL instance
  • An AWS Identity and Access Management (IAM) role for AWS Glue
  • Two AWS Glue jobs with job bookmarks enabled to incrementally extract data from the RDS for PostgreSQL instance

To deploy the solution, complete the following steps:

  1. Choose  to launch the CloudFormation stack:
  2. Enter a stack name.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.
  5. Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.
  6. When the stack is complete, copy the AWS Glue scripts to the S3 bucket job-bookmark-keys-demo-<accountid>.
  7. Open AWS CloudShell.
  8. Run the following commands and replace <accountid> with your AWS account ID:
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_1_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_1_job.py
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_2_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_2_job.py

Add sample data and run AWS Glue jobs

In this section, we connect to the RDS for PostgreSQL instance via AWS Lambda and create two tables. We also insert sample data into both the tables.

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the function LambdaRDSDDLExecute.
  3. Choose Test and choose Invoke for the Lambda function to insert the data.

The two tables product and address will be created with sample data, as shown in the following screenshot.

Run the multiple_job_bookmark_keys AWS Glue job

We run the multiple_job_bookmark_keys AWS Glue job twice to extract data from the product table of the RDS for PostgreSQL instance. In the first run, all the existing records will be extracted. Then we insert new records and run the job again. The job should extract only the newly inserted records in the second run.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job multiple_job_bookmark_keys.
  3. Choose Run to run the job and choose the Runs tab to monitor the job progress.
  4. Choose the Output logs hyperlink under CloudWatch logs after the job is complete.
  5. Choose the log stream in the next window to see the output logs printed.

    The AWS Glue job extracted all records from the source table product. It keeps track of the last combination of values in the columns product_id and version.Next, we run another Lambda function to insert a new record. The product_id 45 already exists, but the inserted record will have a new version as 2, making the combination sequentially increasing.
  6. Run the LambdaRDSDDLExecute_incremental Lambda function to insert the new record in the product table.
  7. Run the AWS Glue job multiple_job_bookmark_keys again after you insert the record and wait for it to succeed.
  8. Choose the Output logs hyperlink under CloudWatch logs.
  9. Choose the log stream in the next window to see only the newly inserted record printed.

The job extracts only those records that have a combination greater than the previously extracted records.

Run the parameterised_job_bookmark_keys AWS Glue job

We now run the parameterized AWS Glue job that takes the table name and bookmark key column as parameters. We run this job to extract data from different tables maintaining separate bookmarks.

The first run will be for the address table with bookmarkkey as address_id. These are already populated with the job parameters.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job parameterised_job_bookmark_keys.
  3. Choose Run to run the job and choose the Runs tab to monitor the job progress.
  4. Choose the Output logs hyperlink under CloudWatch logs after the job is complete.
  5. Choose the log stream in the next window to see all records from the address table printed.
  6. On the Actions menu, choose Run with parameters.
  7. Expand the Job parameters section.
  8. Change the job parameter values as follows:
    • Key --bookmarkkey with value product_id
    • Key --table_name with value product
    • The S3 bucket name is unchanged (job-bookmark-keys-demo-<accountnumber>)
  9. Choose Run job to run the job and choose the Runs tab to monitor the job progress.
  10. Choose the Output logs hyperlink under CloudWatch logs after the job is complete.
  11. Choose the log stream to see all the records from the product table printed.

The job maintains separate bookmarks for each of the tables when extracting the data from the source data store. This is achieved by adding the table name to the job name and transformation contexts in the AWS Glue job script.

Clean up

To avoid incurring future charges, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket with job-bookmark-keys in its name.
  3. Choose Empty to delete all the files and folders in it.
  4. On the CloudFormation console, choose Stacks in the navigation pane.
  5. Select the stack you created to deploy the solution and choose Delete.


This post demonstrated passing more than one column of a table as jobBookmarkKeys in a JDBC connection to an AWS Glue job. It also explained how you can a parameterized AWS Glue job to extract data from multiple tables while keeping their respective bookmarks. As a next step, you can test the incremental data extract by changing data in the source tables.

About the Authors

Durga Prasad is a Sr Lead Consultant enabling customers build their Data Analytics solutions on AWS. He is a coffee lover and enjoys playing badminton.

Murali Reddy is a Lead Consultant at Amazon Web Services (AWS), helping customers build and implement data analytics solution. When he’s not working, Murali is an avid bike rider and loves exploring new places.

Combine transactional, streaming, and third-party data on Amazon Redshift for financial services

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/combine-transactional-streaming-and-third-party-data-on-amazon-redshift-for-financial-services/

Financial services customers are using data from different sources that originate at different frequencies, which includes real time, batch, and archived datasets. Additionally, they need streaming architectures to handle growing trade volumes, market volatility, and regulatory demands. The following are some of the key business use cases that highlight this need:

  • Trade reporting – Since the global financial crisis of 2007–2008, regulators have increased their demands and scrutiny on regulatory reporting. Regulators have placed an increased focus to both protect the consumer through transaction reporting (typically T+1, meaning 1 business day after the trade date) and increase transparency into markets via near-real-time trade reporting requirements.
  • Risk management – As capital markets become more complex and regulators launch new risk frameworks, such as Fundamental Review of the Trading Book (FRTB) and Basel III, financial institutions are looking to increase the frequency of calculations for overall market risk, liquidity risk, counter-party risk, and other risk measurements, and want to get as close to real-time calculations as possible.
  • Trade quality and optimization – In order to monitor and optimize trade quality, you need to continually evaluate market characteristics such as volume, direction, market depth, fill rate, and other benchmarks related to the completion of trades. Trade quality is not only related to broker performance, but is also a requirement from regulators, starting with MIFID II.

The challenge is to come up with a solution that can handle these disparate sources, varied frequencies, and low-latency consumption requirements. The solution should be scalable, cost-efficient, and straightforward to adopt and operate. Amazon Redshift features like streaming ingestion, Amazon Aurora zero-ETL integration, and data sharing with AWS Data Exchange enable near-real-time processing for trade reporting, risk management, and trade optimization.

In this post, we provide a solution architecture that describes how you can process data from three different types of sources—streaming, transactional, and third-party reference data—and aggregate them in Amazon Redshift for business intelligence (BI) reporting.

Solution overview

This solution architecture is created prioritizing a low-code/no-code approach with the following guiding principles:

  • Ease of use – It should be less complex to implement and operate with intuitive user interfaces
  • Scalable – You should be able to seamlessly increase and decrease capacity on demand
  • Native integration – Components should integrate without additional connectors or software
  • Cost-efficient – It should deliver balanced price/performance
  • Low maintenance – It should require less management and operational overhead

The following diagram illustrates the solution architecture and how these guiding principles were applied to the ingestion, aggregation, and reporting components.

Deploy the solution

You can use the following AWS CloudFormation template to deploy the solution.

Launch Cloudformation Stack

This stack creates the following resources and necessary permissions to integrate the services:


To ingest data, you use Amazon Redshift Streaming Ingestion to load streaming data from the Kinesis data stream. For transactional data, you use the Redshift zero-ETL integration with Amazon Aurora MySQL. For third-party reference data, you take advantage of AWS Data Exchange data shares. These capabilities allow you to quickly build scalable data pipelines because you can increase the capacity of Kinesis Data Streams shards, compute for zero-ETL sources and targets, and Redshift compute for data shares when your data grows. Redshift streaming ingestion and zero-ETL integration are low-code/no-code solutions that you can build with simple SQLs without investing significant time and money into developing complex custom code.

For the data used to create this solution, we partnered with FactSet, a leading financial data, analytics, and open technology provider. FactSet has several datasets available in the AWS Data Exchange marketplace, which we used for reference data. We also used FactSet’s market data solutions for historical and streaming market quotes and trades.


Data is processed in Amazon Redshift adhering to an extract, load, and transform (ELT) methodology. With virtually unlimited scale and workload isolation, ELT is more suited for cloud data warehouse solutions.

You use Redshift streaming ingestion for real-time ingestion of streaming quotes (bid/ask) from the Kinesis data stream directly into a streaming materialized view and process the data in the next step using PartiQL for parsing the data stream inputs. Note that streaming materialized views differs from regular materialized views in terms of how auto refresh works and the data management SQL commands used. Refer to Streaming ingestion considerations for details.

You use the zero-ETL Aurora integration for ingesting transactional data (trades) from OLTP sources. Refer to Working with zero-ETL integrations for currently supported sources. You can combine data from all these sources using views, and use stored procedures to implement business transformation rules like calculating weighted averages across sectors and exchanges.

Historical trade and quote data volumes are huge and often not queried frequently. You can use Amazon Redshift Spectrum to access this data in place without loading it into Amazon Redshift. You create external tables pointing to data in Amazon Simple Storage Service (Amazon S3) and query similarly to how you query any other local table in Amazon Redshift. Multiple Redshift data warehouses can concurrently query the same datasets in Amazon S3 without the need to make copies of the data for each data warehouse. This feature simplifies accessing external data without writing complex ETL processes and enhances the ease of use of the overall solution.

Let’s review a few sample queries used for analyzing quotes and trades. We use the following tables in the sample queries:

  • dt_hist_quote – Historical quotes data containing bid price and volume, ask price and volume, and exchanges and sectors. You should use relevant datasets in your organization that contain these data attributes.
  • dt_hist_trades – Historical trades data containing traded price, volume, sector, and exchange details. You should use relevant datasets in your organization that contain these data attributes.
  • factset_sector_map – Mapping between sectors and exchanges. You can obtain this from the FactSet Fundamentals ADX dataset.

Sample query for analyzing historical quotes

You can use the following query to find weighted average spreads on quotes:

date_dt :: date,
when exchange_name like 'Cboe%' then 'CBOE'
when (exchange_name) like 'NYSE%' then 'NYSE'
when (exchange_name) like 'New York Stock Exchange' then 'NYSE'
when (exchange_name) like 'Nasdaq%' then 'NASDAQ'
end as parent_exchange_name,
sum(spread * weight)/sum(weight) :: decimal (30,5) as weighted_average_spread
select date_dt,exchange_name,
factset_sector_desc sector_name,
((bid_price*bid_volume) + (ask_price*ask_volume))as weight,
((ask_price - bid_price)/ask_price) as spread
dt_hist_quotes a
fds_adx_fundamentals_db.ref_v2.factset_sector_map b
on(a.sector_code = b.factset_sector_code)
where ask_price <> 0 and bid_price <> 0
group by 1,2,3

Sample query for analyzing historical trades

You can use the following query to find $-volume on trades by detailed exchange, by sector, and by major exchange (NYSE and Nasdaq):

cast(date_dt as date) as date_dt,
when exchange_name like 'Cboe%' then 'CBOE'
when (exchange_name) like 'NYSE%' then 'NYSE'
when (exchange_name) like 'New York Stock Exchange' then 'NYSE'
when (exchange_name) like 'Nasdaq%' then 'NASDAQ'
end as parent_exchange_name,
factset_sector_desc sector_name,
sum((price * volume):: decimal(30,4)) total_transaction_amt
dt_hist_trades a
fds_adx_fundamentals_db.ref_v2.factset_sector_map b
on(a.sector_code = b.factset_sector_code)
group by 1,2,3


You can use Amazon QuickSight and Amazon Managed Grafana for BI and real-time reporting, respectively. These services natively integrate with Amazon Redshift without the need to use additional connectors or software in between.

You can run a direct query from QuickSight for BI reporting and dashboards. With QuickSight, you can also locally store data in the SPICE cache with auto refresh for low latency. Refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters for comprehensive details on how to integrate QuickSight with Amazon Redshift.

You can use Amazon Managed Grafana for near-real-time trade dashboards that are refreshed every few seconds. The real-time dashboards for monitoring the trade ingestion latencies are created using Grafana and the data is sourced from system views in Amazon Redshift. Refer to Using the Amazon Redshift data source to learn about how to configure Amazon Redshift as a data source for Grafana.

The users who interact with regulatory reporting systems include analysts, risk managers, operators, and other personas that support business and technology operations. Apart from generating regulatory reports, these teams require visibility into the health of the reporting systems.

Historical quotes analysis

In this section, we explore some examples of historical quotes analysis from the Amazon QuickSight dashboard.

Weighted average spread by sectors

The following chart shows the daily aggregation by sector of the weighted average bid-ask spreads of all the individual trades on NASDAQ and NYSE for 3 months. To calculate the average daily spread, each spread is weighted by the sum of the bid and the ask dollar volume. The query to generate this chart processes 103 billion of data points in total, joins each trade with the sector reference table, and runs in less than 10 seconds.

Weighted average spread by exchanges

The following chart shows the daily aggregation of the weighted average bid-ask spreads of all the individual trades on NASDAQ and NYSE for 3 months. The calculation methodology and query performance metrics are similar to those of the preceding chart.

Historical trades analysis

In this section, we explore some examples of historical trades analysis from the Amazon QuickSight dashboard.

Trade volumes by sector

The following chart shows the daily aggregation by sector of all the individual trades on NASDAQ and NYSE for 3 months. The query to generate this chart processes 3.6 billion of trades in total, joins each trade with the sector reference table, and runs in under 5 seconds.

Trade volumes for major exchanges

The following chart shows the daily aggregation by exchange group of all the individual trades for 3 months. The query to generate this chart has similar performance metrics as the preceding chart.

Real-time dashboards

Monitoring and observability is an important requirement for any critical business application such as trade reporting, risk management, and trade management systems. Apart from system-level metrics, it’s also important to monitor key performance indicators in real time so that operators can be alerted and respond as soon as possible to business-impacting events. For this demonstration, we have built dashboards in Grafana that monitor the delay of quote and trade data from the Kinesis data stream and Aurora, respectively.

The quote ingestion delay dashboard shows the amount of time it takes for each quote record to be ingested from the data stream and be available for querying in Amazon Redshift.

The trade ingestion delay dashboard shows the amount of time it takes for a transaction in Aurora to become available in Amazon Redshift for querying.

Clean up

To clean up your resources, delete the stack you deployed using AWS CloudFormation. For instructions, refer to Deleting a stack on the AWS CloudFormation console.


Increasing volumes of trading activity, more complex risk management, and enhanced regulatory requirements are leading capital markets firms to embrace real-time and near-real-time data processing, even in mid- and back-office platforms where end of day and overnight processing was the standard. In this post, we demonstrated how you can use Amazon Redshift capabilities for ease of use, low maintenance, and cost-efficiency. We also discussed cross-service integrations to ingest streaming market data, process updates from OLTP databases, and use third-party reference data without having to perform complex and expensive ETL or ELT processing before making the data available for analysis and reporting.

Please reach out to us if you need any guidance in implementing this solution. Refer to Real-time analytics with Amazon Redshift streaming ingestion, Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift, and Working with AWS Data Exchange data shares as a producer for more information.

About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 18 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Alket Memushaj works as a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy for capital markets, working with partners and customers to deploy applications across the trade lifecycle to the AWS Cloud, including market connectivity, trading systems, and pre- and post-trade analytics and research platforms.

Ruben Falk is a Capital Markets Specialist focused on AI and data & analytics. Ruben consults with capital markets participants on modern data architecture and systematic investment processes. He joined AWS from S&P Global Market Intelligence where he was Global Head of Investment Management Solutions.

Jeff Wilson is a World-wide Go-to-market Specialist with 15 years of experience working with analytic platforms. His current focus is sharing the benefits of using Amazon Redshift, Amazon’s native cloud data warehouse. Jeff is based in Florida and has been with AWS since 2019.

Export a Software Bill of Materials using Amazon Inspector

Post Syndicated from Varun Sharma original https://aws.amazon.com/blogs/security/export-a-software-bill-of-materials-using-amazon-inspector/

Amazon Inspector is an automated vulnerability management service that continually scans Amazon Web Services (AWS) workloads for software vulnerabilities and unintended network exposure. Amazon Inspector has expanded capability that allows customers to export a consolidated Software Bill of Materials (SBOM) for supported Amazon Inspector monitored resources, excluding Windows EC2 instances.

Customers have asked us to provide additional software application inventory collected from Amazon Inspector monitored resources. This makes it possible to precisely track the software supply chain and security threats that might be connected to the results of the current Amazon Inspector. Generating an SBOM gives you critical security information that offers you visibility into specifics about your software supply chain, including the packages you use the most frequently and the related vulnerabilities that might affect your whole company.

This blog post includes steps that you can follow to export a consolidated SBOM for the resources monitored by Amazon Inspector across your organization in industry standard formats, including CycloneDx and SPDX. It also shares insights and approaches for analyzing SBOM artifacts using Amazon Athena.


An SBOM is defined as a nested inventory with a list of ingredients that make up software components. Security teams can export a consolidated SBOM to Amazon Simple Storage Service (Amazon S3) for an entire organization from the resource coverage page in the AWS Management Console for Amazon Inspector.

Using CycloneDx and SPDX industry standard formats, you can use insights gained from an SBOM to make decisions such as which software packages need to be updated across your organization or deprecated, if there’s no other option. Individual application or security engineers can also export an SBOM for a single resource or group of resources by applying filters for a specific account, resource type, resource ID, tags, or a combination of these as a part of the SBOM export workflow in the console or application programming interfaces.

Exporting SBOMs

To export Amazon Inspector SBOM reports to an S3 bucket, you must create and configure a bucket in the AWS Region where the SBOM reports are to be exported. You must configure your bucket permissions to allow only Amazon Inspector to put new objects into the bucket. This prevents other AWS services and users from adding objects to the bucket.

Each SBOM report is stored in an S3 bucket and has the name Cyclonedx_1_4 (Json) or Spdx_2_3-compatible (Json), depending on the export format that you specify. You can also use S3 event notifications to alert different operational teams that new SBOM reports have been exported.

Amazon Inspector requires that you use an AWS Key Management Service (AWS KMS) key to encrypt the SBOM report. The key must be a customer managed, symmetric KMS encryption key and must be in the same Region as the S3 bucket that you configured to store the SBOM report. The new KMS key for the SBOM report requires a key policy to be configured to grant permissions for Amazon Inspector to use the key. (Shown in Figure 1.)

Figure 1: Amazon Inspector SBOM export

Figure 1: Amazon Inspector SBOM export

Deploy prerequisites

The AWS CloudFormation template provided creates an S3 bucket with an associated bucket policy to enable Amazon Inspector to export SBOM report objects into the bucket. The template also creates a new KMS key to be used for SBOM report exports and grants the Amazon Inspector service permissions to use the key.

The export can be initiated from the AWS Inspector delegated administrator account or the AWS Inspector administrator account itself. This way, the S3 bucket contains reports for the AWS Inspector member accounts. To export the SBOM reports from Amazon Inspector deployed in the same Region, make sure the CloudFormation template is deployed within the AWS account and Region. If you enabled AWS Inspector in multiple accounts, the CloudFormation stack must be deployed in each Region where AWS Inspector is enabled.

To deploy the CloudFormation template

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Launch Stack

  2. Review the stack name and the parameters (MyKMSKeyName and MyS3BucketName) for the template. Note that the S3 bucket name must be unique.
  3. Choose Next and confirm the stack options.
  4. Go to the next page and choose Submit. The deployment of the CloudFormation stack will take 1–2 minutes.

After the CloudFormation stack has deployed successfully, you can use the S3 bucket and KMS key created by the stack to export SBOM reports.

Export SBOM reports

After setup is complete, you can export SBOM reports to an S3 bucket.

To export SBOM reports from the console

  1. Navigate to the AWS Inspector console in the same Region where the S3 bucket and KMS key were created.
  2. Select Export SBOMs from the navigation pane.
  3. Add filters to create reports for specific subsets of resources. The SBOMs for all active, supported resources are exported if you don’t supply a filter.
  4. Select the export file type you want. Options are Cyclonedx_1_4 (Json) or Spdx_2_3-compatible (Json).
  5. Enter the S3 bucket URI from the output section of the CloudFormation template and enter the KMS key that was created.
  6. Choose Export. It can take 3–5 minutes to complete depending on the number of artifacts to be exported.
Figure 2: SBOM export configuration

Figure 2: SBOM export configuration

When complete, all SBOM artifacts will be in the S3 bucket. This gives you the flexibility to download the SBOM artifacts from the S3 bucket, or you can use Amazon S3 Select to retrieve a subset of data from an object using standard SQL queries.

Figure 3: Amazon S3 Select

Figure 3: Amazon S3 Select

You can also run advanced queries using Amazon Athena or create dashboards using Amazon QuickSight to gain insights and map trends.

Querying and visualization

With Athena, you can run SQL queries on raw data that’s stored in S3 buckets. The Amazon Inspector reports are exported to an S3 bucket, and you can query the data and create tables by following the Adding an AWS Glue crawler tutorial.

To enable AWS Glue to crawl the S3 data, you must add the role as described in the AWS Glue crawler tutorial to the AWS KMS key permissions so that AWS Glue can decrypt the S3 data.

The following is an example policy JSON that you can update for your use case. Make sure to replace the AWS account ID <111122223333> and S3 bucket name <DOC-EXAMPLE-BUCKET-111122223333> with your own information.

    "Sid": "Allow the AWS Glue crawler usage of the KMS key",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::<111122223333>:role/service-role/AWSGlueServiceRole-S3InspectorSBOMReports"
    "Action": [
    "Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET-111122223333>"

Note: The role created for AWS Glue also needs permission to read the S3 bucket where the reports are exported for creating the crawlers. The AWS Glue AWS Identity and Access Management (IAM) role allows the crawler to run and access your Amazon S3 data stores.

After an AWS Glue Data Catalog has been built, you can run the crawler on a scheduled basis to help ensure that it’s kept up to date with the latest Amazon Inspector SBOM manifests as they’re exported into the S3 bucket.

You can further navigate to the added table using the crawler and view the data in Athena. Using Athena, you can run queries against the Amazon Inspector reports to generate output data relevant to your environment. The schema for the generated SBOM report is different depending on the specific resources (Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, Amazon Elastic Container Registry (Amazon ECR)) in the reports. So, depending on the schema, you can create a SQL Athena query to fetch information from the reports.

The following is an Athena example query that identifies the top 10 vulnerabilities for resources in an SBOM report. You can use the common vulnerability and exposures (CVE) IDs from the report to list the individual components affected by the CVEs.

   vuln.id as vuln_id, 
   count(*) as vuln_count
   UNNEST(Inset_table_name.vulnerabilities)as t(vuln)
vuln_count DESC

The following Athena example query can be used to identify the top 10 operating systems (OS) along with the resource types and their count.

   metadata.component.name as os_name,
   count(*) as os_count 
   resource = 'AWS_LAMBDA_FUNCTION'
   os_count DESC 

If you have a package that has a critical vulnerability and you need to know if the package is used as a primary package or adds a dependency, you can use the following Athena sample query to check for the package in your application. In this example, I’m searching for a Log4j package. The result returns account ID, resource type, package_name, and package_count.

   comp.name as package_name,
   count(*) as package_count 
   <Insert_Table _name>,
   UNNEST(<Insert_Table_name>.components) as t(comp) 
   comp.name = 'Log4j' 
   package_count DESC 
LIMIT 10 ;

Note: The sample Athena queries must be customized depending on the schema of the SBOM export report.

To further extend this solution, you can use Amazon QuickSight to produce dashboards to visualize the data by connecting to the AWS Glue table.


The new SBOM generation capabilities in Amazon Inspector improve visibility into the software supply chain by providing a comprehensive list of software packages across multiple levels of dependencies. You can also use SBOMs to monitor the licensing information for each of the software packages and identify potential licensing violations in your organization, helping you avoid potential legal risks.

The most important benefit of SBOM export is to help you comply with industry regulations and standards. By providing an industry-standard format (SPDX and CycloneDX) and enabling easy integration with other tools, systems, or services (such as Nexus IQ and WhiteSource), you can streamline the incident response processes, improve the accuracy and speed of security assessments, and adhere to compliance with regulatory requirements.

In addition to these benefits, the SBOM export feature provides a comprehensive and accurate understanding of the OS packages and software libraries found in their resources, further enhancing your ability to adhere to industry regulations and standards.

If you have feedback about this post, submit comments in the Comments section below. If you have any question/query in regard to information shared in this post, start a new thread on the AWS IAM Identity Center re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Varun Sharma

Varun Sharma

Varun is an AWS Cloud Security Engineer who wears his security cape proudly. With a knack for unravelling the mysteries of Amazon Cognito and IAM, Varun is a go-to subject matter expert for these services. When he’s not busy securing the cloud, you’ll find him in the world of security penetration testing. And when the pixels are at rest, Varun switches gears to capture the beauty of nature through the lens of his camera.

Disaster recovery strategies for Amazon MWAA – Part 1

Post Syndicated from Parnab Basak original https://aws.amazon.com/blogs/big-data/disaster-recovery-strategies-for-amazon-mwaa-part-1/

In the dynamic world of cloud computing, ensuring the resilience and availability of critical applications is paramount. Disaster recovery (DR) is the process by which an organization anticipates and addresses technology-related disasters. For organizations implementing critical workload orchestration using Amazon Managed Workflows for Apache Airflow (Amazon MWAA), it is crucial to have a DR plan in place to ensure business continuity.

In this series, we explore the need for Amazon MWAA disaster recovery and prescribe solutions that will sustain Amazon MWAA environments against unintended disruptions. This lets you to define, avoid, and handle disruption risks as part of your business continuity plan. This post focuses on designing the overall DR architecture. A future post in this series will focus on implementing the individual components using AWS services.

The need for Amazon MWAA disaster recovery

Amazon MWAA, a fully managed service for Apache Airflow, brings immense value to organizations by automating workflow orchestration for extract, transform, and load (ETL), DevOps, and machine learning (ML) workloads. Amazon MWAA has a distributed architecture with multiple components such as scheduler, worker, web server, queue, and database. This makes it difficult to implement a comprehensive DR strategy.

An active Amazon MWAA environment continuously parses Airflow Directed Acyclic Graphs (DAGs), reading them from a configured Amazon Simple Storage Service (Amazon S3) bucket. DAG source unavailability due to network unreachability, unintended corruption, or deletes leads to extended downtime and service disruption.

Within Airflow, the metadata database is a core component storing configuration variables, roles, permissions, and DAG run histories. A healthy metadata database is therefore critical for your Airflow environment. As with any core Airflow component, having a backup and disaster recovery plan in place for the metadata database is essential.

Amazon MWAA deploys Airflow components to multiple Availability Zones within your VPC in your preferred AWS Region. This provides fault tolerance and automatic recovery against a single Availability Zone failure. For mission-critical workloads, being resilient to the impairments of a unitary Region through multi-Region deployments is additionally important to ensure high availability and business continuity.

Balancing between costs to maintain redundant infrastructures, complexity, and recovery time is essential for Amazon MWAA environments. Organizations aim for cost-effective solutions that minimize their Recovery Time Objective (RTO) and Recovery Point Objective (RPO) to meet their service level agreements, be economically viable, and meet their customers’ demands.

Detect disasters in the primary environment: Proactive monitoring through metrics and alarms

Prompt detection of disasters in the primary environment is crucial for timely disaster recovery. Monitoring the Amazon CloudWatch SchedulerHeartbeat metric provides insights into Airflow health of an active Amazon MWAA environment. You can add other health check metrics to the evaluation criteria, such as checking the availability of upstream or downstream systems and network reachability. Combined with CloudWatch alarms, you can send notifications when these thresholds over a number of time periods are not met. You can add alarms to dashboards to monitor and receive alerts about your AWS resources and applications across multiple Regions.

AWS publishes our most up-to-the-minute information on service availability on the Service Health Dashboard. You can check at any time to get current status information, or subscribe to an RSS feed to be notified of interruptions to each individual service in your operating Region. The AWS Health Dashboard provides information about AWS Health events that can affect your account.

By combining metric monitoring, available dashboards, and automatic alarming, you can promptly detect unavailability of your primary environment, enabling proactive measures to transition to your DR plan. It is critical to factor in incident detection, notification, escalation, discovery, and declaration into your DR planning and implementation to provide realistic and achievable objectives that provide business value.

In the following sections, we discuss two Amazon MWAA DR strategy solutions and their architecture.

DR strategy solution 1: Backup and restore

The backup and restore strategy involves generating Airflow component backups in the same or different Region as your primary Amazon MWAA environment. To ensure continuity, you can asynchronously replicate these to your DR Region, with minimal performance impact on your primary Amazon MWAA environment. In the event of a rare primary Regional impairment or service disruption, this strategy will create a new Amazon MWAA environment and recover historical data to it from existing backups. However, it’s important to note that during the recovery process, there will be a period where no Airflow environments are operational to process workflows until the new environment is fully provisioned and marked as available.

This strategy provides a low-cost and low-complexity solution that is also suitable for mitigating against data loss or corruption within your primary Region. The amount of data being backed up and the time to create a new Amazon MWAA environment (typically 20–30 minutes) affects how quickly restoration can happen. To enable infrastructure to be redeployed quickly without errors, deploy using infrastructure as code (IaC). Without IaC, it may be complex to restore an analogous DR environment, which will lead to increased recovery times and possibly exceed your RTO.

Let’s explore the setup required when your primary Amazon MWAA environment is actively running, as shown in the following figure.

Backup and Restore - Pre

The solution comprises three key components. The first component is the primary environment, where the Airflow workflows are initially deployed and actively running. The second component is the disaster monitoring component, comprised of CloudWatch and a combination of an AWS Step Functions state machine and a AWS Lambda function. The third component is for creating and storing backups of all configurations and metadata that is required to restore. This can be in the same Region as your primary or replicated to your DR Region using S3 Cross-Region Replication (CRR). For CRR, you also pay for inter-Region data transfer out from Amazon S3 to each destination Region.

The first three steps in the workflow are as follows:

  1. As part of your backup creation process, Airflow metadata is replicated to an S3 bucket using an export DAG utility, run periodically based on your RPO interval.
  2. Your existing primary Amazon MWAA environment automatically emits the status of its scheduler’s health to the CloudWatch SchedulerHeartbeat metric.
  3. A multi-step Step Functions state machine is triggered from a periodic Amazon EventBridge schedule to monitor the scheduler’s health status. As the primary step of the state machine, a Lambda function evaluates the status of the SchedulerHeartbeat metric. If the metric is deemed healthy, no action is taken.

The following figure illustrates the additional steps in the solution workflow.

Backup and Restore post

  1. When the heartbeat count deviates from the normal count for a period of time, a series of actions are initiated to recover to a new Amazon MWAA environment in the DR Region. These actions include starting creation of a new Amazon MWAA environment, replicating the primary environment configurations, and then waiting for the new environment to become available.
  2. When the environment is available, an import DAG utility is run to restore the metadata contents from the backups. Any DAG runs that were interrupted during the impairment of the primary environment need to be manually rerun to maintain service level agreements. Future DAG runs are queued to run as per their next configured schedule.

DR strategy solution 2: Active-passive environments with periodic data synchronization

The active-passive environments with periodic data synchronization strategy focuses on maintaining recurrent data synchronization between an active primary and a passive Amazon MWAA DR environment. By periodically updating and synchronizing DAG stores and metadata databases, this strategy ensures that the DR environment remains current or nearly current with the primary. The DR Region can be the same or a different Region than your primary Amazon MWAA environment. In the event of a disaster, backups are available to revert to a previous known good state to minimize data loss or corruption.

This strategy provides low RTO and RPO with frequent synchronization, allowing quick recovery with minimal data loss. The infrastructure costs and code deployments are compounded to maintain both the primary and DR Amazon MWAA environments. Your DR environment is available immediately to run DAGs on.

The following figure illustrates the setup required when your primary Amazon MWAA environment is actively running.

Active Passive pre

The solution comprises four key components. Similar to the backup and restore solution, the first component is the primary environment, where the workflow is initially deployed and is actively running. The second component is the disaster monitoring component, consisting of CloudWatch and a combination of a Step Functions state machine and Lambda function. The third component creates and stores backups for all configurations and metadata required for the database synchronization. This can be in the same Region as your primary or replicated to your DR Region using Amazon S3 Cross-Region Replication. As mentioned earlier, for CRR, you also pay for inter-Region data transfer out from Amazon S3 to each destination Region. The last component is a passive Amazon MWAA environment that has the same Airflow code and environment configurations as the primary. The DAGs are deployed in the DR environment using the same continuous integration and continuous delivery (CI/CD) pipeline as the primary. Unlike the primary, DAGs are kept in a paused state to not cause duplicate runs.

The first steps of the workflow are similar to the backup and restore strategy:

  1. As part of your backup creation process, Airflow metadata is replicated to an S3 bucket using an export DAG utility, run periodically based on your RPO interval.
  2. Your existing primary Amazon MWAA environment automatically emits the status of its scheduler’s health to CloudWatch SchedulerHeartbeat metric.
  3. A multi-step Step Functions state machine is triggered from a periodic Amazon EventBridge schedule to monitor scheduler health status. As the primary step of the state machine, a Lambda function evaluates the status of the SchedulerHeartbeat metric. If the metric is deemed healthy, no action is taken.

The following figure illustrates the final steps of the workflow.

Active Passive post

  1. When the heartbeat count deviates from the normal count for a period of time, DR actions are initiated.
  2. As a first step, a Lambda function triggers an import DAG utility to restore the metadata contents from the backups to the passive Amazon MWAA DR environment. When the imports are complete, the same DAG can un-pause the other Airflow DAGs, making them active for future runs. Any DAG runs that were interrupted during the impairment of the primary environment need to be manually rerun to maintain service level agreements. Future DAG runs are queued to run as per their next configured schedule.

Best practices to improve resiliency of Amazon MWAA

To enhance the resiliency of your Amazon MWAA environment and ensure smooth disaster recovery, consider implementing the following best practices:

  • Robust backup and restore mechanisms – Implementing comprehensive backup and restore mechanisms for Amazon MWAA data is essential. Regularly deleting existing metadata based on your organization’s retention policies reduces backup times and makes your Amazon MWAA environment more performant.
  • Automation using IaC – Using automation and orchestration tools such as AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform can streamline the deployment and configuration management of Amazon MWAA environments. This ensures consistency, reproducibility, and faster recovery during DR scenarios.
  • Idempotent DAGs and tasks – In Airflow, a DAG is considered idempotent if rerunning the same DAG with the same inputs multiple times has the same effect as running it only once. Designing idempotent DAGs and keeping tasks atomic decreases recovery time from failures when you have to manually rerun an interrupted DAG in your recovered environment.
  • Regular testing and validation – A robust Amazon MWAA DR strategy should include regular testing and validation exercises. By simulating disaster scenarios, you can identify any gaps in your DR plans, fine-tune processes, and ensure your Amazon MWAA environments are fully recoverable.


In this post, we explored the challenges for Amazon MWAA disaster recovery and discussed best practices to improve resiliency. We examined two DR strategy solutions: backup and restore and active-passive environments with periodic data synchronization. By implementing these solutions and following best practices, you can protect your Amazon MWAA environments, minimize downtime, and mitigate the impact of disasters. Regular testing, validation, and adaptation to evolving requirements are crucial for an effective Amazon MWAA DR strategy. By continuously evaluating and refining your disaster recovery plans, you can ensure the resilience and uninterrupted operation of your Amazon MWAA environments, even in the face of unforeseen events.

For additional details and code examples on Amazon MWAA, refer to the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

About the Authors

Parnab Basak is a Senior Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new solutions that are cloud native using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space helping customers adopt AWS services for their workflow orchestration needs.

Chandan Rupakheti is a Solutions Architect and a Serverless Specialist at AWS. He is a passionate technical leader, researcher, and mentor with a knack for building innovative solutions in the cloud and bringing stakeholders together in their cloud journey. Outside his professional life, he loves spending time with his family and friends besides listening and playing music.

Vinod Jayendra is a Enterprise Support Lead in ISV accounts at Amazon Web Services, where he helps customers in solving their architectural, operational, and cost optimization challenges. With a particular focus on Serverless technologies, he draws from his extensive background in application development to deliver top-tier solutions. Beyond work, he finds joy in quality family time, embarking on biking adventures, and coaching youth sports team.

Rupesh Tiwari is a Senior Solutions Architect at AWS in New York City, with a focus on Financial Services. He has over 18 years of IT experience in the finance, insurance, and education domains, and specializes in architecting large-scale applications and cloud-native big data workloads. In his spare time, Rupesh enjoys singing karaoke, watching comedy TV series, and creating joyful moments with his family.

Detect, mask, and redact PII data using AWS Glue before loading into Amazon OpenSearch Service

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/big-data/detect-mask-and-redact-pii-data-using-aws-glue-before-loading-into-amazon-opensearch-service/

Many organizations, small and large, are working to migrate and modernize their analytics workloads on Amazon Web Services (AWS). There are many reasons for customers to migrate to AWS, but one of the main reasons is the ability to use fully managed services rather than spending time maintaining infrastructure, patching, monitoring, backups, and more. Leadership and development teams can spend more time optimizing current solutions and even experimenting with new use cases, rather than maintaining the current infrastructure.

With the ability to move fast on AWS, you also need to be responsible with the data you’re receiving and processing as you continue to scale. These responsibilities include being compliant with data privacy laws and regulations and not storing or exposing sensitive data like personally identifiable information (PII) or protected health information (PHI) from upstream sources.

In this post, we walk through a high-level architecture and a specific use case that demonstrates how you can continue to scale your organization’s data platform without needing to spend large amounts of development time to address data privacy concerns. We use AWS Glue to detect, mask, and redact PII data before loading it into Amazon OpenSearch Service.

Solution overview

The following diagram illustrates the high-level solution architecture. We have defined all layers and components of our design in line with the AWS Well-Architected Framework Data Analytics Lens.


The architecture is comprised of a number of components:

Source data

Data may be coming from many tens to hundreds of sources, including databases, file transfers, logs, software as a service (SaaS) applications, and more. Organizations may not always have control over what data comes through these channels and into their downstream storage and applications.

Ingestion: Data lake batch, micro-batch, and streaming

Many organizations land their source data into their data lake in various ways, including batch, micro-batch, and streaming jobs. For example, Amazon EMR, AWS Glue, and AWS Database Migration Service (AWS DMS) can all be used to perform batch and or streaming operations that sink to a data lake on Amazon Simple Storage Service (Amazon S3). Amazon AppFlow can be used to transfer data from different SaaS applications to a data lake. AWS DataSync and AWS Transfer Family can help with moving files to and from a data lake over a number of different protocols. Amazon Kinesis and Amazon MSK also have capabilities to stream data directly to a data lake on Amazon S3.

S3 data lake

Using Amazon S3 for your data lake is in line with the modern data strategy. It provides low-cost storage without sacrificing performance, reliability, or availability. With this approach, you can bring compute to your data as needed and only pay for capacity it needs to run.

In this architecture, raw data can come from a variety of sources (internal and external), which may contain sensitive data.

Using AWS Glue crawlers, we can discover and catalog the data, which will build the table schemas for us, and ultimately make it straightforward to use AWS Glue ETL with the PII transform to detect and mask or and redact any sensitive data that may have landed in the data lake.

Business context and datasets

To demonstrate the value of our approach, let’s imagine you’re part of a data engineering team for a financial services organization. Your requirements are to detect and mask sensitive data as it is ingested into your organization’s cloud environment. The data will be consumed by downstream analytical processes. In the future, your users will be able to safely search historical payment transactions based on data streams collected from internal banking systems. Search results from operation teams, customers, and interfacing applications must be masked in sensitive fields.

The following table shows the data structure used for the solution. For clarity, we have mapped raw to curated column names. You’ll notice that multiple fields within this schema are considered sensitive data, such as first name, last name, Social Security number (SSN), address, credit card number, phone number, email, and IPv4 address.

Raw Column Name Curated Column Name Type
c0 first_name string
c1 last_name string
c2 ssn string
c3 address string
c4 postcode string
c5 country string
c6 purchase_site string
c7 credit_card_number string
c8 credit_card_provider string
c9 currency string
c10 purchase_value integer
c11 transaction_date date
c12 phone_number string
c13 email string
c14 ipv4 string

Use case: PII batch detection before loading to OpenSearch Service

Customers who implement the following architecture have built their data lake on Amazon S3 to run different types of analytics at scale. This solution is suitable for customers who don’t require real-time ingestion to OpenSearch Service and plan to use data integration tools that run on a schedule or are triggered through events.


Before data records land on Amazon S3, we implement an ingestion layer to bring all data streams reliably and securely to the data lake. Kinesis Data Streams is deployed as an ingestion layer for accelerated intake of structured and semi-structured data streams. Examples of these are relational database changes, applications, system logs, or clickstreams. For change data capture (CDC) use cases, you can use Kinesis Data Streams as a target for AWS DMS. Applications or systems generating streams containing sensitive data are sent to the Kinesis data stream via one of the three supported methods: the Amazon Kinesis Agent, the AWS SDK for Java, or the Kinesis Producer Library. As a last step, Amazon Kinesis Data Firehose helps us reliably load near-real-time batches of data into our S3 data lake destination.

The following screenshot shows how data flows through Kinesis Data Streams via the Data Viewer and retrieves sample data that lands on the raw S3 prefix. For this architecture, we followed the data lifecycle for S3 prefixes as recommended in Data lake foundation.

kinesis raw data

As you can see from the details of the first record in the following screenshot, the JSON payload follows the same schema as in the previous section. You can see the unredacted data flowing into the Kinesis data stream, which will be obfuscated later in subsequent stages.


After the data is collected and ingested into Kinesis Data Streams and delivered to the S3 bucket using Kinesis Data Firehose, the processing layer of the architecture takes over. We use the AWS Glue PII transform to automate detection and masking of sensitive data in our pipeline. As shown in the following workflow diagram, we took a no-code, visual ETL approach to implement our transformation job in AWS Glue Studio.

glue studio nodes

First, we access the source Data Catalog table raw from the pii_data_db database. The table has the schema structure presented in the previous section. To keep track of the raw processed data, we used job bookmarks.

glue catalog

We use the AWS Glue DataBrew recipes in the AWS Glue Studio visual ETL job to transform two date attributes to be compatible with OpenSearch expected formats. This allows us to have a full no-code experience.

We use the Detect PII action to identify sensitive columns. We let AWS Glue determine this based on selected patterns, detection threshold, and sample portion of rows from the dataset. In our example, we used patterns that apply specifically to the United States (such as SSNs) and may not detect sensitive data from other countries. You may look for available categories and locations applicable to your use case or use regular expressions (regex) in AWS Glue to create detection entities for sensitive data from other countries.

It’s important to select the correct sampling method that AWS Glue offers. In this example, it’s known that the data coming in from the stream has sensitive data in every row, so it’s not necessary to sample 100% of the rows in the dataset. If you have a requirement where no sensitive data is allowed to downstream sources, consider sampling 100% of the data for the patterns you chose, or scan the entire dataset and act on each individual cell to ensure all sensitive data is detected. The benefit you get from sampling is reduced costs because you don’t have to scan as much data.

PII Options

The Detect PII action allows you to select a default string when masking sensitive data. In our example, we use the string **********.


We use the apply mapping operation to rename and remove unnecessary columns such as ingestion_year, ingestion_month, and ingestion_day. This step also allows us to change the data type of one of the columns (purchase_value) from string to integer.


From this point on, the job splits into two output destinations: OpenSearch Service and Amazon S3.

Our provisioned OpenSearch Service cluster is connected via the OpenSearch built-in connector for Glue. We specify the OpenSearch Index we’d like to write to and the connector handles the credentials, domain and port. In the screen shot below, we write to the specified index index_os_pii.

opensearch config

We store the masked dataset in the curated S3 prefix. There, we have data normalized to a specific use case and safe consumption by data scientists or for ad hoc reporting needs.

opensearch target s3 folder

For unified governance, access control, and audit trails of all datasets and Data Catalog tables, you can use AWS Lake Formation. This helps you restrict access to the AWS Glue Data Catalog tables and underlying data to only those users and roles who have been granted necessary permissions to do so.

After the batch job runs successfully, you can use OpenSearch Service to run search queries or reports. As shown in the following screenshot, the pipeline masked sensitive fields automatically with no code development efforts.

You can identify trends from the operational data, such as the amount of transactions per day filtered by credit card provider, as shown in the preceding screenshot. You can also determine the locations and domains where users make purchases. The transaction_date attribute helps us see these trends over time. The following screenshot shows a record with all of the transaction’s information redacted appropriately.

json masked

For alternate methods on how to load data into Amazon OpenSearch, refer to Loading streaming data into Amazon OpenSearch Service.

Furthermore, sensitive data can also be discovered and masked using other AWS solutions. For example, you could use Amazon Macie to detect sensitive data inside an S3 bucket, and then use Amazon Comprehend to redact the sensitive data that was detected. For more information, refer to Common techniques to detect PHI and PII data using AWS Services.


This post discussed the importance of handling sensitive data within your environment and various methods and architectures to remain compliant while also allowing your organization to scale quickly. You should now have a good understanding of how to detect, mask, or redact and load your data into Amazon OpenSearch Service.

About the authors

Michael Hamilton is a Sr Analytics Solutions Architect focusing on helping enterprise customers modernize and simplify their analytics workloads on AWS. He enjoys mountain biking and spending time with his wife and three children when not working.

Daniel Rozo is a Senior Solutions Architect with AWS supporting customers in the Netherlands. His passion is engineering simple data and analytics solutions and helping customers move to modern data architectures. Outside of work, he enjoys playing tennis and biking.

How to customize access tokens in Amazon Cognito user pools

Post Syndicated from Edward Sun original https://aws.amazon.com/blogs/security/how-to-customize-access-tokens-in-amazon-cognito-user-pools/

With Amazon Cognito, you can implement customer identity and access management (CIAM) into your web and mobile applications. You can add user authentication and access control to your applications in minutes.

In this post, I introduce you to the new access token customization feature for Amazon Cognito user pools and show you how to use it. Access token customization is included in the advanced security features (ASF) of Amazon Cognito. Note that ASF is subject to additional pricing as described on the Amazon Cognito pricing page.

What is access token customization?

When a user signs in to your app, Amazon Cognito verifies their sign-in information, and if the user is authenticated successfully, returns the ID, access, and refresh tokens. The access token, which uses the JSON Web Token (JWT) format following the RFC7519 standard, contains claims in the token payload that identify the principal being authenticated, and session attributes such as authentication time and token expiration time. More importantly, the access token also contains authorization attributes in the form of user group memberships and OAuth scopes. Your applications or API resource servers can evaluate the token claims to authorize specific actions on behalf of users.

With access token customization, you can add application-specific claims to the standard access token and then make fine-grained authorization decisions to provide a differentiated end-user experience. You can refine the original scope claims to further restrict access to your resources and enforce the least privileged access. You can also enrich access tokens with claims from other sources, such as user subscription information stored in an Amazon DynamoDB table. Your application can use this enriched claim to determine the level of access and content available to the user. This reduces the need to build a custom solution to look up attributes in your application’s code, thereby reducing application complexity, improving performance, and smoothing the integration experience with downstream applications.

How do I use the access token customization feature?

Amazon Cognito works with AWS Lambda functions to modify your user pool’s authentication behavior and end-user experience. In this section, you’ll learn how to configure a pre token generation Lambda trigger function and invoke it during the Amazon Cognito authentication process. I’ll also show you an example function to help you write your own Lambda function.

Lambda trigger flow

During a user authentication, you can choose to have Amazon Cognito invoke a pre token generation trigger to enrich and customize your tokens.

Figure 1: Pre token generation trigger flow

Figure 1: Pre token generation trigger flow

Figure 1 illustrates the pre token generation trigger flow. This flow has the following steps:

  1. An end user signs in to your app and authenticates with an Amazon Cognito user pool.
  2. After the user completes the authentication, Amazon Cognito invokes the pre token generation Lambda trigger, and sends event data to your Lambda function, such as userAttributes and scopes, in a pre token generation trigger event.
  3. Your Lambda function code processes token enrichment logic, and returns a response event to Amazon Cognito to indicate the claims that you want to add or suppress.
  4. Amazon Cognito vends a customized JWT to your application.

The pre token generation trigger flow supports OAuth 2.0 grant types, such as the authorization code grant flow and implicit grant flow, and also supports user authentication through the AWS SDK.

Enable access token customization

Your Amazon Cognito user pool delivers two different versions of the pre token generation trigger event to your Lambda function. Trigger event version 1 includes userAttributes, groupConfiguration, and clientMetadata in the event request, which you can use to customize ID token claims. Trigger event version 2 adds scope in the event request, which you can use to customize scopes in the access token in addition to customizing other claims.

In this section, I’ll show you how to update your user pool to trigger event version 2 and enable access token customization.

To enable access token customization

  1. Open the Cognito user pool console, and then choose User pools.
  2. Choose the target user pool for token customization.
  3. On the User pool properties tab, in the Lambda triggers section, choose Add Lambda trigger.
  4. Figure 2: Add Lambda trigger

    Figure 2: Add Lambda trigger

  5. In the Lambda triggers section, do the following:
    1. For Trigger type, select Authentication.
    2. For Authentication, select Pre token generation trigger.
    3. For Trigger event version, select Basic features + access token customization – Recommended. If this option isn’t available to you, make sure that you have enabled advanced security features. You must have advanced security features enabled to access this option.
  6. Figure 3: Select Lambda trigger

    Figure 3: Select Lambda trigger

  7. Select your Lambda function and assign it as the pre token generation trigger. Then choose Add Lambda trigger.
  8. Figure 4: Add Lambda trigger

    Figure 4: Add Lambda trigger

Example pre token generation trigger

Now that you have enabled access token customization, I’ll walk you through a code example of the pre token generation Lambda trigger, and the version 2 trigger event. This code example examines the trigger event request, and adds a new custom claim and a custom OAuth scope in the response for Amazon Cognito to customize the access token to suit various authorization scheme.

Here is an example version 2 trigger event. The event request contains the user attributes from the Amazon Cognito user pool, the original scope claims, and the original group configurations. It has two custom attributes—membership and location—which are collected during the user registration process and stored in the Cognito user pool.

  "version": "2",
  "triggerSource": "TokenGeneration_HostedAuth",
  "region": "us-east-1",
  "userPoolId": "us-east-1_01EXAMPLE",
  "userName": "mytestuser",
  "callerContext": {
    "awsSdkVersion": "aws-sdk-unknown-unknown",
    "clientId": "1example23456789"
  "request": {
    "userAttributes": {
      "sub": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
      "cognito:user_status": "CONFIRMED",
      "email": "[email protected]",
      "email_verified": "true",
      "custom:membership": "Premium",
      "custom:location": "USA"
    "groupConfiguration": {
      "groupsToOverride": [],
      "iamRolesToOverride": [],
      "preferredRole": null
    "scopes": [
  "response": {
    "claimsAndScopeOverrideDetails": null

In the following code example, I transformed the user’s location attribute and membership attribute to add a custom claim and a custom scope. I used the claimsToAddOrOverride field to create a new custom claim called demo:membershipLevel with a membership value of Premium from the event request. I also constructed a new scope with the value of membership:USA.Premium through the scopesToAdd claim, and added the new claim and scope in the event response.

export const handler = function(event, context) {
  // Retrieve user attribute from event request
  const userAttributes = event.request.userAttributes;
  // Add scope to event response
  event.response = {
    "claimsAndScopeOverrideDetails": {
      "idTokenGeneration": {},
      "accessTokenGeneration": {
        "claimsToAddOrOverride": {
          "demo:membershipLevel": userAttributes['custom:membership']
        "scopesToAdd": ["membership:" + userAttributes['custom:location'] + "." + userAttributes['custom:membership']]
  // Return to Amazon Cognito
  context.done(null, event);

With the preceding code, the Lambda trigger sends the following response back to Amazon Cognito to indicate the customization that was needed for the access tokens.

"response": {
  "claimsAndScopeOverrideDetails": {
    "idTokenGeneration": {},
    "accessTokenGeneration": {
      "claimsToAddOrOverride": {
        "demo:membershipLevel": "Premium"
      "scopesToAdd": [

Then Amazon Cognito issues tokens with these customizations at runtime:

  "sub": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
  "iss": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_01EXAMPLE",
  "version": 2,
  "client_id": "1example23456789",
  "event_id": "01faa385-562d-4730-8c3b-458e5c8f537b",
  "token_use": "access",
  "demo:membershipLevel": "Premium",
  "scope": "openid profile email membership:USA.Premium",
  "auth_time": 1702270800,
  "exp": 1702271100,
  "iat": 1702270800,
  "jti": "d903dcdf-8c73-45e3-bf44-51bf7c395e06",
  "username": "mytestuser"

Your application can then use the newly-minted, custom scope and claim to authorize users and provide them with a personalized experience.

Considerations and best practices

There are four general considerations and best practices that you can follow:

  1. Some claims and scopes aren’t customizable. For example, you can’t customize claims such as auth_time, iss, and sub, or scopes such as aws.cognito.signin.user.admin. For the full list of excluded claims and scopes, see the Excluded claims and scopes.
  2. Work backwards from authorization. When you customize access tokens, you should start with your existing authorization schema and then decide whether to customize the scopes or claims, or both. Standard OAuth based authorization scenarios, such as Amazon API Gateway authorizers, typically use custom scopes to provide access. However, if you have complex or fine-grained authorization requirements, then you should consider using both scopes and custom claims to pass additional contextual data to the application or to a policy-based access control service such as Amazon Verified Permission.
  3. Establish governance in token customization. You should have a consistent company engineering policy to provide nomenclature guidance for scopes and claims. A syntax standard promotes globally unique variables and avoids a name collision across different application teams. For example, Application X at AnyCompany can choose to name their scope as ac.appx.claim_name, where ac represents AnyCompany as a global identifier and appx.claim_name represents Application X’s custom claim.
  4. Be aware of limits. Because tokens are passed through various networks and systems, you need to be aware of potential token size limitations in your systems. You should keep scope and claim names as short as possible, while still being descriptive.


In this post, you learned how to integrate a pre token generation Lambda trigger with your Amazon Cognito user pool to customize access tokens. You can use the access token customization feature to provide differentiated services to your end users based on claims and OAuth scopes. For more information, see pre token generation Lambda trigger in the Amazon Cognito Developer Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Edward Sun

Edward Sun

Edward is a Security Specialist Solutions Architect focused on identity and access management. He loves helping customers throughout their cloud transformation journey with architecture design, security best practices, migration, and cost optimizations. Outside of work, Edward enjoys hiking, golfing, and cheering for his alma mater, the Georgia Bulldogs.

How to use AWS Secrets Manager and ABAC for enhanced secrets management in Amazon EKS

Post Syndicated from Nima Fotouhi original https://aws.amazon.com/blogs/security/how-to-use-aws-secrets-manager-and-abac-for-enhanced-secrets-management-in-amazon-eks/

In this post, we show you how to apply attribute-based access control (ABAC) while you store and manage your Amazon Elastic Kubernetes Services (Amazon EKS) workload secrets in AWS Secrets Manager, and then retrieve them by integrating Secrets Manager with Amazon EKS using External Secrets Operator to define more fine-grained and dynamic AWS Identity and Access Management (IAM) permission policies for accessing secrets.

It’s common to manage numerous workloads in an EKS cluster, each necessitating access to a distinct set of secrets. You can verify adherence to the principle of least privilege by creating separate permission policies for each workload to restrict their access. To scale and reduce overhead, Amazon Web Services (AWS) recommends using ABAC to manage workloads’ access to secrets. ABAC helps reduce the number of permission policies needed to scale with your environment.

What is ABAC?

In IAM, a traditional authorization approach is known as role-based access control (RBAC). RBAC sets permissions based on a person’s job function, commonly known as IAM roles. To enforce RBAC in IAM, distinct policies for various job roles are created. As a best practice, only the minimum permissions required for a specific role are granted (principle of least privilege), which is achieved by specifying the resources that the role can access. A limitation of the RBAC model is its lack of flexibility. Whenever new resources are introduced, you must modify policies to permit access to the newly added resources.

Attribute-based access control (ABAC) is an approach to authorization that assigns permissions in accordance with attributes, which in the context of AWS are referred to as tags. You create and add tags to your IAM resources. You then create and configure ABAC policies to permit operations requested by a principal when there’s a match between the tags of the principal and the resource. When a principal uses temporary credentials to make a request, its associated tags come from session tags, incoming transitive sessions tags, and IAM tags. The principal’s IAM tags are persistent, but session tags, and incoming transitive session tags are temporary and set when the principal assumes an IAM role. Note that AWS tags are attached to AWS resources, whereas session tags are only valid for the current session and expire with the session.

How External Secrets Operator works

External Secrets Operator (ESO) is a Kubernetes operator that integrates external secret management systems including Secrets Manager with Kubernetes. ESO provides Kubernetes custom resources to extend Kubernetes and integrate it with Secrets Manager. It fetches secrets and makes them available to other Kubernetes resources by creating Kubernetes Secrets. At a basic level, you need to create an ESO SecretStore resource and one or more ESO ExternalSecret resources. The SecretStore resource specifies how to access the external secret management system (Secrets Manager) and allows you to define ABAC related properties (for example, session tags and transitive tags).

You declare what data (secret) to fetch and how the data should be transformed and saved as a Kubernetes Secret in the ExternalSecret resource. The following figure shows an overview of the process for creating Kubernetes Secrets. Later in this post, we review the steps in more detail.

Figure 1: ESO process

Figure 1: ESO process

How to use ESO for ABAC

Before creating any ESO resources, you must make sure that the operator has sufficient permissions to access Secrets Manager. ESO offers multiple ways to authenticate to AWS. For the purpose of this solution, you will use the controller’s pod identity. To implement this method, you configure the ESO service account to assume an IAM role for service accounts (IRSA), which is used by ESO to make requests to AWS.

To adhere to the principle of least privilege and verify that each Kubernetes workload can access only its designated secrets, you will use ABAC policies. As we mentioned, tags are the attributes used for ABAC in the context of AWS. For example, principal and secret tags can be compared to create ABAC policies to deny or allow access to secrets. Secret tags are static tags assigned to secrets symbolizing the workload consuming the secret. On the other hand, principal (requester) tags are dynamically modified, incorporating workload specific tags. The only viable option to dynamically modifying principal tags is to use session tags and incoming transitive session tags. However, as of this writing, there is no way to add session and transitive tags when assuming an IRSA. The workaround for this issue is role chaining and passing session tags when assuming downstream roles. ESO offers role chaining, meaning that you can refer to one or more IAM roles with access to Secrets Manager in the SecretStore resource definition, and ESO will chain them with its IRSA to access secrets. It also allows you to define session tags and transitive tags to be passed when ESO assumes the IAM roles with its primary IRSA. The ability to pass session tags allows you to implement ABAC and compare principal tags (including session tags) with secret tags every time ESO sends a request to Secrets Manager to fetch a secret. The following figure shows ESO authentication process with role chaining in one Kubernetes namespace.

Figure 2: ESO AWS authentication process with role chaining (single namespace)

Figure 2: ESO AWS authentication process with role chaining (single namespace)

Architecture overview

Let’s review implementing ABAC with a real-world example. When you have multiple workloads and services in your Amazon EKS cluster, each service is deployed in its own unique namespace, and service secrets are stored in Secrets Manager and tagged with a service name (key=service, value=service name). The following figure shows the required resources to implement ABAC with EKS and Secrets Manager.

Figure 3: Amazon EKS secrets management with ABAC

Figure 3: Amazon EKS secrets management with ABAC


Deploy the solution

Begin by installing ESO:

  1. From a terminal where you usually run your helm commands, run the following helm command to add an ESO helm repository.
    helm repo add external-secrets https://charts.external-secrets.io

  2. Install ESO using the following helm command in a terminal that has access to your target Amazon EKS cluster:
    helm install external-secrets \
       external-secrets/external-secrets \
        -n external-secrets \
        --create-namespace \
       --set installCRDs=true 

  3. To verify ESO installation, run the following command. Make sure you pass the same namespace as the one you used when installing ESO:
    kubectl get pods -n external-secrets

See the ESO Getting started documentation page for more information on other installation methods, installation options, and how to uninstall ESO.

Create an IAM role to access Secrets Manager secrets

You must create an IAM role with access to Secrets Manager secrets. Start by creating a customer managed policy to attach to your role. Your policy should allow reading secrets from Secrets Manager. The following example shows a policy that you can create for your role:

	"Version": "2012-10-17",
	"Statement": [
			"Effect": "Allow",k
			"Action": [
			"Resource": "*"
			"Effect": "Allow",
			"Action": [
			"Resource": <KMS Key ARN>
			"Effect": "Allow",
			"Action": [ 
			"Resource": "*",
			"Condition": {
				"StringEquals": {
					"secretsmanager:ResourceTag/ekssecret": "${aws:PrincipalTag/ekssecret}"

Consider the following in this policy:

  • Secrets Manager uses an AWS managed key for Secrets Manager by default to encrypt your secrets. It’s recommended to specify another encryption key during secret creation and have separate keys for separate workloads. Modify the resource element of the second policy statement and replace <KMS Key ARN> with the KMS key ARNs used to encrypt your secrets. If you use the default key to encrypt your secrets, you can remove this statement.
  • The policy statement conditionally allows access to all secrets. The condition element permits access only when the value of the principal tag, identified by the key service, matches the value of the secret tag with the same key. You can include multiple conditions (in separate statements) to match multiple tags.

After you create your policy, follow the guide for Creating IAM roles to create your role, attaching the policy you created. Use the default value for your role’s trust relationship for now, you will update the trust relationship in the next step. Note the role’s ARN after creation.

Create an IAM role for the ESO service account

Use eksctl to create the IAM role for the ESO service account (IRSA). Before creating the role, you must create an IAM policy. ESO IRSA only needs permission to assume the Secrets Manager access role that you created in the previous step.

  1. Use the following example of an IAM policy that you can create. Replace <Secrets Manager Access Role ARN> with the ARN of the role you created in the previous step and follow creating a customer managed policy to create the policy. After creating the policy, note the policy ARN.
        "Version": "2012-10-17",
        "Statement": [
                "Effect": "Allow",
                "Action": [
                "Resource": "<Secrets Manager Access Role ARN>"

  2. Next, run the following command to get the account name of the ESO service. You will see a list of service accounts, pick the one that has the same name as your helm release, in this example, the service account is external-secrets.
    kubectl get serviceaccounts -n external-secrets

  3. Next, create an IRSA and configure an ESO service account to assume the role. Run the following command to create a new role and associate it with the ESO service account. Replace the variables in brackets (<example>) with your specific information:
    eksctl create iamserviceaccount --name <ESO service account> \
    --namespace <ESO namespace> --cluster <cluster name> \
    --role-name <IRSA name> --override-existing-serviceaccounts \
    --attach-policy-arn <policy arn you created earlier> --approve

    You can validate the operation by following the steps listed in Configuring a Kubernetes service account to assume an IAM role. Note that you had to pass the ‑‑override-existing-serviceaccounts argument because the ESO service account was already created.

  4. After you’ve validated the operation, run the following command to retrieve the IRSA ARN (replace <IRSA name> with the name you used in the previous step):
    aws iam get-role --role-name <IRSA name> --query Role.Arn

  5. Modify the trust relationship of the role you created previously and limit it to your newly created IRSA. The following should resemble your trust relationship. Replace <IRSA Arn> with the IRSA ARN returned in the previous step:
        "Version": "2012-10-17",
        "Statement": [
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<AWS ACCOUNT ID>:root"
                "Action": "sts:AssumeRole",
                "Condition": {
                    "ArnEquals": {
                        "aws:PrincipalArn": "<IRSA Arn>"
                "Effect": "Allow",
                "Principal": {
                    "AWS": "<IRSA Arn>"
                "Action": "sts:TagSession",
                "Condition": {
                    "StringLike": {
                        "aws:RequestTag/ekssecret": "*"

Note that you will be using session tags to implement ABAC. When using session tags, trust policies for all roles connected to the identity provider (IdP) passing the tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.

Moreover, the condition block of the second statement limits ESO’s ability to pass session tags with the key name ekssecret. We’re using this condition to verify that the ESO role can only create session tags used for accessing secrets manager, and doesn’t gain the ability to set principal tags that might be used for any other purpose. This way, you’re creating a namespace to help prevent further privilege escalations or escapes.

Create secrets in Secrets Manager

You can create two secrets in Secrets Manager and tag them.

  1. Follow the steps in Create an AWS Secrets Manager secret to create two secrets named service1_secret and service2_secret. Add the following tags to your secrets:
    • service1_secret:
      • key=ekssecret, value=service1
    • service2_secret:
      • key=ekssecret, value=service2
  2. Run the following command to verify both secrets are created and tagged properly:
    aws secretsmanager list-secrets --query 'SecretList[*].{Name:Name, Tags:Tags}'

Create ESO objects in your cluster

  1. Create two namespaces in your cluster:
    ❯ kubectl create ns service1-ns
    ❯ kubectl create ns service2-ns

Assume that service1-ns hosts service1 and service2-ns hosts service2. After creating the namespaces for your services, verify that each service is restricted to accessing secrets that are tagged with a specific key-value pair. In this example the key should be ekssecret and the value should match the name of the corresponding service. This means that service1 should only have access to service1_secret, while service2 should only have access to service2_secret. Next, declare session tags in SecretStore object definitions.

  1. Edit the following command snippet using the text editor of your choice and replace every instance of <Secrets Manager Access Role ARN> with the ARN of the IAM role you created earlier to access Secrets Manager secrets. Copy and paste the edited command in your terminal and run it to create a .yaml file in your working directory that contains the SecretStore definitions. Make sure to change the AWS Region to reflect the Region of your Secrets Manager.
    cat > secretstore.yml <<EOF
    apiVersion: external-secrets.io/v1beta1
    kind: SecretStore
      name: aws-secretsmanager
      namespace: service1-ns
          service: SecretsManager
          role: <Secrets Manager Access Role ARN>
          region: us-west-2
            - key: ekssecret
              value: service1
    apiVersion: external-secrets.io/v1beta1
    kind: SecretStore
      name: aws-secretsmanager
      namespace: service2-ns
          service: SecretsManager
          role: <Secrets Manager Access Role ARN>
          region: us-west-2
            - key: ekssecret
              value: service2

  2. Create SecretStore objects by running the following command:
    kubectl apply -f secretstore.yml

  3. Validate object creation by running the following command:
    kubectl describe secretstores.external-secrets.io -A

  4. Check the status and events section for each object and make sure the store is validated.
  5. Next, create two ExternalSecret objects requesting service1_secret and service2_secret. Copy and paste the following command in your terminal and run it. The command will create a .yaml file in your working directory that contains ExternalSecret definitions.
    cat > exrternalsecret.yml <<EOF
    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
      name: service1-es1
      namespace: service1-ns
      refreshInterval: 1h
        name: aws-secretsmanager
        kind: SecretStore
        name: service1-ns-secret1
        creationPolicy: Owner
      - secretKey: service1_secret
          key: "service1_secret"
    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
      name: service2-es2
      namespace: service2-ns
      refreshInterval: 1h
        name: aws-secretsmanager
        kind: SecretStore
        name: service1-ns-secret2
        creationPolicy: Owner
      - secretKey: service2_secret
          key: "service2_secret"

  6. Run the following command to create objects:
    kubectl apply -f exrternalsecret.yml

  7. Verify the objects are created by running following command:
    kubectl get externalsecrets.external-secrets.io -A

  8. Each ExternalSecret object should create a Kubernetes secret in the same namespace it was created in. Kubernetes secrets are accessible to services in the same namespace. To demonstrate that both Service A and Service B has access to their secrets, run the following command.
    kubectl get secrets -A

You should see service1-ns-secret1 created in service1-ns namespace which is accessible to Service 1, and service1-ns-secret2 created in service2-ns which is accessible to Service2.

Try creating an ExternalSecrets object in service1-ns referencing service2_secret. Notice that your object shows SecretSyncedError status. This is the expected behavior, because ESO passes different session tags for ExternalSecret objects in each namespace, and when the tag where key is ekssecret doesn’t match the secret tag with the same key, the request will be rejected.

What about AWS Secrets and Configuration Provider (ASCP)?

Amazon offers a capability called AWS Secrets and Configuration Provider (ASCP), which allows applications to consume secrets directly from external stores, including Secrets Manager, without modifying the application code. ASCP is actively maintained by AWS, which makes sure that it remains up to date and aligned with the latest features introduced in Secrets Manager. See How to use AWS Secrets & Configuration Provider with your Kubernetes Secrets Store CSI driver to learn more about how to use ASCP to retrieve secrets from Secrets Manager.

Today, customers who use AWS Fargate with Amazon EKS can’t use the ASCP method due to the incompatibility of daemonsets on Fargate. Kubernetes also doesn’t provide a mechanism to add specific claims to JSON web tokens (JWT) used to assume IAM roles. Today, when using ASCP in Kubernetes, which assumes IAM roles through IAM roles for service accounts (IRSA), there’s a constraint in appending session tags during the IRSA assumption due to JWT claim restrictions, limiting the ability to implement ABAC.

With ESO, you can create Kubernetes Secrets and have your pods retrieve secrets from them instead of directly mounting secrets as volumes in your pods. ESO is also capable of using its controller pod’s IRSA to retrieve secrets, so you don’t need to set up IRSA for each pod. You can also role chain and specify secondary roles to be assumed by ESO IRSA and pass session tags to be used with ABAC policies. ESO’s role chaining and ABAC capabilities help decrease the number of IAM roles required for secrets retrieval. See Leverage AWS secrets stores from EKS Fargate with External Secrets Operator on the AWS Containers blog to learn how to use ESO on an EKS Fargate cluster to consume secrets stored in Secrets Manager.


In this blog post, we walked you through how to implement ABAC with Amazon EKS and Secrets Manager using External Secrets Operator. Implementing ABAC allows you to create a single IAM role for accessing Secrets Manager secrets while implementing granular permissions. ABAC also decreases your team’s overhead and reduces the risk of misconfigurations. With ABAC, you require fewer policies and don’t need to update existing policies to allow access to new services and workloads.

If you have feedback about this post, submit comments in the Comments section below.

Nima Fotouhi

Nima Fotouhi

Nima is a Security Consultant at AWS. He’s a builder with a passion for infrastructure as code (IaC) and policy as code (PaC) and helps customers build secure infrastructure on AWS. In his spare time, he loves to hit the slopes and go snowboarding.

Sandeep Singh

Sandeep is a DevOps Consultant at AWS Professional Services. He focuses on helping customers in their journey to the cloud and within the cloud ecosystem by building performant, resilient, scalable, secure, and cost-efficient solutions.

Amazon OpenSearch Service search enhancements: 2023 roundup

Post Syndicated from Dagney Braun original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-search-enhancements-2023-roundup/

What users expect from search engines has evolved over the years. Just returning lexically relevant results quickly is no longer enough for most users. Now users seek methods that allow them to get even more relevant results through semantic understanding or even search through image visual similarities instead of textual search of metadata. Amazon OpenSearch Service includes many features that allow you to enhance your search experience. We are excited about the OpenSearch Service features and enhancements we’ve added to that toolkit in 2023.

2023 was a year of rapid innovation within the artificial intelligence (AI) and machine learning (ML) space, and search has been a significant beneficiary of that progress. Throughout 2023, Amazon OpenSearch Service invested in enabling search teams to use the latest AI/ML technologies to improve and augment your existing search experiences, without having to rewrite your applications or build bespoke orchestrations, resulting in unlocking rapid development, iteration, and productization. These investments include the introduction of new search methods as well as functionality to simplify implementation of the methods available, which we review in this post.

Background: Lexical and semantic search

Before we get started, let’s review lexical and semantic search.

Lexical search

In lexical search, the search engine compares the words in the search query to the words in the documents, matching word for word. Only items that have words the user typed match the query. Traditional lexical search, based on term frequency models like BM25, is widely used and effective for many search applications. However, lexical search techniques struggle to go beyond the words included in the user’s query, resulting in highly relevant potential results not always being returned.

Semantic search

In semantic search, the search engine uses an ML model to encode text or other media (such as images and videos) from the source documents as a dense vector in a high-dimensional vector space. This is also called embedding the text into the vector space. It similarly codes the query as a vector and then uses a distance metric to find nearby vectors in the multi-dimensional space to find matches. The algorithm for finding nearby vectors is called k-nearest neighbors (k-NN). Semantic search doesn’t match individual query terms—it finds documents whose vector embedding is near the query’s embedding in the vector space and therefore semantically similar to the query. This allows you to return highly relevant items even if they don’t contain any of the words that were in the query.

OpenSearch has provided vector similarity search (k-NN and approximate k-NN) for several years, which has been valuable for customers who adopted it. However, not all customers who have the opportunity to benefit from k-NN have adopted it, due to the significant engineering effort and resources required to do so.

2023 releases: Fundamentals

In 2023 several features and improvements were launched on OpenSearch Service, including new features which are fundamental building blocks for continued search enhancements.

The OpenSearch Compare Search Results tool

The Compare Search Results tool, generally available in OpenSearch Service version 2.11, allows you to compare search results from two ranking techniques side by side, in OpenSearch Dashboards, to determine whether one query produces better results than the other. For customers who are interested in experimenting with the latest search methods powered by ML-assisted models, the ability to compare search results is critical. This can include comparing lexical search, semantic search, and hybrid search techniques to understand the benefits of each technique against your corpus, or adjustments such as field weighting and different stemming or lemmatization strategies.

The following screenshot shows an example of using the Compare Search Results tool.

To learn more about semantic search and cross-modal search and experiment with a demo of the Compare Search Results tool, refer to Try semantic search with the Amazon OpenSearch Service vector engine.

Search pipelines

Search practitioners are looking to introduce new ways to enhance search queries as well as results. With the general availability of search pipelines, starting in OpenSearch Service version 2.9, you can build search query and result processing as a composition of modular processing steps, without complicating your application software. By integrating processors for functions such as filters, and with the ability to add a script to run on newly indexed documents, you can make your search applications more accurate and efficient and reduce the need for custom development.

Search pipelines incorporate three built-in processors: filter_query, rename_field, and script request, as well as new developer-focused APIs to enable developers who want to build their own processors to do so. OpenSearch will continue adding additional built-in processors to further expand on this functionality in the coming releases.

The following diagram illustrates the search pipelines architecture.

Byte-sized vectors in Lucene

Until now, the k-NN plugin in OpenSearch has supported indexing and querying vectors of type float, with each vector element occupying 4 bytes. This can be expensive in memory and storage, especially for large-scale use cases. With the new byte vector feature in OpenSearch Service version 2.9, you can reduce memory requirements by a factor of 4 and significantly reduce search latency, with minimal loss in quality (recall). To learn more, refer to Byte-quantized vectors in OpenSearch.

Support for new language analyzers

OpenSearch Service previously supported language analyzer plugins such as IK (Chinese), Kuromoji (Japanese), and Seunjeon (Korean), among several others. We added support for Nori (Korean), Sudachi (Japanese), Pinyin (Chinese), and STConvert Analysis (Chinese). These new plugins are available as a new package type, ZIP-PLUGIN, along with the previously supported TXT-DICTIONARY package type. You can navigate to the Packages page of the OpenSearch Service console to associate these plugins to your cluster, or use the AssociatePackage API.

2023 releases: Ease-of-use enhancements

The OpenSearch Service also made improvements in 2023 to enhance ease of use within key search features.

Semantic search with neural search

Previously, implementing semantic search meant that your application was responsible for the middleware to integrate text embedding models into search and ingest, orchestrating the encoding the corpus, and then using a k-NN search at query time.

OpenSearch Service introduced neural search in version 2.9, enabling builders to create and operationalize semantic search applications with significantly reduced undifferentiated heavy lifting. Your application no longer needs to deal with the vectorization of documents and queries; semantic search does that, and invokes k-NN during query time. Semantic search via the neural search feature transforms documents or other media into vector embeddings and indexes both the text and its vector embeddings in a vector index. When you use a neural query during search, neural search converts the query text into a vector embedding, uses vector search to compare the query and document embeddings, and returns the closest results. This functionality was initially released as experimental in OpenSearch Service version 2.4, and is now generally available with version 2.9.

AI/ML connectors to enable AI-powered search features

With OpenSearch Service 2.9, you can use out-of-the-box AI connectors to AWS AI and ML services and third-party alternatives to power features like neural search. For instance, you can connect to external ML models hosted on Amazon SageMaker, which provides comprehensive capabilities to manage models successfully in production. If you want to use the latest foundation models via a fully managed experience, you can use connectors for Amazon Bedrock to power use cases like multimodal search. Our initial release includes a connector to Cohere Embed, and through SageMaker and Amazon Bedrock, you have access to more third-party options. You can configure some of these integrations on your domains through the OpenSearch Service console integrations (see the following screenshot), and even automate model deployment to SageMaker.

Integrated models are cataloged in your OpenSearch Service domain, so that your team can discover the variety of models that are integrated and readily available for use. You even have the option to enable granular security controls on your model and connector resources to govern model and connector level access.

To foster an open ecosystem, we created a framework to empower partners to easily build and publish AI connectors. Technology providers can simply create a blueprint, which is a JSON document that describes secure RESTful communication between OpenSearch and your service. Technology partners can publish their connectors on our community site, and you can immediately use these AI connectors—whether for a self-managed cluster or on OpenSearch Service. You can find blueprints for each connector in the ML Commons GitHub repository.

Hybrid search supported by score combination

Semantic technologies such as vector embeddings for neural search and generative AI large language models (LLMs) for natural language processing have revolutionized search, reducing the need for manual synonym list management and fine-tuning. On the other hand, text-based (lexical) search outperforms semantic search in some important cases, such as part numbers or brand names. Hybrid search, the combination of the two methods, gives 14% higher search relevancy (as measured by NDCG@10—a measure of ranking quality) than BM25 alone, so customers want to use hybrid search to get the best of both. For more information about detailed benchmarking score accuracy and performance, refer to Improve search relevance with hybrid search, generally available in OpenSearch 2.10.

Until now, combining them has been challenging given the different relevancy scales for each method. Previously, to implement a hybrid approach, you had to run multiple queries independently, then normalize and combine scores outside of OpenSearch. With the launch of the new hybrid score combination and normalization query type in OpenSearch Service 2.11, OpenSearch handles score normalization and combination in one query, making hybrid search easier to implement and a more efficient way to improve search relevance.

New search methods

Lastly, OpenSearch Service now features new search methods.

Neural sparse retrieval

OpenSearch Service 2.11 introduced neural sparse search, a new kind of sparse embedding method that is similar in many ways to classic term-based indexing, but with low-frequency words and phrases better represented. Sparse semantic retrieval uses transformer models (such as BERT) to build information-rich embeddings that solve for the vocabulary mismatch problem in a scalable way, while having similar computational cost and latency to lexical search. This new sparse retrieval functionality with OpenSearch offers two modes with different advantages: a document-only mode and a bi-encoder mode. The document-only mode can deliver low-latency performance more comparable to BM25 search, with limitations for advanced syntax as compared to dense methods. The bi-encoder mode can maximize search relevance while performing at higher latencies. With this update, you can now choose the method that works best for your performance, accuracy, and cost requirements.

Multi-modal search

OpenSearch Service 2.11 introduces text and image multimodal search using neural search. This functionality allows you to search image and text pairs, like product catalog items (product image and description), based on visual and semantic similarity. This enables new search experiences that can deliver more relevant results. For instance, you can search for “white blouse” to retrieve products with images that match that description, even if the product title is “cream colored shirt.” The ML model that powers this experience is able to associate semantics and visual characteristics. You can also search by image to retrieve visually similar products or search by both text and image to find the products most similar to a particular product catalog item.

You can now build these capabilities into your application to connect directly to multimodal models and run multimodal search queries without having to build custom middleware. The Amazon Titan Multimodal Embeddings model can be integrated with OpenSearch Service to support this method. Refer to Multimodal search for guidance on how to get started with multimodal semantic search, and look out for more input types to be added in future releases. You can also try out the demo of cross-modal textual and image search, which shows searching for images using textual descriptions.


OpenSearch Service offers an array of different tools to build your search application, but the best implementation will depend on your corpus and your business needs and goals. We encourage search practitioners to begin testing the search methods available in order to find the right fit for your use case. In 2024 and beyond, you can expect to continue to see this fast pace of search innovation in order to keep the latest and greatest search technologies at the fingertips of OpenSearch search practitioners.

About the Authors

Dagney Braun is a Senior Manager of Product at Amazon Web Services OpenSearch Team. She is passionate about improving the ease of use of OpenSearch, and expanding the tools available to better support all customer use-cases.

Stavros Macrakis is a Senior Technical Product Manager on the OpenSearch project of Amazon Web Services. He is passionate about giving customers the tools to improve the quality of their search results.

Dylan Tong is a Senior Product Manager at Amazon Web Services. He leads the product initiatives for AI and machine learning (ML) on OpenSearch including OpenSearch’s vector database capabilities. Dylan has decades of experience working directly with customers and creating products and solutions in the database, analytics and AI/ML domain. Dylan holds a BSc and MEng degree in Computer Science from Cornell University.

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/big-data/architectural-patterns-for-real-time-analytics-using-amazon-kinesis-data-streams-part-1/

We’re living in the age of real-time data and insights, driven by low-latency data streaming applications. Today, everyone expects a personalized experience in any application, and organizations are constantly innovating to increase their speed of business operation and decision making. The volume of time-sensitive data produced is increasing rapidly, with different formats of data being introduced across new businesses and customer use cases. Therefore, it is critical for organizations to embrace a low-latency, scalable, and reliable data streaming infrastructure to deliver real-time business applications and better customer experiences.

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. It aims to provide a framework to create low-latency streaming applications on the AWS Cloud using Amazon Kinesis Data Streams and AWS purpose-built data analytics services.

In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices. In the subsequent post in our series, we will explore the architectural patterns in building streaming pipelines for real-time BI dashboards, contact center agent, ledger data, personalized real-time recommendation, log analytics, IoT data, Change Data Capture, and real-time marketing data. All these architecture patterns are integrated with Amazon Kinesis Data Streams.

Real-time streaming with Kinesis Data Streams

Amazon Kinesis Data Streams is a cloud-native, serverless streaming data service that makes it easy to capture, process, and store real-time data at any scale. With Kinesis Data Streams, you can collect and process hundreds of gigabytes of data per second from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time. The collected data is available in milliseconds to allow real-time analytics use cases, such as real-time dashboards, real-time anomaly detection, and dynamic pricing. By default, the data within the Kinesis Data Stream is stored for 24 hours with an option to increase the data retention to 365 days. If customers want to process the same data in real-time with multiple applications, then they can use the Enhanced Fan-Out (EFO) feature. Prior to this feature, every application consuming data from the stream shared the 2MB/second/shard output. By configuring stream consumers to use enhanced fan-out, each data consumer receives dedicated 2MB/second pipe of read throughput per shard to further reduce the latency in data retrieval.

For high availability and durability, Kinesis Data Streams achieves high durability by synchronously replicating the streamed data across three Availability Zones in an AWS Region and gives you the option to retain data for up to 365 days. For security, Kinesis Data Streams provide server-side encryption so you can meet strict data management requirements by encrypting your data at rest and Amazon Virtual Private Cloud (VPC) interface endpoints to keep traffic between your Amazon VPC and Kinesis Data Streams private.

Kinesis Data Streams has native integrations with other AWS services such as AWS Glue and Amazon EventBridge to build real-time streaming applications on AWS. Refer to Amazon Kinesis Data Streams integrations for additional details.

Modern data streaming architecture with Kinesis Data Streams

A modern streaming data architecture with Kinesis Data Streams can be designed as a stack of five logical layers; each layer is composed of multiple purpose-built components that address specific requirements, as illustrated in the following diagram:

The architecture consists of the following key components:

  • Streaming sources – Your source of streaming data includes data sources like clickstream data, sensors, social media, Internet of Things (IoT) devices, log files generated by using your web and mobile applications, and mobile devices that generate semi-structured and unstructured data as continuous streams at high velocity.
  • Stream ingestion – The stream ingestion layer is responsible for ingesting data into the stream storage layer. It provides the ability to collect data from tens of thousands of data sources and ingest in real time. You can use the Kinesis SDK for ingesting streaming data through APIs, the Kinesis Producer Library for building high-performance and long-running streaming producers, or a Kinesis agent for collecting a set of files and ingesting them into Kinesis Data Streams. In addition, you can use many pre-build integrations such as AWS Database Migration Service (AWS DMS), Amazon DynamoDB, and AWS IoT Core to ingest data in a no-code fashion. You can also ingest data from third-party platforms such as Apache Spark and Apache Kafka Connect
  • Stream storage – Kinesis Data Streams offer two modes to support the data throughput: On-Demand and Provisioned. On-Demand mode, now the default choice, can elastically scale to absorb variable throughputs, so that customers do not need to worry about capacity management and pay by data throughput. The On-Demand mode automatically scales up 2x the stream capacity over its historic maximum data ingestion to provide sufficient capacity for unexpected spikes in data ingestion. Alternatively, customers who want granular control over stream resources can use the Provisioned mode and proactively scale up and down the number of Shards to meet their throughput requirements. Additionally, Kinesis Data Streams can store streaming data up to 24 hours by default, but can extend to 7 days or 365 days depending upon use cases. Multiple applications can consume the same stream.
  • Stream processing – The stream processing layer is responsible for transforming data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. The streaming records are read in the order they are produced, allowing for real-time analytics, building event-driven applications or streaming ETL (extract, transform, and load). You can use Amazon Managed Service for Apache Flink for complex stream data processing, AWS Lambda for stateless stream data processing, and AWS Glue & Amazon EMR for near-real-time compute. You can also build customized consumer applications with Kinesis Consumer Library, which will take care of many complex tasks associated with distributed computing.
  • Destination – The destination layer is like a purpose-built destination depending on your use case. You can stream data directly to Amazon Redshift for data warehousing and Amazon EventBridge for building event-driven applications. You can also use Amazon Kinesis Data Firehose for streaming integration where you can light stream processing with AWS Lambda, and then deliver processed streaming into destinations like Amazon S3 data lake, OpenSearch Service for operational analytics, a Redshift data warehouse, No-SQL databases like Amazon DynamoDB, and relational databases like Amazon RDS to consume real-time streams into business applications. The destination can be an event-driven application for real-time dashboards, automatic decisions based on processed streaming data, real-time altering, and more.

Real-time analytics architecture for time series

Time series data is a sequence of data points recorded over a time interval for measuring events that change over time. Examples are stock prices over time, webpage clickstreams, and device logs over time. Customers can use time series data to monitor changes over time, so that they can detect anomalies, identify patterns, and analyze how certain variables are influenced over time. Time series data is typically generated from multiple sources in high volumes, and it needs to be cost-effectively collected in near real time.

Typically, there are three primary goals that customers want to achieve in processing time-series data:

  • Gain insights real-time into system performance and detect anomalies
  • Understand end-user behavior to track trends and query/build visualizations from these insights
  • Have a durable storage solution to ingest and store both archival and frequently accessed data.

With Kinesis Data Streams, customers can continuously capture terabytes of time series data from thousands of sources for cleaning, enrichment, storage, analysis, and visualization.

The following architecture pattern illustrates how real time analytics can be achieved for Time Series data with Kinesis Data Streams:

Build a serverless streaming data pipeline for time series data

The workflow steps are as follows:

  1. Data Ingestion & Storage – Kinesis Data Streams can continuously capture and store terabytes of data from thousands of sources.
  2. Stream Processing – An application created with Amazon Managed Service for Apache Flink can read the records from the data stream to detect and clean any errors in the time series data and enrich the data with specific metadata to optimize operational analytics. Using a data stream in the middle provides the advantage of using the time series data in other processes and solutions at the same time. A Lambda function is then invoked with these events, and can perform time series calculations in memory.
  3. Destinations – After cleaning and enrichment, the processed time series data can be streamed to Amazon Timestream database for real-time dashboarding and analysis, or stored in databases such as DynamoDB for end-user query. The raw data can be streamed to Amazon S3 for archiving.
  4. Visualization & Gain insights – Customers can query, visualize, and create alerts using Amazon Managed Service for Grafana. Grafana supports data sources that are storage backends for time series data. To access your data from Timestream, you need to install the Timestream plugin for Grafana. End-users can query data from the DynamoDB table with Amazon API Gateway acting as a proxy.

Refer to Near Real-Time Processing with Amazon Kinesis, Amazon Timestream, and Grafana showcasing a serverless streaming pipeline to process and store device telemetry IoT data into a time series optimized data store such as Amazon Timestream.

Enriching & replaying data in real time for event-sourcing microservices

Microservices are an architectural and organizational approach to software development where software is composed of small independent services that communicate over well-defined APIs. When building event-driven microservices, customers want to achieve 1. high scalability to handle the volume of incoming events and 2. reliability of event processing and maintain system functionality in the face of failures.

Customers utilize microservice architecture patterns to accelerate innovation and time-to-market for new features, because it makes applications easier to scale and faster to develop. However, it is challenging to enrich and replay the data in a network call to another microservice because it can impact the reliability of the application and make it difficult to debug and trace errors. To solve this problem, event-sourcing is an effective design pattern that centralizes historic records of all state changes for enrichment and replay, and decouples read from write workloads. Customers can use Kinesis Data Streams as the centralized event store for event-sourcing microservices, because KDS can 1/ handle gigabytes of data throughput per second per stream and stream the data in milliseconds, to meet the requirement on high scalability and near real-time latency, 2/ integrate with Flink and S3 for data enrichment and achieving while being completely decoupled from the microservices, and 3/ allow retry and asynchronous read in a later time, because KDS retains the data record for a default of 24 hours, and optionally up to 365 days.

The following architectural pattern is a generic illustration of how Kinesis Data Streams can be used for Event-Sourcing Microservices:

The steps in the workflow are as follows:

  1. Data Ingestion and Storage – You can aggregate the input from your microservices to your Kinesis Data Streams for storage.
  2. Stream processing Apache Flink Stateful Functions simplifies building distributed stateful event-driven applications. It can receive the events from an input Kinesis data stream and route the resulting stream to an output data stream. You can create a stateful functions cluster with Apache Flink based on your application business logic.
  3. State snapshot in Amazon S3 – You can store the state snapshot in Amazon S3 for tracking.
  4. Output streams – The output streams can be consumed through Lambda remote functions through HTTP/gRPC protocol through API Gateway.
  5. Lambda remote functions – Lambda functions can act as microservices for various application and business logic to serve business applications and mobile apps.

To learn how other customers built their event-based microservices with Kinesis Data Streams, refer to the following:

Key considerations and best practices

The following are considerations and best practices to keep in mind:

  • Data discovery should be your first step in building modern data streaming applications. You must define the business value and then identify your streaming data sources and user personas to achieve the desired business outcomes.
  • Choose your streaming data ingestion tool based on your steaming data source. For example, you can use the Kinesis SDK for ingesting streaming data through APIs, the Kinesis Producer Library for building high-performance and long-running streaming producers, a Kinesis agent for collecting a set of files and ingesting them into Kinesis Data Streams, AWS DMS for CDC streaming use cases, and AWS IoT Core for ingesting IoT device data into Kinesis Data Streams. You can ingest streaming data directly into Amazon Redshift to build low-latency streaming applications. You can also use third-party libraries like Apache Spark and Apache Kafka to ingest streaming data into Kinesis Data Streams.
  • You need to choose your streaming data processing services based on your specific use case and business requirements. For example, you can use Amazon Kinesis Managed Service for Apache Flink for advanced streaming use cases with multiple streaming destinations and complex stateful stream processing or if you want to monitor business metrics in real time (such as every hour). Lambda is good for event-based and stateless processing. You can use Amazon EMR for streaming data processing to use your favorite open source big data frameworks. AWS Glue is good for near-real-time streaming data processing for use cases such as streaming ETL.
  • Kinesis Data Streams on-demand mode charges by usage and automatically scales up resource capacity, so it’s good for spiky streaming workloads and hands-free maintenance. Provisioned mode charges by capacity and requires proactive capacity management, so it’s good for predictable streaming workloads.
  • You can use the Kinesis Shared Calculator to calculate the number of shards needed for provisioned mode. You don’t need to be concerned about shards with on-demand mode.
  • When granting permissions, you decide who is getting what permissions to which Kinesis Data Streams resources. You enable specific actions that you want to allow on those resources. Therefore, you should grant only the permissions that are required to perform a task. You can also encrypt the data at rest by using a KMS customer managed key (CMK).
  • You can update the retention period via the Kinesis Data Streams console or by using the IncreaseStreamRetentionPeriod and the DecreaseStreamRetentionPeriod operations based on your specific use cases.
  • Kinesis Data Streams supports resharding. The recommended API for this function is UpdateShardCount, which allows you to modify the number of shards in your stream to adapt to changes in the rate of data flow through the stream. The resharding APIs (Split and Merge) are typically used to handle hot shards.


This post demonstrated various architectural patterns for building low-latency streaming applications with Kinesis Data Streams. You can build your own low-latency steaming applications with Kinesis Data Streams using the information in this post.

For detailed architectural patterns, refer to the following resources:

If you want to build a data vision and strategy, check out the AWS Data-Driven Everything (D2E) program.

About the Authors

Raghavarao Sodabathina is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and cloud security. He engages with customers to create innovative solutions that address customer business problems and to accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Hang Zuo is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Shwetha Radhakrishnan is a Solutions Architect for AWS with a focus in Data Analytics. She has been building solutions that drive cloud adoption and help organizations make data-driven decisions within the public sector. Outside of work, she loves dancing, spending time with friends and family, and traveling.

Brittany Ly is a Solutions Architect at AWS. She is focused on helping enterprise customers with their cloud adoption and modernization journey and has an interest in the security and analytics field. Outside of work, she loves to spend time with her dog and play pickleball.

Using Amazon GuardDuty ECS runtime monitoring with Fargate and Amazon EC2

Post Syndicated from Luke Notley original https://aws.amazon.com/blogs/security/using-amazon-guardduty-ecs-runtime-monitoring-with-fargate-and-amazon-ec2/

Containerization technologies such as Docker and orchestration solutions such as Amazon Elastic Container Service (Amazon ECS) are popular with customers due to their portability and scalability advantages. Container runtime monitoring is essential for customers to monitor the health, performance, and security of containers. AWS services such as Amazon GuardDuty, Amazon Inspector, and AWS Security Hub play a crucial role in enhancing container security by providing threat detection, vulnerability assessment, centralized security management, and native Amazon Web Services (AWS) container runtime monitoring.

GuardDuty is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity and delivers detailed security findings for visibility and remediation. GuardDuty analyzes tens of billions of events per minute across multiple AWS data sources and provides runtime monitoring using a GuardDuty security agent for Amazon Elastic Kubernetes Service (Amazon EKS), Amazon ECS and Amazon Elastic Compute Cloud (Amazon EC2) workloads. Findings are available in the GuardDuty console, and by using APIs, a copy of every GuardDuty finding is sent to Amazon EventBridge so that you can incorporate these findings into your operational workflows. GuardDuty findings are also sent to Security Hub helping you to aggregate and corelate GuardDuty findings across accounts and AWS Regions in addition to findings from other security services.

We recently announced the general availability of GuardDuty Runtime Monitoring for Amazon ECS and the public preview of GuardDuty Runtime Monitoring for Amazon EC2 to detect runtime threats from over 30 security findings to protect your AWS Fargate or Amazon EC2 ECS clusters.

In this blog post, we provide an overview of the AWS Shared Responsibility Model and how it’s related to securing your container workloads running on AWS. We look at the steps to configure and use the new GuardDuty Runtime Monitoring for ECS, EC2, and EKS features. If you’re already using GuardDuty EKS Runtime Monitoring, this post provides the steps to migrate to GuardDuty Runtime Monitoring.

AWS Shared Responsibility Model and containers

Understanding the AWS Shared Responsibility Model is important in relation to Amazon ECS workloads. For Amazon ECS, AWS is responsible for the ECS control plane and the underlying infrastructure data plane. When using Amazon ECS on an EC2 instance, you have a greater share of security responsibilities compared to using ECS on Fargate. Specifically, you’re responsible for overseeing the ECS agent and worker node configuration on the EC2 instances.

Figure 1: AWS Shared Responsibility Model – Amazon ECS on EC2

Figure 1: AWS Shared Responsibility Model – Amazon ECS on EC2

In Fargate, each task operates within its dedicated virtual machine (VM), and there’s no sharing of the operating system or kernel resources between tasks. With Fargate, AWS is responsible for the security of the underlying instance in the cloud and the runtime used to run your tasks.

Figure 2: AWS Shared Responsibility Model – Amazon ECS on Fargate

Figure 2: AWS Shared Responsibility Model – Amazon ECS on Fargate

When deploying container runtime images, your responsibilities include configuring applications, ensuring container security, and applying best practices for task runtime security. These best practices help to limit adversaries from expanding their influence beyond the confines of the local container process.

Amazon GuardDuty Runtime Monitoring consolidation

With the new feature launch, EKS Runtime Monitoring has now been consolidated into GuardDuty Runtime Monitoring. With this consolidation, you can manage the configuration for your AWS accounts one time instead of having to manage the Runtime Monitoring configuration separately for each resource type (EC2 instance, ECS cluster, or EKS cluster). A view of each Region is provided so you can enable Runtime Monitoring and manage GuardDuty security agents across each resource type because they now share a common value of either enabled or disabled.

Note: The GuardDuty security agent still must be configured for each supported resource type.

Figure 3: GuardDuty Runtime Monitoring overview

Figure 3: GuardDuty Runtime Monitoring overview

In the following sections, we walk you through how to enable GuardDuty Runtime Monitoring and how you can reconfigure your existing EKS Runtime Monitoring deployment. We also cover how you can enable monitoring for ECS Fargate and EC2 resource types.

If you were using EKS Runtime Monitoring prior to this feature release, you will notice some configuration options in the updated AWS Management Console for GuardDuty. It’s recommended that you enable Runtime Monitoring for each AWS account; to do this, follow these steps:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab and then choose Edit.
  3. Under Runtime Monitoring, select Enable for all accounts.
  4. Under Automated agent configuration – Amazon EKS, ensure Enable for all accounts is selected.
Figure 4: Edit GuardDuty Runtime Monitoring configuration

Figure 4: Edit GuardDuty Runtime Monitoring configuration

If you want to continue using EKS Runtime Monitoring without enabling GuardDuty ECS Runtime Monitoring or if the Runtime Monitoring protection plan isn’t yet available in your Region, you can configure EKS Runtime Monitoring using the AWS Command Line Interface (AWS CLI) or API. For more information on this migration, see Migrating from EKS Runtime Monitoring to GuardDuty Runtime Monitoring.

Amazon GuardDuty ECS Runtime Monitoring for Fargate

For ECS using a Fargate capacity provider, GuardDuty deploys the security agent as a sidecar container alongside the essential task container. This doesn’t require you to make changes to the deployment of your Fargate tasks and verifies that new tasks will have GuardDuty Runtime Monitoring. If the GuardDuty security agent sidecar container is unable to launch in a healthy state, the ECS Fargate task will not be prevented from running.

When using GuardDuty ECS Runtime Monitoring for Fargate, you can install the agent on Amazon ECS Fargate clusters within an AWS account or only on selected clusters. In the following sections, we show you how to enable the service and provision the agents.


If you haven’t activated GuardDuty, learn more about the free trial and pricing and follow the steps in Getting started with GuardDuty to set up the service and start monitoring your account. Alternatively, you can activate GuardDuty by using the AWS CLI. The minimum Fargate environment version and container operating systems supported can be found in the Prerequisites for AWS Fargate (Amazon ECS only) support. The AWS Identity and Access Management (IAM) role used for running an Amazon ECS task must be provided with access to Amazon ECR with the appropriate permissions to download the GuardDuty sidecar container. To learn more about Amazon ECR repositories that host the GuardDuty agent for AWS Fargate, see Repository for GuardDuty agent on AWS Fargate (Amazon ECS only).

Enable Fargate Runtime Monitoring

To enable GuardDuty Runtime Monitoring for ECS Fargate, follow these steps:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab and then in the AWS Fargate (ECS only) section, choose Enable.
Figure 5: GuardDuty Runtime Monitoring configuration

Figure 5: GuardDuty Runtime Monitoring configuration

If your AWS account is managed within AWS Organizations and you’re running ECS Fargate clusters in multiple AWS accounts, only the GuardDuty delegated administrator account can enable or disable GuardDuty ECS Runtime Monitoring for the member accounts. GuardDuty is a regional service and must be enabled within each desired Region. If you’re using multiple accounts and want to centrally manage GuardDuty see Managing multiple accounts in Amazon GuardDuty.

You can use the same process to enable GuardDuty ECS Runtime Monitoring and manage the GuardDuty security agent. It’s recommended to enable GuardDuty ECS Runtime Monitoring automatically for member accounts within your organization.

To automatically enable GuardDuty Runtime Monitoring for ECS Fargate new accounts:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab, and then choose Edit.
  3. Under Runtime Monitoring, ensure Enable for all accounts is selected.
  4. Under Automated agent configuration – AWS Fargate (ECS only), select Enable for all accounts, then choose Save.
Figure 6: Enable ECS GuardDuty Runtime Monitoring for AWS accounts

Figure 6: Enable ECS GuardDuty Runtime Monitoring for AWS accounts

After you enable GuardDuty ECS Runtime Monitoring for Fargate, GuardDuty can start monitoring and analyzing the runtime activity events for ECS tasks in your account. GuardDuty automatically creates a virtual private cloud (VPC) endpoint in your AWS account in the VPCs where you’re deploying your Fargate tasks. The VPC endpoint is used by the GuardDuty agent to send telemetry and configuration data back to the GuardDuty service API. For GuardDuty to receive the runtime events for your ECS Fargate clusters, you can choose one of three approaches to deploy the fully managed security agent:

  • Monitor existing and new ECS Fargate clusters
  • Monitor existing and new ECS Fargate clusters and exclude selective ECS Fargate clusters
  • Monitor selective ECS Fargate clusters

It’s recommended to monitor each ECS Fargate cluster and then exclude clusters on an as-needed basis. To learn more, see Configure GuardDuty ECS Runtime Monitoring.

Monitor all ECS Fargate clusters

Use this method when you want GuardDuty to automatically deploy and manage the security agent across each ECS Fargate cluster within your account. GuardDuty will automatically install the security agent when new ECS Fargate clusters are created.

To enable GuardDuty Runtime Monitoring for ECS Fargate across each ECS cluster:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab.
  3. Under the Automated agent configuration for AWS Fargate (ECS only), select Enable.
Figure 7: Enable GuardDuty Runtime Monitoring for ECS clusters

Figure 7: Enable GuardDuty Runtime Monitoring for ECS clusters

Monitor all ECS Fargate clusters and exclude selected ECS Fargate clusters

GuardDuty automatically installs the security agent on each ECS Fargate cluster. To exclude an ECS Fargate cluster from GuardDuty Runtime Monitoring, you can use the key-value pair GuardDutyManaged:false as a tag. Add this exclusion tag to your ECS Fargate cluster either before enabling Runtime Monitoring or during cluster creation to prevent automatic GuardDuty monitoring.

To add an exclusion tag to an ECS cluster:

  1. In the Amazon ECS console, in the navigation pane under Clusters, select the cluster name.
  2. Select the Tags tab.
  3. Select Manage Tags and enter the key GuardDutyManaged and value false, then choose Save.
Figure 8: GuardDuty Runtime Monitoring ECS cluster exclusion tags

Figure 8: GuardDuty Runtime Monitoring ECS cluster exclusion tags

To make sure that these tags aren’t modified, you can prevent tags from being modified except by authorized principals.

Monitor selected ECS Fargate clusters

You can monitor selected ECS Fargate clusters when you want GuardDuty to handle the deployment and updates of the security agent exclusively for specific ECS Fargate clusters within your account. This could be a use case where you want to evaluate GuardDuty ECS Runtime Monitoring for Fargate. By using inclusion tags, GuardDuty automatically deploys and manages the security agent only for the ECS Fargate clusters that are tagged with the key-value pair GuardDutyManaged:true. To use inclusion tags, verify that the automated agent configuration for AWS Fargate (ECS) hasn’t been enabled.

To add an inclusion tag to an ECS cluster:

  1. In the Amazon ECS console, in the navigation pane under Clusters, select the cluster name.
  2. Select the Tags tab.
  3. Select Manage Tags and enter the key GuardDutyManaged and value true, then choose Save.
Figure 9: GuardDuty inclusion tags

Figure 9: GuardDuty inclusion tags

To make sure that these tags aren’t modified, you can prevent tags from being modified except by authorized principals.

Fargate task level rollout

After you’re enabled GuardDuty ECS Runtime Monitoring for Fargate, newly launched tasks will include the GuardDuty agent sidecar container. For pre-existing long running tasks, you might want to consider a targeted deployment for task refresh to activate the GuardDuty sidecar security container. This can be achieved using either a rolling update (ECS deployment type) or a blue/green deployment with AWS CodeDeploy.

To verify the GuardDuty agent is running for a task, you can check for an additional container prefixed with aws-guardduty-agent-. Successful deployment will change the container’s status to Running.

To view the GuardDuty agent container running as part of your ECS task:

  1. In the Amazon ECS console, in the navigation pane under Clusters, select the cluster name.
  2. Select the Tasks tab.
  3. Select the Task GUID you want to review.
  4. Under the Containers section, you can view the GuardDuty agent container.
Figure 10: View status of the GuardDuty sidecar container

Figure 10: View status of the GuardDuty sidecar container

GuardDuty ECS on Fargate coverage monitoring

Coverage status of your ECS Fargate clusters is evaluated regularly and can be classified as either healthy or unhealthy. An unhealthy cluster signals a configuration issue, and you can find more details in the GuardDuty Runtime Monitoring notifications section. When you enable GuardDuty ECS Runtime Monitoring and deploy the security agent in your clusters, you can view the coverage status of new ECS Fargate clusters and tasks in the GuardDuty console.

To view coverage status:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Runtime coverage tab, and then select ECS clusters runtime coverage.
Figure 11: GuardDuty Runtime ECS coverage status overview

Figure 11: GuardDuty Runtime ECS coverage status overview

Troubleshooting steps for cluster coverage issues such as clusters reporting as unhealthy and a sample notification schema are available at Coverage for Fargate (Amazon ECS only) resource. More information regarding monitoring can be found in the next section.

Amazon GuardDuty Runtime Monitoring for EC2

Amazon EC2 Runtime Monitoring in GuardDuty helps you provide threat detection for Amazon EC2 instances and supports Amazon ECS managed EC2 instances. The GuardDuty security agent, which GuardDuty uses to send telemetry and configuration data back to the GuardDuty service API, is required to be installed onto each EC2 instance.


If you haven’t activated Amazon GuardDuty, learn more about the free trial and pricing and follow the steps in Getting started with GuardDuty to set up the service and start monitoring your account. Alternatively, you can activate GuardDuty by using the AWS CLI.

To use Amazon EC2 Runtime Monitoring to monitor your ECS container instances, your operating environment must meet the prerequisites for EC2 instance support and the GuardDuty security agent must be installed manually onto the EC2 instances you want to monitor. GuardDuty Runtime Monitoring for EC2 requires you to create the Amazon VPC endpoint manually. If the VPC already has the GuardDuty VPC endpoint created from a previous deployment, you don’t need to create the VPC endpoint again.

If you plan to deploy the agent to Amazon EC2 instances using AWS Systems Manager, an Amazon owned Systems Manager document named AmazonGuardDuty-ConfigureRuntimeMonitoringSsmPlugin is available for use. Alternatively, you can use RPM installation scripts whether or not your Amazon ECS instances are managed by AWS Systems Manager.

Enable GuardDuty Runtime Monitoring for EC2

GuardDuty Runtime Monitoring for EC2 is automatically enabled when you enable GuardDuty Runtime Monitoring.

To enable GuardDuty Runtime Monitoring:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab, and then in the Runtime Monitoring section, choose Enable.
Figure 12: Enable GuardDuty runtime monitoring

Figure 12: Enable GuardDuty runtime monitoring

After the prerequisites have been met and you enable GuardDuty Runtime Monitoring, GuardDuty starts monitoring and analyzing the runtime activity events for the EC2 instances.

If your AWS account is managed within AWS Organizations and you’re running ECS on EC2 clusters in multiple AWS accounts, only the GuardDuty delegated administrator can enable or disable GuardDuty ECS Runtime Monitoring for the member accounts. If you’re using multiple accounts and want to centrally manage GuardDuty, see Managing multiple accounts in Amazon GuardDuty.

GuardDuty EC2 coverage monitoring

When you enable GuardDuty Runtime Monitoring and deploy the security agent on your Amazon EC2 instances, you can view the coverage status of the instances.

To view EC2 instance coverage status:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Runtime coverage tab, and then select EC2 instance runtime coverage.
Figure 13: GuardDuty Runtime Monitoring coverage for EC2 overview

Figure 13: GuardDuty Runtime Monitoring coverage for EC2 overview

Cluster coverage status notifications can be configured using the notification schema available under Configuring coverage status change notifications. More information regarding monitoring can be found in the following section.

GuardDuty Runtime Monitoring notifications

If the coverage status of your ECS cluster or EC2 instance becomes unhealthy, there are a number of recommended troubleshooting steps that you can follow.

To stay informed about changes in the coverage status of an ECS cluster or EC2 instance, it’s recommended that you set up status change notifications. Because GuardDuty publishes these status changes on the EventBridge bus associated with your AWS account, you can do this by setting up an Amazon EventBridge rule to receive notifications.

In the following example AWS CloudFormation template, you can use an EventBridge rule to send notifications to Amazon Simple Notification Service (Amazon SNS) and subscribe to the SNS topic using email.

AWSTemplateFormatVersion: "2010-09-09"
Description: CloudFormation template for Amazon EventBridge rules to monitor Healthy/Unhealthy status of GuardDuty Runtime Monitoring coverage status. This template creates the EventBridge and Amazon SNS topics to be notified via email on state change of security agents
    Description: a simple naming convention for the SNS & EventBridge rules
    Type: String
    Default: GuardDuty-Runtime-Agent-Status
    MinLength: 1
    MaxLength: 50
    AllowedPattern: ^[a-zA-Z0-9\-_]*$
    ConstraintDescription: Maximum 50 characters of numbers, lower/upper case letters, -,_.
    Type: String
    Description: Email address to notify if there are security agent status state changes
    AllowedPattern: "([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)"
    ConstraintDescription: must be a valid email address.
    Type: AWS::Events::Rule
      EventBusName: default
          - aws.guardduty
          - GuardDuty Runtime Protection Unhealthy
      Name: !Join [ '-', [ 'Rule', !Ref namePrefix, 'Unhealthy' ] ]
      State: ENABLED
        - Id: "GDUnhealthyTopic"
          Arn: !Ref notificationTopicUnhealthy
    Type: AWS::Events::Rule
      EventBusName: default
          - aws.guardduty
          - GuardDuty Runtime Protection Healthy
      Name: !Join [ '-', [ 'Rule', !Ref namePrefix, 'Healthy' ] ]
      State: ENABLED
        - Id: "GDHealthyTopic"
          Arn: !Ref notificationTopicHealthy
    Type: 'AWS::SNS::TopicPolicy'
          - Effect: Allow
              Service: events.amazonaws.com
            Action: 'sns:Publish'
            Resource: '*'
        - !Ref notificationTopicHealthy
        - !Ref notificationTopicUnhealthy
    Type: AWS::SNS::Topic
      TopicName: !Join [ '-', [ 'Topic', !Ref namePrefix, 'Healthy' ] ]
      DisplayName: GD-Healthy-State
      - Endpoint:
          Ref: operatorEmail
        Protocol: email
    Type: AWS::SNS::Topic
      TopicName: !Join [ '-', [ 'Topic', !Ref namePrefix, 'Unhealthy' ] ]
      DisplayName: GD-Unhealthy-State
      - Endpoint:
          Ref: operatorEmail
        Protocol: email

GuardDuty findings

When GuardDuty detects a potential threat and generates a security finding, you can view the details of the corresponding finding. The GuardDuty agent collects kernel-space and user-space events from the hosts and the containers. See Finding types for detailed information and recommended remediation activities regarding each finding type. You can generate sample GuardDuty Runtime Monitoring findings using the GuardDuty console or you can use this GitHub script to generate some basic detections within GuardDuty.

Example ECS findings

GuardDuty security findings can indicate either a compromised container workload or ECS cluster or a set of compromised credentials in your AWS environment.

To view a full description and remediation recommendations regarding a finding:

  1. In the GuardDuty console, in the navigation pane, select Findings.
  2. Select a finding in the navigation pane, and then choose the Info hyperlink.
Figure 14: GuardDuty example finding

Figure 14: GuardDuty example finding

The ResourceType for an ECS Fargate finding could be an ECS cluster or container. If the resource type in the finding details is ECSCluster, it indicates that either a task or a container inside an ECS Fargate cluster is potentially compromised. You can identify the Name and Amazon Resource Name (ARN) of the ECS cluster paired with the task ARN and task Definition ARN details in the cluster.

To view affected resources, ECS cluster details, task details and instance details regarding a finding:

  1. In the GuardDuty console, in the navigation pane, select Findings.
  2. Select a finding related to an ECS cluster in the navigation pane and then scroll down in the right-hand pane to view the different section headings.
Figure 15: GuardDuty finding details for Fargate

Figure 15: GuardDuty finding details for Fargate

The Action and Runtime details provide information about the potentially suspicious activity. The example finding in Figure 16 tells you that the listed ECS container in your environment is querying a domain that is associated with Bitcoin or other cryptocurrency-related activity. This can lead to threat actors attempting to take control over the compute resource to repurpose it for unauthorized cryptocurrency mining.

Figure 16: GuardDuty ECS example finding with action and process details

Figure 16: GuardDuty ECS example finding with action and process details

Example ECS on EC2 findings

When a finding is generated from EC2, additional information is shown including the instance details, IAM profile details, and instance tags (as shown in Figure 17), which can be used to help identify the affected EC2 instance.

Figure 17: GuardDuty EC2 instance details for a finding

Figure 17: GuardDuty EC2 instance details for a finding

This additional instance-level information can help you focus your remediation efforts.

GuardDuty finding remediation

When you’re actively monitoring the runtime behavior of containers within your tasks and GuardDuty identifies potential security issues within your AWS environment, you should consider taking the following suggested remediation actions. This helps to address potential security issues and to contain the potential threat in your AWS account.

  1. Identify the potentially impacted Amazon ECS Cluster – The runtime monitoring finding provides the potentially impacted Amazon ECS cluster details in the finding details panel.
  2. Evaluate the source of potential compromise – Evaluate if the detected finding was in the container’s image. If the resource was in the container image, identify all other tasks that are using this image and evaluate the source of the image.
  3. Isolate the impacted tasks – To isolate the affected tasks, restrict both incoming and outgoing traffic to the tasks by implementing VPC network rules that deny all traffic. This approach can be effective in halting an ongoing attack by cutting off all connections to the affected tasks. Be aware that terminating the tasks could eliminate crucial evidence related to the finding that you might need for further analysis.If the task’s container has accessed the underlying Amazon EC2 host, its associated instance credentials might have been compromised. For more information, see Remediating compromised AWS credentials.

Each GuardDuty Runtime Monitoring finding provides specific prescriptive guidance regarding finding remediation. Within each finding, you can choose the Remediating Runtime Monitoring findings link for more information.

To view the recommended remediation actions:

  1. In the GuardDuty console, in the navigation pane, select Findings.
  2. Select a finding in the navigation pane and then choose the Info hyperlink and scroll down in the right-hand pane to view the remediation recommendations section.
Figure 18: GuardDuty Runtime Monitoring finding remediation

Figure 18: GuardDuty Runtime Monitoring finding remediation


You can now use Amazon GuardDuty for ECS Runtime Monitoring to monitor your Fargate and EC2 workloads. For a full list of Regions where ECS Runtime Monitoring is available, see Region-specific feature availability.

It’s recommended that you asses your container application using the AWS Well-Architected Tool to ensure adherence to best practices. The recently launched AWS Well-Architected Amazon ECS Lens offers a specialized assessment for container-based operations and troubleshooting of Amazon ECS applications, aligning with the ECS best practices guide. You can integrate this lens into the AWS Well-Architected Tool available in the console.

For more information regarding security monitoring and threat detection, visit the AWS Online Tech Talks. For hands-on experience and learn more regarding AWS security services, visit our AWS Activation Days website to find a workshop in your Region.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Luke Notley

Luke Notley

Luke is a Senior Solutions Architect with Amazon Web Services and is based in Western Australia. Luke has a passion for helping customers connect business outcomes with technology and assisting customers throughout their cloud journey, helping them design scalable, flexible, and resilient architectures. In his spare time, he enjoys traveling, coaching basketball teams, and DJing.

Arran Peterson

Arran Peterson

Arran, a Solutions Architect based in Adelaide, South Australia, collaborates closely with customers to deeply understand their distinct business needs and goals. His role extends to assisting customers in recognizing both the opportunities and risks linked to their decisions related to cloud solutions.

Access AWS using a Google Cloud Platform native workload identity

Post Syndicated from Simran Singh original https://aws.amazon.com/blogs/security/access-aws-using-a-google-cloud-platform-native-workload-identity/

Organizations undergoing cloud migrations and business transformations often find themselves managing IT operations in hybrid or multicloud environments. This can make it more complex to safeguard workloads, applications, and data, and to securely handle identities and permissions across Amazon Web Services (AWS), hybrid, and multicloud setups.

In this post, we show you how to assume an AWS Identity and Access Management (IAM) role in your AWS accounts to securely issue temporary credentials for applications that run on the Google Cloud Platform (GCP). We also present best practices and key considerations in this authentication flow. Furthermore, this post provides references to supplementary GCP documentation that offer additional context and provide steps relevant to setup on GCP.

Access control across security realms

As your multicloud environment grows, managing access controls across providers becomes more complex. By implementing the right access controls from the beginning, you can help scale your cloud operations effectively without compromising security. When you deploy apps across multiple cloud providers, you should implement a homogenous and consistent authentication and authorization mechanism across both cloud environments, to help maintain a secure and cost-effective environment. In the following sections, you’ll learn how to enforce such objectives across AWS and workloads hosted on GCP, as shown in Figure 1.

Figure 1: Authentication flow between GCP and AWS

Figure 1: Authentication flow between GCP and AWS


To follow along with this walkthrough, complete the following prerequisites.

  1. Create a service account in GCP. Resources in GCP use service accounts to make API calls. When you create a GCP resource, such as a compute engine instance in GCP, a default service account gets created automatically. Although you can use this default service account in the solution described in this post, we recommend that you create a dedicated user-managed service account, because you can control what permissions to assign to the service account within GCP.

    To learn more about best practices for service accounts, see Best practices for using service accounts in the Google documentation. In this post, we use a GCP virtual machine (VM) instance for demonstration purposes. To attach service accounts to other GCP resources, see Attach service accounts to resources.

  2. Create a VM instance in GCP and attach the service account that you created in Step 1. Resources in GCP store their metadata information in a metadata server, and you can request an instance’s identity token from the server. You will use this identity token in the authentication flow later in this post.
  3. Install the AWS Command Line Interface (AWS CLI) on the GCP VM instance that you created in Step 2.
  4. Install jq and curl.

GCP VM identity authentication flow

Obtaining temporary AWS credentials for workloads that run on GCP is a multi-step process. In this flow, you use the identity token from the GCP compute engine metadata server to call the AssumeRoleWithWebIdentity API to request AWS temporary credentials. This flow gives your application greater flexibility to request credentials for an IAM role that you have configured with a sufficient trust policy, and the corresponding Amazon Resource Name (ARN) for the IAM role must be known to the application.

Define an IAM role on AWS

Because AWS already supports OpenID Connect (OIDC) federation, you can use the OIDC token provided in GCP as described in Step 2 of the Prerequisites, and you don’t need to create a separate OIDC provider in your AWS account. Instead, to create an IAM role for OIDC federation, follow the steps in Creating a role for web identity or OpenID Connect Federation (console). Using an OIDC principal without a condition can be overly permissive. To make sure that only the intended identity provider assumes the role, you need to provide a StringEquals condition in the trust policy for this IAM role. Add the condition keys accounts.google.com:aud, accounts.google.com:oaud, and accounts.google.com:sub to the role’s trust policy, as shown in the following.

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {"Federated": "accounts.google.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "accounts.google.com:aud": "<azp-value>",
                    "accounts.google.com:oaud": "<aud-value>",
                    "accounts.google.com:sub": "<sub-value>"

Make sure to replace the <placeholder values> with your values from the Google ID Token. The ID token issued for the service accounts has the azp (AUTHORIZED_PARTY) field set, so condition keys are mapped to the Google ID Token fields as follows:

  • accounts.google.com:oaud condition key matches the aud (AUDIENCE) field on the Google ID token.
  • accounts.google.com:aud condition key matches the azp (AUTHORIZED_PARTY) field on the Google ID token.
  • accounts.google.com:sub condition key matches the sub (SUBJECT) field on the Google ID token.

For more information about the Google aud and azp fields, see the Google Identity Platform OpenID Connect guide.

Authentication flow

The authentication flow for the scenario is shown in Figure 2.

Figure 2: Detailed authentication flow with AssumeRoleWithWebIdentity API

Figure 2: Detailed authentication flow with AssumeRoleWithWebIdentity API

The authentication flow has the following steps:

  1. On AWS, you can source external credentials by configuring the credential_process setting in the config file. For the syntax and operating system requirements, see Source credentials with an external process. For this post, we have created a custom profile TeamA-S3ReadOnlyAccess as follows in the config file:
    [profile TeamA-S3ReadOnlyAccess]
    credential_process = /opt/bin/credentials.sh

    To use different settings, you can create and reference additional profiles.

  2. Specify a program or a script that credential_process will invoke. For this post, credential_process invokes the script /opt/bin/credentials.sh which has the following code. Make sure to replace <111122223333> with your own account ID.
    jwt_token=$(curl -sH "Metadata-Flavor: Google" "http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=${AUDIENCE}&format=full&licenses=FALSE")
    jwt_sub=$(jq -R 'split(".") | .[1] | @base64d | fromjson' <<< "$jwt_token" | jq -r '.sub')
    credentials=$(aws sts assume-role-with-web-identity --role-arn $ROLE_ARN --role-session-name $jwt_sub --web-identity-token $jwt_token | jq '.Credentials' | jq '.Version=1')
    echo $credentials

    The script performs the following steps:

    1. Google generates a new unique instance identity token in the JSON Web Token (JWT) format.
      jwt_token=$(curl -sH "Metadata-Flavor: Google" "http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=${AUDIENCE}&format=full&licenses=FALSE")

      The payload of the token includes several details about the instance and the audience URI, as shown in the following.

         "iss": "[TOKEN_ISSUER]",
         "iat": [ISSUED_TIME],
         "exp": [EXPIRED_TIME],
         "aud": "[AUDIENCE]",
         "sub": "[SUBJECT]",
         "azp": "[AUTHORIZED_PARTY]",
         "google": {
          "compute_engine": {
            "project_id": "[PROJECT_ID]",
            "project_number": [PROJECT_NUMBER],
            "zone": "[ZONE]",
            "instance_id": "[INSTANCE_ID]",
            "instance_name": "[INSTANCE_NAME]",
            "instance_creation_timestamp": [CREATION_TIMESTAMP],
            "instance_confidentiality": [INSTANCE_CONFIDENTIALITY],
            "license_id": [

      The IAM trust policy uses the aud (AUDIENCE), azp (AUTHORIZED_PARTY) and sub (SUBJECT) values from the JWT token to help ensure that the IAM role defined in the section Define an IAM role in AWS can be assumed only by the intended GCP service account.

    2. The script invokes the AssumeRoleWithWebIdentity API call, passing in the identity token from the previous step and specifying which IAM role to assume. The script uses the Identity subject claim as the session name, which can facilitate auditing or forensic operations on this AssumeRoleWithWebIdentity API call. AWS verifies the authenticity of the token before returning temporary credentials. In addition, you can verify the token in your credential program by using the process described at Obtaining the instance identity token.

      The script then returns the temporary credentials to the credential_process as the JSON output on STDOUT; we used jq to parse the output in the desired JSON format.

      jwt_sub=$(jq -R 'split(".") | .[1] | @base64d | fromjson' <<< "$jwt_token" | jq -r '.sub')
      credentials=$(aws sts assume-role-with-web-identity --role-arn $ROLE_ARN --role-session-name $jwt_sub --web-identity-token $jwt_token | jq '.Credentials' | jq '.Version=1')
      echo $credentials

    The following is an example of temporary credentials returned by the credential_process script:

      "Version": 1,
      "AccessKeyId": "AKIAIOSFODNN7EXAMPLE",
      "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "SessionToken": "FwoGZXIvYXdzEBUaDOSY+1zJwXi29+/reyLSASRJwSogY/Kx7NomtkCoSJyipWuu6sbDIwFEYtZqg9knuQQyJa9fP68/LCv4jH/efuo1WbMpjh4RZpbVCOQx/zggZTyk2H5sFvpVRUoCO4dc7eqftMhdKtcq67vAUljmcDkC9l0Fei5tJBvVpQ7jzsYeduX/5VM6uReJaSMeOXnIJnQZce6PI3GBiLfaX7Co4o216oS8yLNusTK1rrrwrY2g5e3Zuh1oXp/Q8niFy2FSLN62QHfniDWGO8rCEV9ZnZX0xc4ZN68wBc1N24wKgT+xfCjamcCnBjJYHI2rEtJdkE6bRQc2WAUtccsQk5u83vWae+SpB9ycE/dzfXurqcjCP0urAp4k9aFZFsRIGfLAI1cOABX6CzF30qrcEBnEXAMPLESESSIONTOKEN==",
      "Expiration": "2023-08-31T04:45:30Z"

Note that AWS SDKs store the returned AWS credentials in memory when they call credential_process. AWS SDKs keep track of the credential expiration and generate new AWS session credentials through the credential process. In contrast, the AWS CLI doesn’t cache external process credentials; instead, the AWS CLI calls the credential_process for every CLI request, which creates a new role session and could result in slight delays when you run commands.

Test access in the AWS CLI

After you configure the config file for the credential_process, verify your setup by running the following command.

aws sts get-caller-identity --profile TeamA-S3ReadOnlyAccess

The output will look similar to the following.

   "UserId":"AIDACKCEVSQ6C2EXAMPLE:[Identity subject claim]",
   "Arn":"arn:aws:iam::111122223333:role/RoleForAccessFromGCPTeamA:[Identity subject claim]"

Amazon CloudTrail logs the AssumeRoleWithWebIdentity API call, as shown in Figure 3. The log captures the audience in the identity token as well as the IAM role that is being assumed. It also captures the session name with a reference to the Identity subject claim, which can help simplify auditing or forensic operations on this AssumeRoleWithWebIdentity API call.

Figure 3: CloudTrail event for AssumeRoleWithWebIdentity API call from GCP VM

Figure 3: CloudTrail event for AssumeRoleWithWebIdentity API call from GCP VM

Test access in the AWS SDK

The next step is to test access in the AWS SDK. The following Python program shows how you can refer to the custom profile configured for the credential process.

import boto3

session = boto3.Session(profile_name='TeamA-S3ReadOnlyAccess')
client = session.client('s3')

response = client.list_buckets()
for _bucket in response['Buckets']:

Before you run this program, run pip install boto3. Create an IAM role that has the AmazonS3ReadOnlyAccess policy attached to it. This program prints the names of the existing S3 buckets in your account. For example, if your AWS account has two S3 buckets named DOC-EXAMPLE-BUCKET1 and DOC-EXAMPLE-BUCKET2, then the output of the preceding program shows the following:


If you don’t have an existing S3 bucket, then create an S3 bucket before you run the preceding program.

The list_bucket API call is also logged in CloudTrail, capturing the identity and source of the calling application, as shown in Figure 4.

Figure 4: CloudTrail event for S3 API call made with federated identity session

Figure 4: CloudTrail event for S3 API call made with federated identity session

Clean up

If you don’t need to further use the resources that you created for this walkthrough, delete them to avoid future charges for the deployed resources:

  • Delete the VM instance and service account created in GCP.
  • Delete the resources that you provisioned on AWS to test the solution.


In this post, you learned how to exchange the identity token of a virtual machine running on a GCP compute engine to assume a role on AWS, so that you can seamlessly and securely access AWS resources from GCP hosted workloads.

We walked you through the steps required to set up the credential process and shared best practices to consider in this authentication flow. You can also apply the same pattern to workloads deployed on GCP functions or Google Kubernetes Engine (GKE) when they request access to AWS resources.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Simran Singh

Simran Singh

Simran is a Senior Solutions Architect at AWS. In this role, he assists large enterprise customers in meeting their key business objectives on AWS. His areas of expertise include artificial intelligence/machine learning, security, and improving the experience of developers building on AWS. He has also earned a coveted golden jacket for achieving all currently offered AWS certifications.

Rashmi Iyer

Rashmi Iyer

Rashmi is a Solutions Architect at AWS supporting financial services enterprises. She helps customers build secure, resilient, and scalable architectures on AWS while adhering to architectural best practices. Before joining AWS, Rashmi worked for over a decade to architect and design complex telecom solutions in the packet core domain.

New Amazon CloudWatch log class to cost-effectively scale your AWS Glue workloads

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/new-amazon-cloudwatch-log-class-to-cost-effectively-scale-your-aws-glue-workloads/

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores.

One of the most common questions we get from customers is how to effectively optimize costs on AWS Glue. Over the years, we have built multiple features and tools to help customers manage their AWS Glue costs. For example, AWS Glue Auto Scaling and AWS Glue Flex can help you reduce the compute cost associated with processing your data. AWS Glue interactive sessions and notebooks can help you reduce the cost of developing your ETL jobs. For more information about cost-saving best practices, refer to Monitor and optimize cost on AWS Glue for Apache Spark. Additionally, to understand data transfer costs, refer to the Cost Optimization Pillar defined in AWS Well-Architected Framework. For data storage, you can apply general best practices defined for each data source. For a cost optimization strategy using Amazon Simple Storage Service (Amazon S3), refer to Optimizing storage costs using Amazon S3.

In this post, we tackle the remaining piece—the cost of logs written by AWS Glue.

Before we get into the cost analysis of logs, let’s understand the reasons to enable logging for your AWS Glue job and the current options available. When you start an AWS Glue job, it sends the real-time logging information to Amazon CloudWatch (every 5 seconds and before each executor stops) during the Spark application starts running. You can view the logs on the AWS Glue console or the CloudWatch console dashboard. These logs provide you with insights into your job runs and help you optimize and troubleshoot your AWS Glue jobs. AWS Glue offers a variety of filters and settings to reduce the verbosity of your logs. As the number of job runs increases, so does the volume of logs generated.

To optimize CloudWatch Logs costs, AWS recently announced a new log class for infrequently accessed logs called Amazon CloudWatch Logs Infrequent Access (Logs IA). This new log class offers a tailored set of capabilities at a lower cost for infrequently accessed logs, enabling you to consolidate all your logs in one place in a cost-effective manner. This class provides a more cost-effective option for ingesting logs that only need to be accessed occasionally for auditing or debugging purposes.

In this post, we explain what the Logs IA class is, how it can help reduce costs compared to the standard log class, and how to configure your AWS Glue resources to use this new log class. By routing logs to Logs IA, you can achieve significant savings in your CloudWatch Logs spend without sacrificing access to important debugging information when you need it.

CloudWatch log groups used by AWS Glue job continuous logging

When continuous logging is enabled, AWS Glue for Apache Spark writes Spark driver/executor logs and progress bar information into the following log group:


If a security configuration is enabled for CloudWatch logs, AWS Glue for Apache Spark will create a log group named as follows for continuous logs:


The default and custom log groups will be as follows:

  • The default continuous log group will be /aws-glue/jobs/logs-v2-<Security-Configuration-Name>
  • The custom continuous log group will be <custom-log-group-name>-<Security-Configuration-Name>

You can provide a custom log group name through the job parameter –continuous-log-logGroup.

Getting started with the new Infrequent Access log class for AWS Glue workload

To gain the benefits from Logs IA for your AWS Glue workloads, you need to complete the following two steps:

  1. Create a new log group using the new Log IA class.
  2. Configure your AWS Glue job to point to the new log group

Complete the following steps to create a new log group using the new Infrequent Access log class:

  1. On the CloudWatch console, choose Log groups under Logs in the navigation pane.
  2. Choose Create log group.
  3. For Log group name, enter /aws-glue/jobs/logs-v2-infrequent-access.
  4. For Log class, choose Infrequent Access.
  5. Choose Create.

Complete the following steps to configure your AWS Glue job to point to the new log group:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose your job.
  3. On the Job details tab, choose Add new parameter under Job parameters.
  4. For Key, enter --continuous-log-logGroup.
  5. For Value, enter /aws-glue/jobs/logs-v2-infrequent-access.
  6. Choose Save.
  7. Choose Run to trigger the job.

New log events are written into the new log group.

View the logs with the Infrequent Access log class

Now you’re ready to view the logs with the Infrequent Access log class. Open the log group /aws-glue/jobs/logs-v2-infrequent-access on the CloudWatch console.

When you choose one of the log streams, you will notice that it redirects you to the CloudWatch console Logs Insight page with a pre-configured default command and your log stream selected by default. By choosing Run query, you can view the actual log events on the Logs Insights page.


Keep in mind the following considerations:

  • You cannot change the log class of a log group after it’s created. You need to create a new log group to configure the Infrequent Access class.
  • The Logs IA class offers a subset of CloudWatch Logs capabilities, including managed ingestion, storage, cross-account log analytics, and encryption with a lower ingestion price per GB. For example, you can’t view log events through the standard CloudWatch Logs console. To learn more about the features offered across both log classes, refer to Log Classes.


This post provided step-by-step instructions to guide you through enabling Logs IA for your AWS Glue job logs. If your AWS Glue ETL jobs generate large volumes of log data that makes it a challenge as you scale your applications, the best practices demonstrated in this post can help you cost-effectively scale while centralizing all your logs in CloudWatch Logs. Start using the Infrequent Access class with your AWS Glue workloads today and enjoy the cost benefits.

About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Abeetha Bala is a Senior Product Manager for Amazon CloudWatch, primarily focused on logs. Being customer obsessed, she solves observability challenges through innovative and cost-effective ways.

Kinshuk Pahare is a leader in AWS Glue’s product management team. He drives efforts on the platform, developer experience, and big data processing frameworks like Apache Spark, Ray, and Python Shell.

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

Post Syndicated from Manikanta Gona original https://aws.amazon.com/blogs/big-data/automatically-detect-personally-identifiable-information-in-amazon-redshift-using-aws-glue/

With the exponential growth of data, companies are handling huge volumes and a wide variety of data including personally identifiable information (PII). PII is a legal term pertaining to information that can identify, contact, or locate a single person. Identifying and protecting sensitive data at scale has become increasingly complex, expensive, and time-consuming. Organizations have to adhere to data privacy, compliance, and regulatory requirements such as GDPR and CCPA, and it’s important to identify and protect PII to maintain compliance. You need to identify sensitive data, including PII such as name, Social Security Number (SSN), address, email, driver’s license, and more. Even after identification, it’s cumbersome to implement redaction, masking, or encryption of sensitive data at scale.

Many companies identify and label PII through manual, time-consuming, and error-prone reviews of their databases, data warehouses and data lakes, thereby rendering their sensitive data unprotected and vulnerable to regulatory penalties and breach incidents.

In this post, we provide an automated solution to detect PII data in Amazon Redshift using AWS Glue.

Solution overview

With this solution, we detect PII in data on our Redshift data warehouse so that the we take and protect the data. We use the following services:

  • Amazon Redshift is a cloud data warehousing service that uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning (ML) to deliver the best price/performance at any scale. For our solution, we use Amazon Redshift to store the data.
  • AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, and combine data for analytics, ML, and application development. We use AWS Glue to discover the PII data that is stored in Amazon Redshift.
  • Amazon Simple Storage Services (Amazon S3) is a storage service offering industry-leading scalability, data availability, security, and performance.

The following diagram illustrates our solution architecture.

The solution includes the following high-level steps:

  1. Set up the infrastructure using an AWS CloudFormation template.
  2. Load data from Amazon S3 to the Redshift data warehouse.
  3. Run an AWS Glue crawler to populate the AWS Glue Data Catalog with tables.
  4. Run an AWS Glue job to detect the PII data.
  5. Analyze the output using Amazon CloudWatch.


The resources created in this post assume that a VPC is in place along with a private subnet and both their identifiers. This ensures that we don’t substantially change the VPC and subnet configuration. Therefore, we want to set up our VPC endpoints based on the VPC and subnet we choose to expose it in.

Before you get started, create the following resources as prerequisites:

  • An existing VPC
  • A private subnet in that VPC
  • A VPC gateway S3 endpoint
  • A VPC STS gateway endpoint

Set up the infrastructure with AWS CloudFormation

To create your infrastructure with a CloudFormation template, complete the following steps:

  1. Open the AWS CloudFormation console in your AWS account.
  2. Choose Launch Stack:
  3. Choose Next.
  4. Provide the following information:
    1. Stack name
    2. Amazon Redshift user name
    3. Amazon Redshift password
    4. VPC ID
    5. Subnet ID
    6. Availability Zones for the subnet ID
  5. Choose Next.
  6. On the next page, choose Next.
  7. Review the details and select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.
  9. Note the values for S3BucketName and RedshiftRoleArn on the stack’s Outputs tab.

Load data from Amazon S3 to the Redshift Data warehouse

With the COPY command, we can load data from files located in one or more S3 buckets. We use the FROM clause to indicate how the COPY command locates the files in Amazon S3. You can provide the object path to the data files as part of the FROM clause, or you can provide the location of a manifest file that contains a list of S3 object paths. COPY from Amazon S3 uses an HTTPS connection.

For this post, we use a sample personal health dataset. Load the data with the following steps:

  1. On the Amazon S3 console, navigate to the S3 bucket created from the CloudFormation template and check the dataset.
  2. Connect to the Redshift data warehouse using the Query Editor v2 by establishing a connection with the database you creating using the CloudFormation stack along with the user name and password.

After you’re connected, you can use the following commands to create the table in the Redshift data warehouse and copy the data.

  1. Create a table with the following query:
    CREATE TABLE personal_health_identifiable_information (
        mpi char (10),
        firstName VARCHAR (30),
        lastName VARCHAR (30),
        email VARCHAR (75),
        gender CHAR (10),
        mobileNumber VARCHAR(20),
        clinicId VARCHAR(10),
        creditCardNumber VARCHAR(50),
        driverLicenseNumber VARCHAR(40),
        patientJobTitle VARCHAR(100),
        ssn VARCHAR(15),
        geo VARCHAR(250),
        mbi VARCHAR(50)    

  2. Load the data from the S3 bucket:
    COPY personal_health_identifiable_information
    FROM 's3://<S3BucketName>/personal_health_identifiable_information.csv'
    IAM_ROLE '<RedshiftRoleArn>'
    delimiter ','
    region '<aws region>'

Provide values for the following placeholders:

  • RedshiftRoleArn – Locate the ARN on the CloudFormation stack’s Outputs tab
  • S3BucketName – Replace with the bucket name from the CloudFormation stack
  • aws region – Change to the Region where you deployed the CloudFormation template
  1. To verify the data was loaded, run the following command:
    SELECT * FROM personal_health_identifiable_information LIMIT 10;

Run an AWS Glue crawler to populate the Data Catalog with tables

On the AWS Glue console, select the crawler that you deployed as part of the CloudFormation stack with the name crawler_pii_db, then choose Run crawler.

When the crawler is complete, the tables in the database with the name pii_db are populated in the AWS Glue Data Catalog, and the table schema looks like the following screenshot.

Run an AWS Glue job to detect PII data and mask the corresponding columns in Amazon Redshift

On the AWS Glue console, choose ETL Jobs in the navigation pane and locate the detect-pii-data job to understand its configuration. The basic and advanced properties are configured using the CloudFormation template.

The basic properties are as follows:

  • Type – Spark
  • Glue version – Glue 4.0
  • Language – Python

For demonstration purposes, the job bookmarks option is disabled, along with the auto scaling feature.

We also configure advanced properties regarding connections and job parameters.
To access data residing in Amazon Redshift, we created an AWS Glue connection that utilizes the JDBC connection.

We also provide custom parameters as key-value pairs. For this post, we sectionalize the PII into five different detection categories:

  • custom – Coordinates

If you’re trying this solution from other countries, you can specify the custom PII fields using the custom category, because this solution is created based on US regions.

For demonstration purposes, we use a single table and pass it as the following parameter:

--table_name: table_name

For this post, we name the table personal_health_identifiable_information.

You can customize these parameters based on the individual business use case.

Run the job and wait for the Success status.

The job has two goals. The first goal is to identify PII data-related columns in the Redshift table and produce a list of these column names. The second goal is the obfuscation of data in those specific columns of the target table. As a part of the second goal, it reads the table data, applies a user-defined masking function to those specific columns, and updates the data in the target table using a Redshift staging table (stage_personal_health_identifiable_information) for the upserts.

Alternatively, you can also use dynamic data masking (DDM) in Amazon Redshift to protect sensitive data in your data warehouse.

Analyze the output using CloudWatch

When the job is complete, let’s review the CloudWatch logs to understand how the AWS Glue job ran. We can navigate to the CloudWatch logs by choosing Output logs on the job details page on the AWS Glue console.

The job identified every column that contains PII data, including custom fields passed using the AWS Glue job sensitive data detection fields.

Clean up

To clean up the infrastructure and avoid additional charges, complete the following steps:

  1. Empty the S3 buckets.
  2. Delete the endpoints you created.
  3. Delete the CloudFormation stack via the AWS CloudFormation console to delete the remaining resources.


With this solution, you can automatically scan the data located in Redshift clusters using an AWS Glue job, identify PII, and take necessary actions. This could help your organization with security, compliance, governance, and data protection features, which contribute towards the data security and data governance.

About the Authors

Manikanta Gona is a Data and ML Engineer at AWS Professional Services. He joined AWS in 2021 with 6+ years of experience in IT. At AWS, he is focused on Data Lake implementations, and Search, Analytical workloads using Amazon OpenSearch Service. In his spare time, he love to garden, and go on hikes and biking with his husband.

Denys Novikov is a Senior Data Lake Architect with the Professional Services team at Amazon Web Services. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems for Enterprise customers.

Anjan Mukherjee is a Data Lake Architect at AWS, specializing in big data and analytics solutions. He helps customers build scalable, reliable, secure and high-performance applications on the AWS platform.

Governance at scale: Enforce permissions and compliance by using policy as code

Post Syndicated from Roland Odorfer original https://aws.amazon.com/blogs/security/governance-at-scale-enforce-permissions-and-compliance-by-using-policy-as-code/

AWS Identity and Access Management (IAM) policies are at the core of access control on AWS. They enable the bundling of permissions, helping to provide effective and modular access control for AWS services. Service control policies (SCPs) complement IAM policies by helping organizations enforce permission guardrails at scale across their AWS accounts.

The use of access control policies isn’t limited to AWS resources. Customer applications running on AWS infrastructure can also use policies to help control user access. This often involves implementing custom authorization logic in the program code itself, which can complicate audits and policy changes.

To address this, AWS developed Amazon Verified Permissions, which helps implement fine-grained authorizations and permissions management for customer applications. This service uses Cedar, an open-source policy language, to define permissions separately from application code.

In addition to access control, you can also use policies to help monitor your organization’s individual governance rules for security, operations and compliance. One example of such a rule is the regular rotation of cryptographic keys to help reduce the impact in the event of a key leak.

However, manually checking and enforcing such rules is complex and doesn’t scale, particularly in fast-growing IT organizations. Therefore, organizations should aim for an automated implementation of such rules. In this blog post, I will show you how to use policy as code to help you govern your AWS landscape.

Policy as code

Similar to infrastructure as code (IaC), policy as code is an approach in which you treat policies like regular program code. You define policies in the form of structured text files (policy documents), which policy engines can automatically evaluate.

The main advantage of this approach is the ability to automate key governance tasks, such as policy deployment, enforcement, and auditing. By storing policy documents in a central repository, you can use versioning, simplify audits, and track policy changes. Furthermore, you can subject new policies to automated testing through integration into a continuous integration and continuous delivery (CI/CD) pipeline. Policy as code thus forms one of the key pillars of a modern automated IT governance strategy.

The following sections describe how you can combine different AWS services and functions to integrate policy as code into existing IT governance processes.

Access control – AWS resources

Every request to AWS control plane resources (specifically, AWS APIs)—whether through the AWS Management Console, AWS Command Line Interface (AWS CLI), or SDK — is authenticated and authorized by IAM. To determine whether to approve or deny a specific request, IAM evaluates both the applicable policies associated with the requesting principal (human user or workload) and the respective request context. These policies come in the form of JSON documents and follow a specific schema that allows for automated evaluation.

IAM supports a range of different policy types that you can use to help protect your AWS resources and implement a least privilege approach. For an overview of the individual policy types and their purpose, see Policies and permissions in IAM. For some practical guidance on how and when to use them, see IAM policy types: How and when to use them. To learn more about the IAM policy evaluation process and the order in which IAM reviews individual policy types, see Policy evaluation logic.

Traditionally, IAM relied on role-based access control (RBAC) for authorization. With RBAC, principals are assigned predefined roles that grant only the minimum permissions needed to perform their duties (also known as a least privilege approach). RBAC can seem intuitive initially, but it can become cumbersome at scale. Every new resource that you add to AWS requires the IAM administrator to manually update each role’s permissions – a tedious process that can hamper agility in dynamic environments.

In contrast, attribute-based access control (ABAC) bases permissions on the attributes assigned to users and resources. IAM administrators define a policy that allows access when certain tags match. ABAC is especially advantageous for dynamic, fast-growing organizations that have outgrown the RBAC model. To learn more about how to implement ABAC in an AWS environment, see Define permissions to access AWS resources based on tags.

For a list of AWS services that IAM supports and whether each service supports ABAC, see AWS services that work with IAM.

Access control – Customer applications

Customer applications that run on AWS resources often require an authorization mechanism that can control access to the application itself and its individual functions in a fine-grained manner.

Many customer applications come with custom authorization mechanisms in the application code itself, making it challenging to implement policy changes. This approach can also hinder monitoring and auditing because the implementation of authorization logic often differs between applications, and there is no uniform standard.

To address this challenge, AWS developed Amazon Verified Permissions and the associated open-source policy language Cedar. Amazon Verified Permissions replaces the custom authorization logic in the application code with a simple IsAuthorized API call, so that you can control and monitor authorization logic centrally by using Cedar-based policies. To learn how to integrate Amazon Verified Permissions into your applications and define custom access control policies with Cedar, see How to use Amazon Verified Permissions for authorization.


In addition to access control, you can also use policies to help monitor and enforce your organization’s individual governance rules for security, operations and compliance. AWS Config and AWS Security Hub play a central role in compliance because they enable the setup of multi-account environments that follow best practices (known as landing zones). AWS Config continuously tracks resource configurations and changes, while Security Hub aggregates and prioritizes security findings. With these services, you can create controls that enable automated audits and conformity checks. Alternatively, you can also choose from ready-to-use controls that cover individual compliance objectives such as encryption at rest, or entire frameworks, such as PCI-DSS and NIST 800-53.

AWS Control Tower builds on top of AWS Config and Security Hub to help simplify governance and compliance for multi-account environments. AWS Control Tower incorporates additional controls with the existing ones from AWS Config and Security Hub, presenting them together through a unified interface. These controls apply at different resource life cycle stages, as shown in Figure 1, and you define them through policies.

Figure 1: Resource life cycle

Figure 1: Resource life cycle

The controls can be categorized according to their behavior:

  • Proactive controls scan IaC templates before deployment to help identify noncompliance issues early.
  • Preventative controls restrict actions within an AWS environment to help prevent noncompliant actions. For example, these controls can help prevent deployment of large Amazon Elastic Compute Cloud (Amazon EC2) instances or restrict the available AWS Regions for some users.
  • Detective controls monitor deployed resources to help identify noncompliant resources that proactive and preventative controls might have missed. They also detect when deployed resources are changed or drift out of compliance over time.

Categorizing controls this way allows for a more comprehensive compliance framework that encompasses the entire resource life cycle. The stage at which each control applies determines how it may help enforce policies and governance rules.

With AWS Control Tower, you can enable hundreds of preconfigured security, compliance, and operational controls through the console with a single click, without needing to write code. You can also implement your own custom controls beyond what AWS Control Tower provides out of the box. The process for implementing custom controls varies depending on the type of control. In the following sections, I will explain how to set up custom controls for each type.

Proactive controls

Proactive controls are mechanisms that scan resources and their configuration to confirm that they adhere to compliance requirements before they are deployed. AWS provides a range of tools and services that you can use, both in isolation and in combination with each other, to implement proactive controls. The following diagram provides an overview of the available mechanisms and an example of their integration into a CI/CD pipeline for AWS Cloud Development Kit (CDK) projects.

Figure 2: CI/CD pipeline in AWS CDK projects

Figure 2: CI/CD pipeline in AWS CDK projects

As shown in Figure 2, you can use the following mechanisms as proactive controls:

  1. You can validate artifacts such as IaC templates locally on your machine by using the AWS CloudFormation Guard CLI, which facilitates a shift-left testing strategy. The advantage of this approach is the relatively early testing in the deployment cycle. This supports rapid iterative development and thus reduces waiting times.

    Alternatively, you can use the CfnGuardValidator plugin for AWS CDK, which integrates CloudFormation Guard rules into the AWS CDK CLI. This streamlines local development by applying policies and best practices directly within the CDK project.

  2. To centrally enforce validation checks, integrate the CfnGuardValidator plugin into a CDK CI/CD pipeline.
  3. You can also invoke the CloudFormation Guard CLI from within AWS CodeBuild buildspecs to embed CloudFormation Guard scans in a CI/CD pipeline.
  4. With CloudFormation hooks, you can impose policies on resources before CloudFormation deploys them.

AWS CloudFormation Guard uses a policy-as-code approach to evaluate IaC documents such as AWS CloudFormation templates and Terraform configuration files. The tool defines validation rules in the Guard language to check that these JSON or YAML documents align with best practices and organizational policies around provisioning cloud resources. By codifying rules and scanning infrastructure definitions programmatically, CloudFormation Guard automates policy enforcement and helps promote consistency and security across infrastructure deployments.

In the following example, you will use CloudFormation Guard to validate the name of an Amazon Simple Storage Service (Amazon S3) bucket in a CloudFormation template through a simple Guard rule:

To validate the S3 bucket

  1. Install CloudFormation Guard locally. For instructions, see Setting up AWS CloudFormation Guard.
  2. Create a YAML file named template.yaml with the following content and replace <DOC-EXAMPLE-BUCKET> with a bucket name of your choice (this file is a CloudFormation template, which creates an S3 bucket):
        Type: 'AWS::S3::Bucket'
          BucketName: '<DOC-EXAMPLE-BUCKET>'

  3. Create a text file named rules.guard with the following content:
    rule checkBucketName {
        Resources.S3Bucket.Properties.BucketName == '<DOC-EXAMPLE-BUCKET>'

  4. To validate your CloudFormation template against your Guard rules, run the following command in your local terminal:
    cfn-guard validate --rules rules.guard --data template.yaml

  5. If CloudFormation Guard successfully validates the template, the validate command produces an exit status of 0 ($? in bash). Otherwise, it returns a status report listing the rules that failed. You can test this yourself by changing the bucket name.

To accelerate the writing of Guard rules, use the CloudFormation Guard rulegen command, which takes a CloudFormation template file as an input and autogenerates Guard rules that match the properties of the template resources. To learn more about the structure of CloudFormation Guard rules and how to write them, see Writing AWS CloudFormation Guard rules.

The AWS Guard Rules Registry provides ready-to-use CloudFormation Guard rule files to accelerate your compliance journey, so that you don’t have to write them yourself.

Through the CDK plugin interface for policy validation, the CfnGuardValidator plugin integrates CloudFormation Guard rules into the AWS CDK and validates generated CloudFormation templates automatically during its synthesis step. For more details, see the plugin documentation and Accelerating development with AWS CDK plugin – CfnGuardValidator.

CloudFormation Guard alone can’t necessarily prevent the provisioning of noncompliant resources. This is because CloudFormation Guard can’t detect when templates or other documents change after validation. Therefore, I recommend that you combine CloudFormation Guard with a more authoritative mechanism.

One such mechanism is CloudFormation hooks, which you can use to validate AWS resources before you deploy them. You can configure hooks to cancel the deployment process with an alert if CloudFormation templates aren’t compliant, or just initiate an alert but complete the process. To learn more about CloudFormation hooks, see the following blog posts:

CloudFormation hooks provide a way to authoritatively enforce rules for resources deployed through CloudFormation. However, they don’t control resource creation that occurs outside of CloudFormation, such as through the console, CLI, SDK, or API. Terraform is one example that provisions resources directly through the AWS API rather than through CloudFormation templates. Because of this, I recommend that you implement additional detective controls by using AWS Config. AWS Config can continuously check resource configurations after deployment, regardless of the provisioning method. Using AWS Config rules complements the preventative capabilities of CloudFormation hooks.

Preventative controls

Preventative controls can help maintain compliance by applying guardrails that disallow policy-violating actions. AWS Control Tower integrates with AWS Organizations to implement preventative controls with SCPs. By using SCPs, you can restrict IAM permissions granted in a given organization or organizational unit (OU). One example of this is the selective activation of certain AWS Regions to meet data residency requirements.

SCPs are particularly valuable for managing IAM permissions across large environments with multiple AWS accounts. Organizations with many accounts might find it challenging to monitor and control IAM permissions. SCPs help address this challenge by applying centralized permission guardrails automatically to the accounts of an organization or organizational unit (OU). As new accounts are added, the SCPs are enforced without the need for extra configuration.

You can define SCPs through CloudFormation or CDK templates and deploy them through a CI/CD pipeline, similar to other AWS resources. Because misconfigured SCPs can negatively affect an organization’s operations, it’s vital that you test and simulate the effects of new policies in a sandbox environment before broader deployment. For an example of how to implement a pipeline for SCP testing, see the aws-service-control-policies-deployment GitHub repository.

To learn more about SCPs and how to implement them, see Service control policies (SCPs) and Best Practices for AWS Organizations Service Control Policies in a Multi-Account Environment.

Detective controls

Detective controls help detect noncompliance with existing resources. You can implement detective controls by using AWS Config rules, with both managed rules (provided by AWS) and custom rules available. You can implement custom rules either by using the domain-specific language Guard or Lambda functions. To learn more about the Guard option, see Evaluate custom configurations using AWS Config Custom Policy rules and the open source sample repository. For guidance on creating custom rules using Lambda functions, see AWS Config Rule Development Kit library: Build and operate rules at scale and Deploying Custom AWS Config Rules in an AWS Organization Environment.

To simplify audits for compliance frameworks such as PCI-DSS, HIPAA, and SOC2, AWS Config also offers conformance packs that bundle rules and remediation actions. To learn more about conformance packs, see Conformance Packs and Introducing AWS Config Conformance Packs.

When a resource’s configuration shifts to a noncompliant state that preventive controls didn’t avert, detective controls can help remedy the noncompliant state by implementing predefined actions, such as alerting an operator or reconfiguring the resource. You can implement these controls with AWS Config, which integrates with AWS Systems Manager Automation to help enable the remediation of noncompliant resources.

Security Hub can help centralize the detection of noncompliant resources across multiple AWS accounts. Using AWS Config and third-party tools for detection, Security Hub sends findings of noncompliance to Amazon EventBridge, which can then send notifications or launch automated remediations. You can also use the security controls and standards in Security Hub to monitor the configuration of your AWS infrastructure. This complements the conformance packs in AWS Config.


Many large and fast-growing organizations are faced with the challenge that manual IT governance processes are difficult to scale and can hinder growth. Policy-as-code services help to manage permissions and resource configurations at scale by automating key IT governance processes and, at the same time, increasing the quality and transparency of those processes. This helps to reconcile large environments with key governance objectives such as compliance.

In this post, you learned how to use policy as code to enhance IT governance. A first step is to activate AWS Control Tower, which provides preconfigured guardrails (SCPs) for each AWS account within an organization. These guardrails help enforce baseline compliance across infrastructure. You can then layer on additional controls to further strengthen governance in line with your needs. As a second step, you can select AWS Config conformance packs and Security Hub standards to complement the controls that AWS Control Tower offers. Finally, you can secure applications built on AWS by using Amazon Verified Permissions and Cedar for fine-grained authorization.


If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Roland Odorfer

Roland Odorfer

Roland is a Solutions Architect at AWS, based in Berlin, Germany. He works with German industry and manufacturing customers, helping them architect secure and scalable solutions. Roland is interested in distributed systems and security. He enjoys helping customers use the cloud to solve complex challenges.

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Post Syndicated from Sivaprasad Mahamkali original https://aws.amazon.com/blogs/big-data/modernize-your-etl-platform-with-aws-glue-studio-a-case-study-from-bms/

This post is co-written with Ramesh Daddala, Jitendra Kumar Dash and Pavan Kumar Bijja from Bristol Myers Squibb.

Bristol Myers Squibb (BMS) is a global biopharmaceutical company whose mission is to discover, develop, and deliver innovative medicines that help patients prevail over serious diseases. BMS is consistently innovating, achieving significant clinical and regulatory successes. In collaboration with AWS, BMS identified a business need to migrate and modernize their custom extract, transform, and load (ETL) platform to a native AWS solution to reduce complexities, resources, and investment to upgrade when new Spark, Python, or AWS Glue versions are released. In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine. AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor ETL jobs in AWS Glue. Offering this service reduced BMS’s operational maintenance and cost, and offered flexibility to business users to perform ETL jobs with ease.

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. Although this framework met their ETL objectives, it was difficult to maintain and upgrade. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). Each time the newer version of Apache Spark (and corresponding AWS Glue version) was released, it required significant operational support and time-consuming manual changes to upgrade existing ETL jobs. Manually upgrading, testing, and deploying over 5,000 jobs every few quarters was time consuming, error prone, costly, and not sustainable. Because another release for the EDLS framework was pending, BMS decided to assess alternate managed solutions to reduce their operational and upgrade challenges.

In this post, we share how BMS will modernize leveraging the success of the proof of concept targeting BMS’s ETL platform using AWS Glue Studio.

Solution overview

This solution addresses BMS’s EDLS requirements to overcome challenges using a custom-built ETL framework that required frequent maintenance and component upgrades (requiring extensive testing cycles), avoid complexity, and reduce the overall cost of the underlying infrastructure derived from the proof of concept. BMS had the following goals:

  • Develop ETL jobs using visual workflows provided by the AWS Glue Studio visual editor. The AWS Glue Studio visual editor is a low-code environment that allows you to compose data transformation workflows, seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine, and inspect the schema and data results in each step of the job.
  • Migrate over 5,000 existing ETL jobs using native AWS Glue Studio in an automated and scalable manner.

EDLS job steps and metadata

Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework. Each job step incorporates the following ETL functions:

  • File ingest – File ingestion enables you to ingest or list files from multiple file sources, like Amazon Simple Storage Service (Amazon S3), SFTP, and more. The metadata holds configurations for the file ingestion step to connect to Amazon S3 or SFTP endpoints and ingest files to target location. It retrieves the specified files and available metadata to show on the UI.
  • Data quality check – The data quality module enables you to perform quality checks on a huge amount of data and generate reports that describe and validate the data quality. The data quality step uses an EDLS ingested source object from Amazon S3 and runs one to many data conformance checks that are configured by the tenant.
  • Data transform join – This is one of the submodules of the data transform module that can perform joins between the datasets using a custom SQL based on the metadata configuration.
  • Database ingest – The database ingestion step is one of the important service components in EDLS, which facilitates you to obtain and import the desired data from the database and export it to a specific file in the location of your choice.
  • Data transform – The data transform module performs various data transformations against the source data using JSON-driven rules. Each data transform capability has its own JSON rule and, based on the specific JSON rule you provide, EDLS performs the data transformation on the files available in the Amazon S3 location.
  • Data persistence – The data persistence module is one of the important service components in EDLS, which enables you to obtain the desired data from the source and persist it to an Amazon Relational Database Service (Amazon RDS) database.

The metadata corresponding to each job step includes ingest sources, transformation rules, data quality checks, and data destinations stored in an RDS instance.

Migration utility

The solution involves building a Python utility that reads EDLS metadata from the RDS database and translating each of the job steps into an equivalent AWS Glue Studio visual editor JSON node representation.

AWS Glue Studio provides two types of transforms:

  • AWS Glue-native transforms – These are available to all users and are managed by AWS Glue.
  • Custom visual transforms – This new functionality allows you to upload custom-built transforms used in AWS Glue Studio. Custom visual transforms expand the managed transforms, enabling you to search and use transforms from the AWS Glue Studio interface.

The following is a high-level diagram depicting the sequence flow of migrating a BMS EDLS job to an AWS Glue Studio visual editor job.

Migrating BMS EDLS jobs to AWS Glue Studio includes the following steps:

  1. The Python utility reads existing metadata from the EDLS metadata database.
  2. For each job step type, based on the job metadata, the Python utility selects either the native AWS Glue transform, if available, or a custom-built visual transform (when the native functionality is missing).
  3. The Python utility parses the dependency information from metadata and builds a JSON object representing a visual workflow represented as a Directed Acyclic Graph (DAG).
  4. The JSON object is sent to the AWS Glue API, creating the AWS Glue ETL job. These jobs are visually represented in the AWS Glue Studio visual editor using a series of sources, transforms (native and custom), and targets.

Sample ETL job generation using AWS Glue Studio

The following flow diagram depicts a sample ETL job that incrementally ingests the source RDBMS data in AWS Glue based on modified timestamps using a custom SQL and merges it into the target data on Amazon S3.

The preceding ETL flow can be represented using the AWS Glue Studio visual editor through a combination of native and custom visual transforms.

Custom visual transform for incremental ingestion

Post POC, BMS and AWS identified there will be a need to leverage custom transforms to execute a subset of jobs leveraging their current EDLS Service where Glue Studio functionality will not be a natural fit. The BMS team’s requirement was to ingest data from various databases without depending on the existence of transaction logs or specific schema, so AWS Database Migration Service (AWS DMS) wasn’t an option for them. AWS Glue Studio provides the native SQL query visual transform, where a custom SQL query can be used to transform the source data. However, in order to query the source database table based on a modified timestamp column to retrieve new and modified records since the last ETL run, the previous timestamp column state needs to be persisted so it can be used in the current ETL run. This needs to be a recurring process and can also be abstracted across various RDBMS sources, including Oracle, MySQL, Microsoft SQL Server, SAP Hana, and more.

AWS Glue provides a job bookmark feature to track the data that has already been processed during a previous ETL run. An AWS Glue job bookmark supports one or more columns as the bookmark keys to determine new and processed data, and it requires that the keys are sequentially increasing or decreasing without gaps. Although this works for many incremental load use cases, the requirement is to ingest data from different sources without depending on any specific schema, so we didn’t use an AWS Glue job bookmark in this use case.

The SQL-based incremental ingestion pull can be developed in a generic way using a custom visual transform using a sample incremental ingestion job from a MySQL database. The incremental data is merged into the target Amazon S3 location in Apache Hudi format using an upsert write operation.

In the following example, we’re using the MySQL data source node to define the connection but the DynamicFrame of the data source itself is not used. The custom transform node (DB incremental ingestion) will act as the source for reading the data incrementally using the custom SQL query and the previously persisted timestamp from the last ingestion.

The transform accepts as input parameters the preconfigured AWS Glue connection name, database type, table name, and custom SQL (parameterized timestamp field).

The following is the sample visual transform Python code:

import boto3
from awsglue import DynamicFrame
from datetime import datetime

region_name = "us-east-1"

dyna_client = boto3.client('dynamodb')
HISTORIC_DATE = datetime(1970,1,1).strftime("%Y-%m-%d %H:%M:%S")
DYNAMODB_TABLE = "edls_run_stats"

def db_incremental(self, transformation_node, con_name, con_type, table_name, sql_query):
    logger = self.glue_ctx.get_logger()

    last_updt_tmst = get_table_last_updt_tmst(logger, DYNAMODB_TABLE, transformation_node)

    logger.info(f"Last updated timestamp from the DynamoDB-> {last_updt_tmst}")

    sql_query = sql_query.format(**{"lastmdfdtmst": last_updt_tmst})

    connection_options_source = {
        "useConnectionProperties": "true",
        "connectionName": con_name,
        "dbtable": table_name,
        "sampleQuery": sql_query

    df = self.glue_ctx.create_dynamic_frame.from_options(connection_type= con_type, connection_options= connection_options_source )
    return df

DynamicFrame.db_incremental = db_incremental

def get_table_last_updt_tmst(logger, table_name, transformation_node):
    response = dyna_client.get_item(TableName=table_name,
                                    Key={'transformation_node': {'S': transformation_node}}
    if 'Item' in response and 'last_updt_tmst' in response['Item']:
        return response['Item']['last_updt_tmst']['S']
        return HISTORIC_DATE

To merge the source data into the Amazon S3 target, a data lake framework like Apache Hudi or Apache Iceberg can be used, which is natively supported in AWS Glue 3.0 and later.

You can also use Amazon EventBridge to detect the final AWS Glue job state change and update the Amazon DynamoDB table’s last ingested timestamp accordingly.

Build the AWS Glue Studio job using the AWS SDK for Python (Boto3) and AWS Glue API

For the sample ETL flow and the corresponding AWS Glue Studio ETL job we showed earlier, the underlying CodeGenConfigurationNode struct (an AWS Glue job definition pulled using the AWS Command Line Interface (AWS CLI) command aws glue get-job –job-name <jobname>) is represented as a JSON object, shown in the following code:

"CodeGenConfigurationNodes": {<br />"node-1679802581077": {<br />"DynamicTransform": {<br />"Name": "DB Incremental Ingestion",<br />"TransformName": "db_incremental",<br />"Inputs": [<br />"node-1679707801419"<br />],<br />"Parameters": [<br />{<br />"Name": "node_name",<br />"Type": "str",<br />"Value": [<br />"job_123_incr_ingst_table1"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "jdbc_url",<br />"Type": "str",<br />"Value": [<br />"jdbc:mysql://database.xxxx.us-west-2.rds.amazonaws.com:3306/db_schema"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "db_creds",<br />"Type": "str",<br />"Value": [<br />"creds"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "table_name",<br />"Type": "str",<br />"Value": [<br />"tables"<br />],<br />"IsOptional": false<br />}<br />]<br />}<br />}<br />}<br />}

The JSON object (ETL job DAG) represented in the CodeGenConfigurationNode is generated through a series of native and custom transforms with the respective input parameter arrays. This can be accomplished using Python JSON encoders that serialize the class objects to JSON and subsequently create the AWS Glue Studio visual editor job using the Boto3 library and AWS Glue API.

Inputs required to configure the AWS Glue transforms are sourced from the EDLS jobs metadata database. The Python utility reads the metadata information, parses it, and configures the nodes automatically.

The order and sequencing of the nodes is sourced from the EDLS jobs metadata, with one node becoming the input to one or more downstream nodes building the DAG flow.

Benefits of the solution

The migration path will help BMS achieve their core objectives of decomposing their existing custom ETL framework to modular, visually configurable, less complex, and easily manageable pipelines using visual ETL components. The utility aids the migration of the legacy ETL pipelines to native AWS Glue Studio jobs in an automated and scalable manner.

With consistent out-of-the box visual ETL transforms in the AWS Glue Studio interface, BMS will be able to build sophisticated data pipelines without having to write code.

The custom visual transforms will extend AWS Glue Studio capabilities and fulfill some of the BMS ETL requirements where the native transforms are missing that functionality. Custom transforms will help define, reuse, and share business-specific ETL logic among all the teams. The solution increases the consistency between teams and keeps the ETL pipelines up to date by minimizing duplicate effort and code.

With minor modifications, the migration utility can be reused to automate migration of pipelines during future AWS Glue version upgrades.


The successful outcome of this proof of concept has shown that migrating over 5,000 jobs from BMS’s custom application to native AWS services can deliver significant productivity gains and cost savings. By moving to AWS, BMS will be able to reduce the effort required to support AWS Glue, improve DevOps delivery, and save an estimated 58% on AWS Glue spend.

These results are very promising, and BMS is excited to embark on the next phase of the migration. We believe that this project will have a positive impact on BMS’s business and help us achieve our strategic goals.

About the authors

Sivaprasad Mahamkali is a Senior Streaming Data Engineer at AWS Professional Services. Siva leads customer engagements related to real-time streaming solutions, data lakes, analytics using opensource and AWS services. Siva enjoys listening to music and loves to spend time with his family.

Dan Gibbar is a Senior Engagement Manager at AWS Professional Services. Dan leads healthcare and life science engagements collaborating with customers and partners to deliver outcomes. Dan enjoys the outdoors, attempting triathlons, music and spending time with family.

Shrinath Parikh as a Senior Cloud Data Architect with AWS. He works with customers around the globe to assist them with their data analytics, data lake, data lake house, serverless, governance and NoSQL use cases. In Shrinath’s off time, he enjoys traveling, spending time with family and learning/building new tools using cutting edge technologies.

Ramesh Daddala is a Associate Director at BMS. Ramesh leads enterprise data engineering engagements related to enterprise data lake services (EDLs) and collaborating with Data partners to deliver and support enterprise data engineering and ML capabilities. Ramesh enjoys the outdoors, traveling and loves to spend time with family.

Jitendra Kumar Dash is a Senior Cloud Architect at BMS with expertise in hybrid cloud services, Infrastructure Engineering, DevOps, Data Engineering, and Data Analytics solutions. He is passionate about food, sports, and adventure.

Pavan Kumar Bijja is a Senior Data Engineer at BMS. Pavan enables data engineering and analytical services to BMS Commercial domain using enterprise capabilities. Pavan leads enterprise metadata capabilities at BMS. Pavan loves to spend time with his family, playing Badminton and Cricket.

Shovan Kanjilal is a Senior Data Lake Architect working with strategic accounts in AWS Professional Services. Shovan works with customers to design data and machine learning solutions on AWS.