Tag Archives: Intermediate (200)

Using Amazon GuardDuty ECS runtime monitoring with Fargate and Amazon EC2

Post Syndicated from Luke Notley original https://aws.amazon.com/blogs/security/using-amazon-guardduty-ecs-runtime-monitoring-with-fargate-and-amazon-ec2/

Containerization technologies such as Docker and orchestration solutions such as Amazon Elastic Container Service (Amazon ECS) are popular with customers due to their portability and scalability advantages. Container runtime monitoring is essential for customers to monitor the health, performance, and security of containers. AWS services such as Amazon GuardDuty, Amazon Inspector, and AWS Security Hub play a crucial role in enhancing container security by providing threat detection, vulnerability assessment, centralized security management, and native Amazon Web Services (AWS) container runtime monitoring.

GuardDuty is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity and delivers detailed security findings for visibility and remediation. GuardDuty analyzes tens of billions of events per minute across multiple AWS data sources and provides runtime monitoring using a GuardDuty security agent for Amazon Elastic Kubernetes Service (Amazon EKS), Amazon ECS, and Amazon Elastic Compute Cloud (Amazon EC2) workloads. Findings are available in the GuardDuty console and through APIs, and a copy of every GuardDuty finding is sent to Amazon EventBridge so that you can incorporate these findings into your operational workflows. GuardDuty findings are also sent to Security Hub, helping you aggregate and correlate GuardDuty findings across accounts and AWS Regions in addition to findings from other security services.

We recently announced the general availability of GuardDuty Runtime Monitoring for Amazon ECS and the public preview of GuardDuty Runtime Monitoring for Amazon EC2, which detect runtime threats across more than 30 security finding types to help protect your Amazon ECS clusters running on AWS Fargate or Amazon EC2.

In this blog post, we provide an overview of the AWS Shared Responsibility Model and how it’s related to securing your container workloads running on AWS. We look at the steps to configure and use the new GuardDuty Runtime Monitoring for ECS, EC2, and EKS features. If you’re already using GuardDuty EKS Runtime Monitoring, this post provides the steps to migrate to GuardDuty Runtime Monitoring.

AWS Shared Responsibility Model and containers

Understanding the AWS Shared Responsibility Model is important in relation to Amazon ECS workloads. For Amazon ECS, AWS is responsible for the ECS control plane and the underlying infrastructure data plane. When using Amazon ECS on an EC2 instance, you have a greater share of security responsibilities compared to using ECS on Fargate. Specifically, you’re responsible for overseeing the ECS agent and worker node configuration on the EC2 instances.

Figure 1: AWS Shared Responsibility Model – Amazon ECS on EC2

In Fargate, each task operates within its dedicated virtual machine (VM), and there’s no sharing of the operating system or kernel resources between tasks. With Fargate, AWS is responsible for the security of the underlying instance in the cloud and the runtime used to run your tasks.

Figure 2: AWS Shared Responsibility Model – Amazon ECS on Fargate

When deploying container runtime images, your responsibilities include configuring applications, ensuring container security, and applying best practices for task runtime security. These best practices help to limit adversaries from expanding their influence beyond the confines of the local container process.

Amazon GuardDuty Runtime Monitoring consolidation

With the new feature launch, EKS Runtime Monitoring has now been consolidated into GuardDuty Runtime Monitoring. With this consolidation, you can manage the configuration for your AWS accounts one time instead of managing the Runtime Monitoring configuration separately for each resource type (EC2 instance, ECS cluster, or EKS cluster). The console provides a per-Region view so that you can enable Runtime Monitoring and manage GuardDuty security agents across resource types, which now share a common enabled or disabled state.

Note: The GuardDuty security agent still must be configured for each supported resource type.

Figure 3: GuardDuty Runtime Monitoring overview

In the following sections, we walk you through how to enable GuardDuty Runtime Monitoring and how you can reconfigure your existing EKS Runtime Monitoring deployment. We also cover how you can enable monitoring for ECS Fargate and EC2 resource types.

If you were using EKS Runtime Monitoring prior to this feature release, you will notice some new configuration options in the updated AWS Management Console for GuardDuty. It’s recommended that you enable Runtime Monitoring for each AWS account; to do this, follow these steps:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab and then choose Edit.
  3. Under Runtime Monitoring, select Enable for all accounts.
  4. Under Automated agent configuration – Amazon EKS, ensure Enable for all accounts is selected.
     
Figure 4: Edit GuardDuty Runtime Monitoring configuration

If you want to continue using EKS Runtime Monitoring without enabling GuardDuty ECS Runtime Monitoring or if the Runtime Monitoring protection plan isn’t yet available in your Region, you can configure EKS Runtime Monitoring using the AWS Command Line Interface (AWS CLI) or API. For more information on this migration, see Migrating from EKS Runtime Monitoring to GuardDuty Runtime Monitoring.
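
For reference, the same configuration can be scripted with the AWS CLI. The following is a minimal sketch; the EKS_RUNTIME_MONITORING feature name and EKS_ADDON_MANAGEMENT configuration name reflect the GuardDuty UpdateDetector API at the time of writing, so confirm them against the current API reference before relying on this.

# Look up the detector ID for the current Region (assumes GuardDuty is already enabled)
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

# Keep EKS Runtime Monitoring (with the managed EKS add-on) enabled on this detector
aws guardduty update-detector \
  --detector-id "$DETECTOR_ID" \
  --features '[{"Name":"EKS_RUNTIME_MONITORING","Status":"ENABLED","AdditionalConfiguration":[{"Name":"EKS_ADDON_MANAGEMENT","Status":"ENABLED"}]}]'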

Amazon GuardDuty ECS Runtime Monitoring for Fargate

For ECS using a Fargate capacity provider, GuardDuty deploys the security agent as a sidecar container alongside the essential task container. This doesn’t require you to make changes to the deployment of your Fargate tasks and ensures that new tasks are covered by GuardDuty Runtime Monitoring. If the GuardDuty security agent sidecar container is unable to launch in a healthy state, the ECS Fargate task is not prevented from running.

When using GuardDuty ECS Runtime Monitoring for Fargate, you can install the agent on Amazon ECS Fargate clusters within an AWS account or only on selected clusters. In the following sections, we show you how to enable the service and provision the agents.

Prerequisites

If you haven’t activated GuardDuty, learn more about the free trial and pricing and follow the steps in Getting started with GuardDuty to set up the service and start monitoring your account. Alternatively, you can activate GuardDuty by using the AWS CLI. The minimum Fargate environment version and supported container operating systems can be found in the Prerequisites for AWS Fargate (Amazon ECS only) support. The AWS Identity and Access Management (IAM) role used to run an Amazon ECS task must have permission to pull the GuardDuty sidecar container image from Amazon ECR. To learn more about the Amazon ECR repositories that host the GuardDuty agent for AWS Fargate, see Repository for GuardDuty agent on AWS Fargate (Amazon ECS only).

Enable Fargate Runtime Monitoring

To enable GuardDuty Runtime Monitoring for ECS Fargate, follow these steps:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab and then in the AWS Fargate (ECS only) section, choose Enable.
     
Figure 5: GuardDuty Runtime Monitoring configuration

If your AWS account is managed within AWS Organizations and you’re running ECS Fargate clusters in multiple AWS accounts, only the GuardDuty delegated administrator account can enable or disable GuardDuty ECS Runtime Monitoring for the member accounts. GuardDuty is a Regional service and must be enabled within each desired Region. If you’re using multiple accounts and want to centrally manage GuardDuty, see Managing multiple accounts in Amazon GuardDuty.

You can use the same process to enable GuardDuty ECS Runtime Monitoring and manage the GuardDuty security agent. It’s recommended to enable GuardDuty ECS Runtime Monitoring automatically for member accounts within your organization.

To automatically enable GuardDuty Runtime Monitoring for ECS Fargate across your AWS accounts:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab, and then choose Edit.
  3. Under Runtime Monitoring, ensure Enable for all accounts is selected.
  4. Under Automated agent configuration – AWS Fargate (ECS only), select Enable for all accounts, then choose Save.
     
Figure 6: Enable ECS GuardDuty Runtime Monitoring for AWS accounts
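
If you script this from the delegated administrator account, a hedged CLI equivalent of the preceding steps uses the UpdateOrganizationConfiguration API. The feature and additional-configuration names below are assumptions based on the GuardDuty API at the time of writing; verify them against the current API reference.

# Run in the GuardDuty delegated administrator account, in each Region you use
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

# Auto-enable Runtime Monitoring and managed Fargate (ECS) agents for all member accounts
aws guardduty update-organization-configuration \
  --detector-id "$DETECTOR_ID" \
  --features '[{"Name":"RUNTIME_MONITORING","AutoEnable":"ALL","AdditionalConfiguration":[{"Name":"ECS_FARGATE_AGENT_MANAGEMENT","AutoEnable":"ALL"}]}]'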

After you enable GuardDuty ECS Runtime Monitoring for Fargate, GuardDuty can start monitoring and analyzing the runtime activity events for ECS tasks in your account. GuardDuty automatically creates a virtual private cloud (VPC) endpoint in your AWS account in the VPCs where you’re deploying your Fargate tasks. The VPC endpoint is used by the GuardDuty agent to send telemetry and configuration data back to the GuardDuty service API. For GuardDuty to receive the runtime events for your ECS Fargate clusters, you can choose one of three approaches to deploy the fully managed security agent:

  • Monitor existing and new ECS Fargate clusters
  • Monitor existing and new ECS Fargate clusters and exclude selective ECS Fargate clusters
  • Monitor selective ECS Fargate clusters

It’s recommended to monitor each ECS Fargate cluster and then exclude clusters on an as-needed basis. To learn more, see Configure GuardDuty ECS Runtime Monitoring.

Monitor all ECS Fargate clusters

Use this method when you want GuardDuty to automatically deploy and manage the security agent across each ECS Fargate cluster within your account. GuardDuty will automatically install the security agent when new ECS Fargate clusters are created.

To enable GuardDuty Runtime Monitoring for ECS Fargate across each ECS cluster:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab.
  3. Under the Automated agent configuration for AWS Fargate (ECS only), select Enable.
     
Figure 7: Enable GuardDuty Runtime Monitoring for ECS clusters

Monitor all ECS Fargate clusters and exclude selected ECS Fargate clusters

GuardDuty automatically installs the security agent on each ECS Fargate cluster. To exclude an ECS Fargate cluster from GuardDuty Runtime Monitoring, you can use the key-value pair GuardDutyManaged:false as a tag. Add this exclusion tag to your ECS Fargate cluster either before enabling Runtime Monitoring or during cluster creation to prevent automatic GuardDuty monitoring.

To add an exclusion tag to an ECS cluster:

  1. In the Amazon ECS console, in the navigation pane under Clusters, select the cluster name.
  2. Select the Tags tab.
  3. Select Manage Tags and enter the key GuardDutyManaged and value false, then choose Save.
     
Figure 8: GuardDuty Runtime Monitoring ECS cluster exclusion tags

To make sure that these tags aren’t modified, you can prevent tags from being modified except by authorized principals.
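
If you manage cluster tags from the command line, a minimal sketch of the exclusion tag looks like the following; the cluster ARN is a placeholder.

# Exclude a specific ECS Fargate cluster from automated GuardDuty agent management
aws ecs tag-resource \
  --resource-arn arn:aws:ecs:us-east-1:111122223333:cluster/example-cluster \
  --tags key=GuardDutyManaged,value=false

# For the inclusion-tag approach described in the next section, use value=true instead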

Monitor selected ECS Fargate clusters

You can monitor selected ECS Fargate clusters when you want GuardDuty to handle the deployment and updates of the security agent exclusively for specific ECS Fargate clusters within your account, for example when you want to evaluate GuardDuty ECS Runtime Monitoring for Fargate. By using inclusion tags, GuardDuty automatically deploys and manages the security agent only for the ECS Fargate clusters that are tagged with the key-value pair GuardDutyManaged:true. To use inclusion tags, verify that the automated agent configuration for AWS Fargate (ECS) hasn’t been enabled.

To add an inclusion tag to an ECS cluster:

  1. In the Amazon ECS console, in the navigation pane under Clusters, select the cluster name.
  2. Select the Tags tab.
  3. Select Manage Tags and enter the key GuardDutyManaged and value true, then choose Save.
     
Figure 9: GuardDuty inclusion tags

To make sure that these tags aren’t modified, you can prevent tags from being modified except by authorized principals.

Fargate task level rollout

After you’ve enabled GuardDuty ECS Runtime Monitoring for Fargate, newly launched tasks will include the GuardDuty agent sidecar container. For pre-existing long-running tasks, consider forcing a task refresh through a targeted deployment to activate the GuardDuty sidecar security container. This can be achieved using either a rolling update (ECS deployment type) or a blue/green deployment with AWS CodeDeploy.

To verify that the GuardDuty agent is running for a task, check for an additional container whose name is prefixed with aws-guardduty-agent-. Successful deployment changes the container’s status to Running.

To view the GuardDuty agent container running as part of your ECS task:

  1. In the Amazon ECS console, in the navigation pane under Clusters, select the cluster name.
  2. Select the Tasks tab.
  3. Select the Task GUID you want to review.
  4. Under the Containers section, you can view the GuardDuty agent container.
     
Figure 10: View status of the GuardDuty sidecar container
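
You can perform the same check from the CLI. The following sketch assumes a cluster named example-cluster and inspects the first running task for containers whose names start with aws-guardduty-agent-.

CLUSTER=example-cluster   # placeholder cluster name

# Pick a running task and list any GuardDuty sidecar containers with their status
TASK_ARN=$(aws ecs list-tasks --cluster "$CLUSTER" --desired-status RUNNING \
  --query 'taskArns[0]' --output text)

aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
  --query 'tasks[0].containers[?starts_with(name, `aws-guardduty-agent-`)].{Name:name,Status:lastStatus}'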

GuardDuty ECS on Fargate coverage monitoring

Coverage status of your ECS Fargate clusters is evaluated regularly and can be classified as either healthy or unhealthy. An unhealthy cluster signals a configuration issue, and you can find more details in the GuardDuty Runtime Monitoring notifications section. When you enable GuardDuty ECS Runtime Monitoring and deploy the security agent in your clusters, you can view the coverage status of new ECS Fargate clusters and tasks in the GuardDuty console.

To view coverage status:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Runtime coverage tab, and then select ECS clusters runtime coverage.
     
Figure 11: GuardDuty Runtime ECS coverage status overview

Troubleshooting steps for cluster coverage issues, such as clusters reporting as unhealthy, and a sample notification schema are available at Coverage for Fargate (Amazon ECS only) resource. More information regarding monitoring can be found in the next section.
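
Coverage can also be queried programmatically with the ListCoverage API. The following sketch lists resources whose runtime coverage is unhealthy; the filter criterion key and value are assumptions based on the API documentation at the time of writing, so verify them before use.

DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

# List resources whose runtime coverage status is currently UNHEALTHY
aws guardduty list-coverage \
  --detector-id "$DETECTOR_ID" \
  --filter-criteria '{"FilterCriterion":[{"CriterionKey":"COVERAGE_STATUS","FilterCondition":{"Equals":["UNHEALTHY"]}}]}'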

Amazon GuardDuty Runtime Monitoring for EC2

Amazon EC2 Runtime Monitoring in GuardDuty provides threat detection for Amazon EC2 instances, including EC2 instances managed by Amazon ECS. The GuardDuty security agent, which sends telemetry and configuration data back to the GuardDuty service API, must be installed on each EC2 instance.

Prerequisites

If you haven’t activated Amazon GuardDuty, learn more about the free trial and pricing and follow the steps in Getting started with GuardDuty to set up the service and start monitoring your account. Alternatively, you can activate GuardDuty by using the AWS CLI.

To use Amazon EC2 Runtime Monitoring to monitor your ECS container instances, your operating environment must meet the prerequisites for EC2 instance support and the GuardDuty security agent must be installed manually onto the EC2 instances you want to monitor. GuardDuty Runtime Monitoring for EC2 requires you to create the Amazon VPC endpoint manually. If the VPC already has the GuardDuty VPC endpoint created from a previous deployment, you don’t need to create the VPC endpoint again.

If you plan to deploy the agent to Amazon EC2 instances using AWS Systems Manager, an Amazon-owned Systems Manager document named AmazonGuardDuty-ConfigureRuntimeMonitoringSsmPlugin is available for use. Alternatively, you can use RPM installation scripts whether or not your Amazon ECS instances are managed by AWS Systems Manager.
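
The following sketch shows how these prerequisites might be scripted. The GuardDuty data endpoint service name and the parameters expected by the Systems Manager document are assumptions, so check the GuardDuty and Systems Manager documentation before running this; the resource IDs and tag values are placeholders.

# 1. Create the interface VPC endpoint used by the GuardDuty security agent
#    (the guardduty-data service name is an assumption; confirm it in the GuardDuty documentation)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.guardduty-data \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled

# 2. Install the agent on managed instances, targeted here by a placeholder tag
#    (review the SSM document for any required parameters before sending the command)
aws ssm send-command \
  --document-name AmazonGuardDuty-ConfigureRuntimeMonitoringSsmPlugin \
  --targets "Key=tag:GuardDutyManaged,Values=true" \
  --comment "Install the GuardDuty runtime agent"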

Enable GuardDuty Runtime Monitoring for EC2

GuardDuty Runtime Monitoring for EC2 is automatically enabled when you enable GuardDuty Runtime Monitoring.

To enable GuardDuty Runtime Monitoring:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Configuration tab, and then in the Runtime Monitoring section, choose Enable.
     
Figure 12: Enable GuardDuty runtime monitoring

After the prerequisites have been met and you enable GuardDuty Runtime Monitoring, GuardDuty starts monitoring and analyzing the runtime activity events for the EC2 instances.

If your AWS account is managed within AWS Organizations and you’re running ECS on EC2 clusters in multiple AWS accounts, only the GuardDuty delegated administrator can enable or disable GuardDuty ECS Runtime Monitoring for the member accounts. If you’re using multiple accounts and want to centrally manage GuardDuty, see Managing multiple accounts in Amazon GuardDuty.

GuardDuty EC2 coverage monitoring

When you enable GuardDuty Runtime Monitoring and deploy the security agent on your Amazon EC2 instances, you can view the coverage status of the instances.

To view EC2 instance coverage status:

  1. In the GuardDuty console, in the navigation pane under Protection plans, select Runtime Monitoring.
  2. Select the Runtime coverage tab, and then select EC2 instance runtime coverage.
     
Figure 13: GuardDuty Runtime Monitoring coverage for EC2 overview

Cluster coverage status notifications can be configured using the notification schema available under Configuring coverage status change notifications. More information regarding monitoring can be found in the following section.

GuardDuty Runtime Monitoring notifications

If the coverage status of your ECS cluster or EC2 instance becomes unhealthy, there are a number of recommended troubleshooting steps that you can follow.

To stay informed about changes in the coverage status of an ECS cluster or EC2 instance, it’s recommended that you set up status change notifications. Because GuardDuty publishes these status changes on the EventBridge bus associated with your AWS account, you can do this by setting up an Amazon EventBridge rule to receive notifications.

The following example AWS CloudFormation template creates EventBridge rules that send notifications to Amazon Simple Notification Service (Amazon SNS) topics and subscribes an email address to those topics; a sample deployment command follows the template.

AWSTemplateFormatVersion: "2010-09-09"
Description: CloudFormation template for Amazon EventBridge rules that monitor the healthy/unhealthy coverage status of GuardDuty Runtime Monitoring security agents. The template creates the EventBridge rules and Amazon SNS topics used to notify you by email when the agent state changes
Parameters:
  namePrefix:
    Description: A simple naming convention for the SNS topics and EventBridge rules
    Type: String
    Default: GuardDuty-Runtime-Agent-Status
    MinLength: 1
    MaxLength: 50
    AllowedPattern: ^[a-zA-Z0-9\-_]*$
    ConstraintDescription: Maximum 50 characters of numbers, lower/upper case letters, -,_.
  operatorEmail:
    Type: String
    Description: Email address to notify if there are security agent status state changes
    AllowedPattern: "([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)"
    ConstraintDescription: must be a valid email address.
Resources:
  eventRuleUnhealthy:
    Type: AWS::Events::Rule
    Properties:
      EventBusName: default
      EventPattern:
        source:
          - aws.guardduty
        detail-type:
          - GuardDuty Runtime Protection Unhealthy
      Name: !Join [ '-', [ 'Rule', !Ref namePrefix, 'Unhealthy' ] ]
      State: ENABLED
      Targets:
        - Id: "GDUnhealthyTopic"
          Arn: !Ref notificationTopicUnhealthy
  eventRuleHealthy:
    Type: AWS::Events::Rule
    Properties:
      EventBusName: default
      EventPattern:
        source:
          - aws.guardduty
        detail-type:
          - GuardDuty Runtime Protection Healthy
      Name: !Join [ '-', [ 'Rule', !Ref namePrefix, 'Healthy' ] ]
      State: ENABLED
      Targets:
        - Id: "GDHealthyTopic"
          Arn: !Ref notificationTopicHealthy
  eventTopicPolicy:
    Type: 'AWS::SNS::TopicPolicy'
    Properties:
      PolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: 'sns:Publish'
            Resource: '*'
      Topics:
        - !Ref notificationTopicHealthy
        - !Ref notificationTopicUnhealthy
  notificationTopicHealthy:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Join [ '-', [ 'Topic', !Ref namePrefix, 'Healthy' ] ]
      DisplayName: GD-Healthy-State
      Subscription:
      - Endpoint:
          Ref: operatorEmail
        Protocol: email
  notificationTopicUnhealthy:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Join [ '-', [ 'Topic', !Ref namePrefix, 'Unhealthy' ] ]
      DisplayName: GD-Unhealthy-State
      Subscription:
      - Endpoint:
          Ref: operatorEmail
        Protocol: email
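
One way to deploy this template is with the AWS CLI; the stack name, template file name, and email address below are placeholders.

aws cloudformation deploy \
  --template-file guardduty-runtime-agent-status.yaml \
  --stack-name guardduty-runtime-agent-status \
  --parameter-overrides operatorEmail=security-team@example.com

After the stack is created, confirm the subscription from the email that Amazon SNS sends to the operator address so that notifications can be delivered.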

GuardDuty findings

When GuardDuty detects a potential threat and generates a security finding, you can view the details of the corresponding finding. The GuardDuty agent collects kernel-space and user-space events from the hosts and the containers. See Finding types for detailed information and recommended remediation activities regarding each finding type. You can generate sample GuardDuty Runtime Monitoring findings using the GuardDuty console or you can use this GitHub script to generate some basic detections within GuardDuty.
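
You can also generate sample findings from the CLI with the CreateSampleFindings API. The finding type below is one example of a runtime finding type; confirm valid values in the GuardDuty finding types list.

DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text)

# Generate a sample runtime finding (the type name is an example; confirm it in the finding types list)
aws guardduty create-sample-findings \
  --detector-id "$DETECTOR_ID" \
  --finding-types 'CryptoCurrency:Runtime/BitcoinTool.B!DNS'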

Example ECS findings

GuardDuty security findings can indicate either a compromised container workload or ECS cluster or a set of compromised credentials in your AWS environment.

To view a full description and remediation recommendations regarding a finding:

  1. In the GuardDuty console, in the navigation pane, select Findings.
  2. Select a finding from the findings list, and then choose the Info hyperlink.
     
Figure 14: GuardDuty example finding

The ResourceType for an ECS Fargate finding could be an ECS cluster or container. If the resource type in the finding details is ECSCluster, it indicates that either a task or a container inside an ECS Fargate cluster is potentially compromised. You can identify the name and Amazon Resource Name (ARN) of the ECS cluster, along with the task ARN and task definition ARN details in the cluster.

To view affected resources, ECS cluster details, task details and instance details regarding a finding:

  1. In the GuardDuty console, in the navigation pane, select Findings.
  2. Select a finding related to an ECS cluster from the findings list, and then scroll down in the right-hand pane to view the different section headings.
     
Figure 15: GuardDuty finding details for Fargate

The Action and Runtime details provide information about the potentially suspicious activity. The example finding in Figure 16 tells you that the listed ECS container in your environment is querying a domain that is associated with Bitcoin or other cryptocurrency-related activity. This can lead to threat actors attempting to take control over the compute resource to repurpose it for unauthorized cryptocurrency mining.

Figure 16: GuardDuty ECS example finding with action and process details

Example ECS on EC2 findings

When a finding is generated from EC2, additional information is shown including the instance details, IAM profile details, and instance tags (as shown in Figure 17), which can be used to help identify the affected EC2 instance.

Figure 17: GuardDuty EC2 instance details for a finding

This additional instance-level information can help you focus your remediation efforts.

GuardDuty finding remediation

When you’re actively monitoring the runtime behavior of containers within your tasks and GuardDuty identifies potential security issues within your AWS environment, you should consider taking the following suggested remediation actions. This helps to address potential security issues and to contain the potential threat in your AWS account.

  1. Identify the potentially impacted Amazon ECS Cluster – The runtime monitoring finding provides the potentially impacted Amazon ECS cluster details in the finding details panel.
  2. Evaluate the source of potential compromise – Evaluate whether the detected finding originated in the container’s image. If it did, identify all other tasks that use this image and evaluate the source of the image.
  3. Isolate the impacted tasks – To isolate the affected tasks, restrict both incoming and outgoing traffic to the tasks by implementing VPC network rules that deny all traffic (a minimal CLI sketch follows this list). This approach can be effective in halting an ongoing attack by cutting off all connections to the affected tasks. Be aware that terminating the tasks could eliminate crucial evidence related to the finding that you might need for further analysis. If the task’s container has accessed the underlying Amazon EC2 host, its associated instance credentials might have been compromised. For more information, see Remediating compromised AWS credentials.
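
The following is a minimal sketch of one way to build such an isolation boundary: a quarantine security group with no inbound or outbound rules that you can then attach to the affected service or task elastic network interface. The VPC ID is a placeholder.

# Create a quarantine security group with no inbound rules
ISOLATION_SG=$(aws ec2 create-security-group \
  --group-name ecs-task-isolation \
  --description "Quarantine security group with no inbound or outbound rules" \
  --vpc-id vpc-0123456789abcdef0 \
  --query 'GroupId' --output text)

# Remove the default allow-all egress rule so the group denies all traffic
aws ec2 revoke-security-group-egress \
  --group-id "$ISOLATION_SG" \
  --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'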

Each GuardDuty Runtime Monitoring finding provides specific prescriptive guidance regarding finding remediation. Within each finding, you can choose the Remediating Runtime Monitoring findings link for more information.

To view the recommended remediation actions:

  1. In the GuardDuty console, in the navigation pane, select Findings.
  2. Select a finding from the findings list, choose the Info hyperlink, and then scroll down in the right-hand pane to view the remediation recommendations section.
     
Figure 18: GuardDuty Runtime Monitoring finding remediation

Summary

You can now use Amazon GuardDuty for ECS Runtime Monitoring to monitor your Fargate and EC2 workloads. For a full list of Regions where ECS Runtime Monitoring is available, see Region-specific feature availability.

It’s recommended that you assess your container application using the AWS Well-Architected Tool to ensure adherence to best practices. The recently launched AWS Well-Architected Amazon ECS Lens offers a specialized assessment for container-based operations and troubleshooting of Amazon ECS applications, aligning with the ECS best practices guide. You can integrate this lens into the AWS Well-Architected Tool available in the console.

For more information regarding security monitoring and threat detection, visit the AWS Online Tech Talks. To get hands-on experience and learn more about AWS security services, visit our AWS Activation Days website to find a workshop in your Region.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Luke Notley

Luke is a Senior Solutions Architect with Amazon Web Services and is based in Western Australia. Luke has a passion for helping customers connect business outcomes with technology and assisting customers throughout their cloud journey, helping them design scalable, flexible, and resilient architectures. In his spare time, he enjoys traveling, coaching basketball teams, and DJing.

Arran Peterson

Arran, a Solutions Architect based in Adelaide, South Australia, collaborates closely with customers to deeply understand their distinct business needs and goals. His role extends to assisting customers in recognizing both the opportunities and risks linked to their decisions related to cloud solutions.

Access AWS using a Google Cloud Platform native workload identity

Post Syndicated from Simran Singh original https://aws.amazon.com/blogs/security/access-aws-using-a-google-cloud-platform-native-workload-identity/

Organizations undergoing cloud migrations and business transformations often find themselves managing IT operations in hybrid or multicloud environments. This can make it more complex to safeguard workloads, applications, and data, and to securely handle identities and permissions across Amazon Web Services (AWS), hybrid, and multicloud setups.

In this post, we show you how to assume an AWS Identity and Access Management (IAM) role in your AWS accounts to securely issue temporary credentials for applications that run on the Google Cloud Platform (GCP). We also present best practices and key considerations in this authentication flow. Furthermore, this post provides references to supplementary GCP documentation that offer additional context and provide steps relevant to setup on GCP.

Access control across security realms

As your multicloud environment grows, managing access controls across providers becomes more complex. By implementing the right access controls from the beginning, you can help scale your cloud operations effectively without compromising security. When you deploy apps across multiple cloud providers, you should implement a consistent authentication and authorization mechanism across both cloud environments to help maintain a secure and cost-effective environment. In the following sections, you’ll learn how to enforce such objectives across AWS and workloads hosted on GCP, as shown in Figure 1.

Figure 1: Authentication flow between GCP and AWS

Prerequisites

To follow along with this walkthrough, complete the following prerequisites.

  1. Create a service account in GCP. Resources in GCP use service accounts to make API calls. When you create a GCP resource, such as a compute engine instance in GCP, a default service account gets created automatically. Although you can use this default service account in the solution described in this post, we recommend that you create a dedicated user-managed service account, because you can control what permissions to assign to the service account within GCP.

    To learn more about best practices for service accounts, see Best practices for using service accounts in the Google documentation. In this post, we use a GCP virtual machine (VM) instance for demonstration purposes. To attach service accounts to other GCP resources, see Attach service accounts to resources.

  2. Create a VM instance in GCP and attach the service account that you created in Step 1. Resources in GCP store their metadata information in a metadata server, and you can request an instance’s identity token from the server. You will use this identity token in the authentication flow later in this post.
  3. Install the AWS Command Line Interface (AWS CLI) on the GCP VM instance that you created in Step 2.
  4. Install jq and curl.

GCP VM identity authentication flow

Obtaining temporary AWS credentials for workloads that run on GCP is a multi-step process. In this flow, you use the identity token from the GCP compute engine metadata server to call the AssumeRoleWithWebIdentity API to request AWS temporary credentials. This flow gives your application the flexibility to request credentials for an IAM role that you have configured with a sufficient trust policy; the corresponding Amazon Resource Name (ARN) for the IAM role must be known to the application.

Define an IAM role on AWS

Because AWS already supports OpenID Connect (OIDC) federation, you can use the OIDC token provided in GCP as described in Step 2 of the Prerequisites, and you don’t need to create a separate OIDC provider in your AWS account. Instead, to create an IAM role for OIDC federation, follow the steps in Creating a role for web identity or OpenID Connect Federation (console). Using an OIDC principal without a condition can be overly permissive. To make sure that only the intended identity provider assumes the role, you need to provide a StringEquals condition in the trust policy for this IAM role. Add the condition keys accounts.google.com:aud, accounts.google.com:oaud, and accounts.google.com:sub to the role’s trust policy, as shown in the following.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "accounts.google.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "accounts.google.com:aud": "<azp-value>",
                    "accounts.google.com:oaud": "<aud-value>",
                    "accounts.google.com:sub": "<sub-value>"
                }
            }
        }
    ]
}

Make sure to replace the <placeholder values> with your values from the Google ID Token. The ID token issued for the service accounts has the azp (AUTHORIZED_PARTY) field set, so condition keys are mapped to the Google ID Token fields as follows:

  • accounts.google.com:oaud condition key matches the aud (AUDIENCE) field on the Google ID token.
  • accounts.google.com:aud condition key matches the azp (AUTHORIZED_PARTY) field on the Google ID token.
  • accounts.google.com:sub condition key matches the sub (SUBJECT) field on the Google ID token.

For more information about the Google aud and azp fields, see the Google Identity Platform OpenID Connect guide.
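
To find the values for these placeholders, you can request an identity token on the GCP VM and decode its payload. The following sketch mirrors the decoding approach used later in this post; the audience string is a placeholder.

AUDIENCE="dev-aws-account-teama"   # placeholder audience string

# Fetch an identity token from the metadata server and print the claims used in the trust policy
TOKEN=$(curl -sH "Metadata-Flavor: Google" \
  "http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=${AUDIENCE}&format=full")

echo "$TOKEN" | jq -R 'split(".") | .[1] | @base64d | fromjson | {aud, azp, sub}'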

Authentication flow

The authentication flow for the scenario is shown in Figure 2.

Figure 2: Detailed authentication flow with AssumeRoleWithWebIdentity API

The authentication flow has the following steps:

  1. On AWS, you can source external credentials by configuring the credential_process setting in the config file. For the syntax and operating system requirements, see Source credentials with an external process. For this post, we have created a custom profile TeamA-S3ReadOnlyAccess as follows in the config file:
    [profile TeamA-S3ReadOnlyAccess]
    credential_process = /opt/bin/credentials.sh

    To use different settings, you can create and reference additional profiles.

  2. Specify a program or a script that credential_process will invoke. For this post, credential_process invokes the script /opt/bin/credentials.sh which has the following code. Make sure to replace <111122223333> with your own account ID.
    #!/bin/bash
    
    AUDIENCE="dev-aws-account-teama"
    ROLE_ARN="arn:aws:iam::<111122223333>:role/RoleForAccessFromGCPTeamA"
    
    jwt_token=$(curl -sH "Metadata-Flavor: Google" "http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=${AUDIENCE}&format=full&licenses=FALSE")
    
    jwt_sub=$(jq -R 'split(".") | .[1] | @base64d | fromjson' <<< "$jwt_token" | jq -r '.sub')
    
    credentials=$(aws sts assume-role-with-web-identity --role-arn $ROLE_ARN --role-session-name $jwt_sub --web-identity-token $jwt_token | jq '.Credentials' | jq '.Version=1')
    
    
    echo $credentials

    The script performs the following steps:

    1. Google generates a new unique instance identity token in the JSON Web Token (JWT) format.
      jwt_token=$(curl -sH "Metadata-Flavor: Google" "http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=${AUDIENCE}&format=full&licenses=FALSE")

      The payload of the token includes several details about the instance and the audience URI, as shown in the following.

      {
         "iss": "[TOKEN_ISSUER]",
         "iat": [ISSUED_TIME],
         "exp": [EXPIRED_TIME],
         "aud": "[AUDIENCE]",
         "sub": "[SUBJECT]",
         "azp": "[AUTHORIZED_PARTY]",
         "google": {
          "compute_engine": {
            "project_id": "[PROJECT_ID]",
            "project_number": [PROJECT_NUMBER],
            "zone": "[ZONE]",
            "instance_id": "[INSTANCE_ID]",
            "instance_name": "[INSTANCE_NAME]",
            "instance_creation_timestamp": [CREATION_TIMESTAMP],
            "instance_confidentiality": [INSTANCE_CONFIDENTIALITY],
            "license_id": [
              "[LICENSE_1]",
                ...
              "[LICENSE_N]"
            ]
          }
        }
      }

      The IAM trust policy uses the aud (AUDIENCE), azp (AUTHORIZED_PARTY) and sub (SUBJECT) values from the JWT token to help ensure that the IAM role defined in the section Define an IAM role in AWS can be assumed only by the intended GCP service account.

    2. The script invokes the AssumeRoleWithWebIdentity API call, passing in the identity token from the previous step and specifying which IAM role to assume. The script uses the Identity subject claim as the session name, which can facilitate auditing or forensic operations on this AssumeRoleWithWebIdentity API call. AWS verifies the authenticity of the token before returning temporary credentials. In addition, you can verify the token in your credential program by using the process described at Obtaining the instance identity token.

      The script then returns the temporary credentials to the credential_process as the JSON output on STDOUT; we used jq to parse the output in the desired JSON format.

      jwt_sub=$(jq -R 'split(".") | .[1] | @base64d | fromjson' <<< "$jwt_token" | jq -r '.sub')
      
      credentials=$(aws sts assume-role-with-web-identity --role-arn $ROLE_ARN --role-session-name $jwt_sub --web-identity-token $jwt_token | jq '.Credentials' | jq '.Version=1')
      
      echo $credentials

    The following is an example of temporary credentials returned by the credential_process script:

    {
      "Version": 1,
      "AccessKeyId": "AKIAIOSFODNN7EXAMPLE",
      "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "SessionToken": "FwoGZXIvYXdzEBUaDOSY+1zJwXi29+/reyLSASRJwSogY/Kx7NomtkCoSJyipWuu6sbDIwFEYtZqg9knuQQyJa9fP68/LCv4jH/efuo1WbMpjh4RZpbVCOQx/zggZTyk2H5sFvpVRUoCO4dc7eqftMhdKtcq67vAUljmcDkC9l0Fei5tJBvVpQ7jzsYeduX/5VM6uReJaSMeOXnIJnQZce6PI3GBiLfaX7Co4o216oS8yLNusTK1rrrwrY2g5e3Zuh1oXp/Q8niFy2FSLN62QHfniDWGO8rCEV9ZnZX0xc4ZN68wBc1N24wKgT+xfCjamcCnBjJYHI2rEtJdkE6bRQc2WAUtccsQk5u83vWae+SpB9ycE/dzfXurqcjCP0urAp4k9aFZFsRIGfLAI1cOABX6CzF30qrcEBnEXAMPLESESSIONTOKEN==",
      "Expiration": "2023-08-31T04:45:30Z"
    }

Note that AWS SDKs store the returned AWS credentials in memory when they call credential_process. AWS SDKs keep track of the credential expiration and generate new AWS session credentials through the credential process. In contrast, the AWS CLI doesn’t cache external process credentials; instead, the AWS CLI calls the credential_process for every CLI request, which creates a new role session and could result in slight delays when you run commands.

Test access in the AWS CLI

After you configure the config file for the credential_process, verify your setup by running the following command.

aws sts get-caller-identity --profile TeamA-S3ReadOnlyAccess

The output will look similar to the following.

{
   "UserId":"AIDACKCEVSQ6C2EXAMPLE:[Identity subject claim]",
   "Account":"111122223333",
   "Arn":"arn:aws:iam::111122223333:role/RoleForAccessFromGCPTeamA:[Identity subject claim]"
}

Amazon CloudTrail logs the AssumeRoleWithWebIdentity API call, as shown in Figure 3. The log captures the audience in the identity token as well as the IAM role that is being assumed. It also captures the session name with a reference to the Identity subject claim, which can help simplify auditing or forensic operations on this AssumeRoleWithWebIdentity API call.

Figure 3: CloudTrail event for AssumeRoleWithWebIdentity API call from GCP VM

Test access in the AWS SDK

The next step is to test access in the AWS SDK. The following Python program shows how you can refer to the custom profile configured for the credential process.

import boto3

session = boto3.Session(profile_name='TeamA-S3ReadOnlyAccess')
client = session.client('s3')

response = client.list_buckets()
for _bucket in response['Buckets']:
    print(_bucket['Name'])

Before you run this program, run pip install boto3. Make sure the IAM role that you configured for this flow has the AmazonS3ReadOnlyAccess policy attached to it. This program prints the names of the existing S3 buckets in your account. For example, if your AWS account has two S3 buckets named DOC-EXAMPLE-BUCKET1 and DOC-EXAMPLE-BUCKET2, then the output of the preceding program shows the following:

DOC-EXAMPLE-BUCKET1
DOC-EXAMPLE-BUCKET2

If you don’t have an existing S3 bucket, then create an S3 bucket before you run the preceding program.

The list_bucket API call is also logged in CloudTrail, capturing the identity and source of the calling application, as shown in Figure 4.

Figure 4: CloudTrail event for S3 API call made with federated identity session

Clean up

If you don’t need to further use the resources that you created for this walkthrough, delete them to avoid future charges for the deployed resources:

  • Delete the VM instance and service account created in GCP.
  • Delete the resources that you provisioned on AWS to test the solution.

Conclusion

In this post, you learned how to exchange the identity token of a virtual machine running on a GCP compute engine to assume a role on AWS, so that you can seamlessly and securely access AWS resources from GCP hosted workloads.

We walked you through the steps required to set up the credential process and shared best practices to consider in this authentication flow. You can also apply the same pattern to workloads deployed on GCP functions or Google Kubernetes Engine (GKE) when they request access to AWS resources.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Simran Singh

Simran is a Senior Solutions Architect at AWS. In this role, he assists large enterprise customers in meeting their key business objectives on AWS. His areas of expertise include artificial intelligence/machine learning, security, and improving the experience of developers building on AWS. He has also earned a coveted golden jacket for achieving all currently offered AWS certifications.

Rashmi Iyer

Rashmi is a Solutions Architect at AWS supporting financial services enterprises. She helps customers build secure, resilient, and scalable architectures on AWS while adhering to architectural best practices. Before joining AWS, Rashmi worked for over a decade to architect and design complex telecom solutions in the packet core domain.

New Amazon CloudWatch log class to cost-effectively scale your AWS Glue workloads

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/new-amazon-cloudwatch-log-class-to-cost-effectively-scale-your-aws-glue-workloads/

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores.

One of the most common questions we get from customers is how to effectively optimize costs on AWS Glue. Over the years, we have built multiple features and tools to help customers manage their AWS Glue costs. For example, AWS Glue Auto Scaling and AWS Glue Flex can help you reduce the compute cost associated with processing your data. AWS Glue interactive sessions and notebooks can help you reduce the cost of developing your ETL jobs. For more information about cost-saving best practices, refer to Monitor and optimize cost on AWS Glue for Apache Spark. Additionally, to understand data transfer costs, refer to the Cost Optimization Pillar defined in AWS Well-Architected Framework. For data storage, you can apply general best practices defined for each data source. For a cost optimization strategy using Amazon Simple Storage Service (Amazon S3), refer to Optimizing storage costs using Amazon S3.

In this post, we tackle the remaining piece—the cost of logs written by AWS Glue.

Before we get into the cost analysis of logs, let’s understand the reasons to enable logging for your AWS Glue job and the current options available. When you start an AWS Glue job, it sends real-time logging information to Amazon CloudWatch (every 5 seconds and before each executor stops) once the Spark application starts running. You can view the logs on the AWS Glue console or the CloudWatch console dashboard. These logs provide you with insights into your job runs and help you optimize and troubleshoot your AWS Glue jobs. AWS Glue offers a variety of filters and settings to reduce the verbosity of your logs. As the number of job runs increases, so does the volume of logs generated.

To optimize CloudWatch Logs costs, AWS recently announced a new log class for infrequently accessed logs called Amazon CloudWatch Logs Infrequent Access (Logs IA). This new log class offers a tailored set of capabilities at a lower cost for infrequently accessed logs, enabling you to consolidate all your logs in one place in a cost-effective manner. This class provides a more cost-effective option for ingesting logs that only need to be accessed occasionally for auditing or debugging purposes.

In this post, we explain what the Logs IA class is, how it can help reduce costs compared to the standard log class, and how to configure your AWS Glue resources to use this new log class. By routing logs to Logs IA, you can achieve significant savings in your CloudWatch Logs spend without sacrificing access to important debugging information when you need it.

CloudWatch log groups used by AWS Glue job continuous logging

When continuous logging is enabled, AWS Glue for Apache Spark writes Spark driver/executor logs and progress bar information into the following log group:

/aws-glue/jobs/logs-v2

If a security configuration is enabled for CloudWatch logs, AWS Glue for Apache Spark will create a log group named as follows for continuous logs:

<Log-Group-Name>-<Security-Configuration-Name>

The default and custom log groups will be as follows:

  • The default continuous log group will be /aws-glue/jobs/logs-v2-<Security-Configuration-Name>
  • The custom continuous log group will be <custom-log-group-name>-<Security-Configuration-Name>

You can provide a custom log group name through the job parameter --continuous-log-logGroup.

Getting started with the new Infrequent Access log class for AWS Glue workload

To gain the benefits from Logs IA for your AWS Glue workloads, you need to complete the following two steps:

  1. Create a new log group using the new Log IA class.
  2. Configure your AWS Glue job to point to the new log group.

Complete the following steps to create a new log group using the new Infrequent Access log class (a CLI equivalent is shown after the steps):

  1. On the CloudWatch console, choose Log groups under Logs in the navigation pane.
  2. Choose Create log group.
  3. For Log group name, enter /aws-glue/jobs/logs-v2-infrequent-access.
  4. For Log class, choose Infrequent Access.
  5. Choose Create.
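
You can also create this log group from the AWS CLI; the --log-group-class option requires a recent CLI version, so update your CLI if the option isn't recognized.

# Create the continuous-logging log group using the Infrequent Access log class
aws logs create-log-group \
  --log-group-name /aws-glue/jobs/logs-v2-infrequent-access \
  --log-group-class INFREQUENT_ACCESS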

Complete the following steps to configure your AWS Glue job to point to the new log group:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose your job.
  3. On the Job details tab, choose Add new parameter under Job parameters.
  4. For Key, enter --continuous-log-logGroup.
  5. For Value, enter /aws-glue/jobs/logs-v2-infrequent-access.
  6. Choose Save.
  7. Choose Run to trigger the job.

New log events are written into the new log group.

View the logs with the Infrequent Access log class

Now you’re ready to view the logs with the Infrequent Access log class. Open the log group /aws-glue/jobs/logs-v2-infrequent-access on the CloudWatch console.

When you choose one of the log streams, you will notice that it redirects you to the CloudWatch console Logs Insights page with a pre-configured default command and your log stream selected by default. By choosing Run query, you can view the actual log events on the Logs Insights page.
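
You can run the same kind of query programmatically. The following sketch queries the last hour of logs; the time arithmetic uses GNU date syntax.

# Start a Logs Insights query against the Infrequent Access log group (last hour)
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws-glue/jobs/logs-v2-infrequent-access \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @message | sort @timestamp desc | limit 20' \
  --query 'queryId' --output text)

# Fetch the results once the query finishes
aws logs get-query-results --query-id "$QUERY_ID"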

Considerations

Keep in mind the following considerations:

  • You cannot change the log class of a log group after it’s created. You need to create a new log group to configure the Infrequent Access class.
  • The Logs IA class offers a subset of CloudWatch Logs capabilities, including managed ingestion, storage, cross-account log analytics, and encryption with a lower ingestion price per GB. For example, you can’t view log events through the standard CloudWatch Logs console. To learn more about the features offered across both log classes, refer to Log Classes.

Conclusion

This post provided step-by-step instructions to guide you through enabling Logs IA for your AWS Glue job logs. If your AWS Glue ETL jobs generate large volumes of log data, which can become a challenge as you scale your applications, the best practices demonstrated in this post can help you cost-effectively scale while centralizing all your logs in CloudWatch Logs. Start using the Infrequent Access class with your AWS Glue workloads today and enjoy the cost benefits.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Abeetha Bala is a Senior Product Manager for Amazon CloudWatch, primarily focused on logs. Being customer obsessed, she solves observability challenges through innovative and cost-effective ways.

Kinshuk Pahare is a leader in AWS Glue’s product management team. He drives efforts on the platform, developer experience, and big data processing frameworks like Apache Spark, Ray, and Python Shell.

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

Post Syndicated from Manikanta Gona original https://aws.amazon.com/blogs/big-data/automatically-detect-personally-identifiable-information-in-amazon-redshift-using-aws-glue/

With the exponential growth of data, companies are handling huge volumes and a wide variety of data including personally identifiable information (PII). PII is a legal term pertaining to information that can identify, contact, or locate a single person. Identifying and protecting sensitive data at scale has become increasingly complex, expensive, and time-consuming. Organizations have to adhere to data privacy, compliance, and regulatory requirements such as GDPR and CCPA, and it’s important to identify and protect PII to maintain compliance. You need to identify sensitive data, including PII such as name, Social Security Number (SSN), address, email, driver’s license, and more. Even after identification, it’s cumbersome to implement redaction, masking, or encryption of sensitive data at scale.

Many companies identify and label PII through manual, time-consuming, and error-prone reviews of their databases, data warehouses and data lakes, thereby rendering their sensitive data unprotected and vulnerable to regulatory penalties and breach incidents.

In this post, we provide an automated solution to detect PII data in Amazon Redshift using AWS Glue.

Solution overview

With this solution, we detect PII in data stored in our Redshift data warehouse so that we can take action to protect it. We use the following services:

  • Amazon Redshift is a cloud data warehousing service that uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning (ML) to deliver the best price/performance at any scale. For our solution, we use Amazon Redshift to store the data.
  • AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, and combine data for analytics, ML, and application development. We use AWS Glue to discover the PII data that is stored in Amazon Redshift.
  • Amazon Simple Storage Services (Amazon S3) is a storage service offering industry-leading scalability, data availability, security, and performance.

The following diagram illustrates our solution architecture.

The solution includes the following high-level steps:

  1. Set up the infrastructure using an AWS CloudFormation template.
  2. Load data from Amazon S3 to the Redshift data warehouse.
  3. Run an AWS Glue crawler to populate the AWS Glue Data Catalog with tables.
  4. Run an AWS Glue job to detect the PII data.
  5. Analyze the output using Amazon CloudWatch.

Prerequisites

The resources created in this post assume that a VPC and a private subnet are already in place and that you know both of their identifiers. This way, the solution doesn’t substantially change your VPC and subnet configuration; the VPC endpoints are set up in the VPC and subnet that you choose.

Before you get started, create the following resources as prerequisites:

  • An existing VPC
  • A private subnet in that VPC
  • A VPC gateway S3 endpoint
  • A VPC STS interface endpoint

Set up the infrastructure with AWS CloudFormation

To create your infrastructure with a CloudFormation template, complete the following steps:

  1. Open the AWS CloudFormation console in your AWS account.
  2. Choose Launch Stack:
  3. Choose Next.
  4. Provide the following information:
    1. Stack name
    2. Amazon Redshift user name
    3. Amazon Redshift password
    4. VPC ID
    5. Subnet ID
    6. Availability Zones for the subnet ID
  5. Choose Next.
  6. On the next page, choose Next.
  7. Review the details and select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.
  9. Note the values for S3BucketName and RedshiftRoleArn on the stack’s Outputs tab.

Load data from Amazon S3 to the Redshift Data warehouse

With the COPY command, we can load data from files located in one or more S3 buckets. We use the FROM clause to indicate how the COPY command locates the files in Amazon S3. You can provide the object path to the data files as part of the FROM clause, or you can provide the location of a manifest file that contains a list of S3 object paths. COPY from Amazon S3 uses an HTTPS connection.

For this post, we use a sample personal health dataset. Load the data with the following steps:

  1. On the Amazon S3 console, navigate to the S3 bucket created from the CloudFormation template and check the dataset.
  2. Connect to the Redshift data warehouse using Query Editor v2 by establishing a connection with the database that you created using the CloudFormation stack, along with the user name and password.

After you’re connected, you can use the following commands to create the table in the Redshift data warehouse and copy the data.

  1. Create a table with the following query:
    CREATE TABLE personal_health_identifiable_information (
        mpi char (10),
        firstName VARCHAR (30),
        lastName VARCHAR (30),
        email VARCHAR (75),
        gender CHAR (10),
        mobileNumber VARCHAR(20),
        clinicId VARCHAR(10),
        creditCardNumber VARCHAR(50),
        driverLicenseNumber VARCHAR(40),
        patientJobTitle VARCHAR(100),
        ssn VARCHAR(15),
        geo VARCHAR(250),
        mbi VARCHAR(50)    
    );

  2. Load the data from the S3 bucket:
    COPY personal_health_identifiable_information
    FROM 's3://<S3BucketName>/personal_health_identifiable_information.csv'
    IAM_ROLE '<RedshiftRoleArn>'
    CSV
    delimiter ','
    region '<aws region>'
    IGNOREHEADER 1;

Provide values for the following placeholders:

  • RedshiftRoleArn – Locate the ARN on the CloudFormation stack’s Outputs tab
  • S3BucketName – Replace with the bucket name from the CloudFormation stack
  • aws region – Change to the Region where you deployed the CloudFormation template
  3. To verify the data was loaded, run the following command:
    SELECT * FROM personal_health_identifiable_information LIMIT 10;

Run an AWS Glue crawler to populate the Data Catalog with tables

On the AWS Glue console, select the crawler that you deployed as part of the CloudFormation stack with the name crawler_pii_db, then choose Run crawler.

When the crawler is complete, the tables in the database with the name pii_db are populated in the AWS Glue Data Catalog, and the table schema looks like the following screenshot.

Run an AWS Glue job to detect PII data and mask the corresponding columns in Amazon Redshift

On the AWS Glue console, choose ETL Jobs in the navigation pane and locate the detect-pii-data job to understand its configuration. The basic and advanced properties are configured using the CloudFormation template.

The basic properties are as follows:

  • Type – Spark
  • Glue version – Glue 4.0
  • Language – Python

For demonstration purposes, the job bookmarks option is disabled, along with the auto scaling feature.

We also configure advanced properties regarding connections and job parameters.
To access data residing in Amazon Redshift, we created an AWS Glue connection that uses a JDBC connection.

We also provide custom parameters as key-value pairs. For this post, we group the PII detection entities into five categories:

  • universal – PERSON_NAME, EMAIL, CREDIT_CARD
  • hipaa – PERSON_NAME, PHONE_NUMBER, USA_SSN, USA_ITIN, BANK_ACCOUNT, USA_DRIVING_LICENSE, USA_HCPCS_CODE, USA_NATIONAL_DRUG_CODE, USA_NATIONAL_PROVIDER_IDENTIFIER, USA_DEA_NUMBER, USA_HEALTH_INSURANCE_CLAIM_NUMBER, USA_MEDICARE_BENEFICIARY_IDENTIFIER
  • networking – IP_ADDRESS, MAC_ADDRESS
  • united_states – PHONE_NUMBER, USA_PASSPORT_NUMBER, USA_SSN, USA_ITIN, BANK_ACCOUNT
  • custom – Coordinates

If you’re trying this solution from a country other than the US, you can specify additional PII fields using the custom category, because the other categories in this solution are based on US-specific entity types.

For demonstration purposes, we use a single table and pass it as the following parameter:

--table_name: table_name

For this post, we name the table personal_health_identifiable_information.

You can customize these parameters based on the individual business use case.

Run the job and wait for the Success status.

The job has two goals. The first goal is to identify PII data-related columns in the Redshift table and produce a list of these column names. The second goal is the obfuscation of data in those specific columns of the target table. As a part of the second goal, it reads the table data, applies a user-defined masking function to those specific columns, and updates the data in the target table using a Redshift staging table (stage_personal_health_identifiable_information) for the upserts.

Alternatively, you can also use dynamic data masking (DDM) in Amazon Redshift to protect sensitive data in your data warehouse.
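
For reference, the following is a minimal sketch of what the DDM alternative could look like, submitted through the Amazon Redshift Data API with Boto3. The cluster identifier, database, and user are placeholders, and the masking policy statements are illustrative; confirm the exact DDM syntax in the Amazon Redshift documentation before using them.

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Illustrative DDM statements: mask all but the last four digits of the ssn column.
# Verify the exact syntax in the Amazon Redshift dynamic data masking documentation.
ddm_statements = [
    """CREATE MASKING POLICY mask_ssn
       WITH (ssn VARCHAR(15))
       USING ('XXX-XX-' || SUBSTRING(ssn, 8, 4))""",
    """ATTACH MASKING POLICY mask_ssn
       ON personal_health_identifiable_information (ssn)
       TO PUBLIC""",
]

# batch_execute_statement runs the statements in order on the cluster
redshift_data.batch_execute_statement(
    ClusterIdentifier="redshift-cluster-1",  # placeholder
    Database="dev",                          # placeholder
    DbUser="awsuser",                        # placeholder
    Sqls=ddm_statements,
)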

Analyze the output using CloudWatch

When the job is complete, let’s review the CloudWatch logs to understand how the AWS Glue job ran. We can navigate to the CloudWatch logs by choosing Output logs on the job details page on the AWS Glue console.

The job identified every column that contains PII data, including custom fields passed using the AWS Glue job sensitive data detection fields.

Clean up

To clean up the infrastructure and avoid additional charges, complete the following steps:

  1. Empty the S3 buckets.
  2. Delete the endpoints you created.
  3. Delete the CloudFormation stack via the AWS CloudFormation console to delete the remaining resources.

Conclusion

With this solution, you can automatically scan the data located in Redshift clusters using an AWS Glue job, identify PII, and take the necessary actions. This can help your organization meet its security, compliance, governance, and data protection requirements.


About the Authors

Manikanta Gona is a Data and ML Engineer at AWS Professional Services. He joined AWS in 2021 with 6+ years of experience in IT. At AWS, he is focused on data lake implementations and search and analytical workloads using Amazon OpenSearch Service. In his spare time, he loves to garden, hike, and bike with his husband.

Denys Novikov is a Senior Data Lake Architect with the Professional Services team at Amazon Web Services. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems for Enterprise customers.

Anjan Mukherjee is a Data Lake Architect at AWS, specializing in big data and analytics solutions. He helps customers build scalable, reliable, secure and high-performance applications on the AWS platform.

Governance at scale: Enforce permissions and compliance by using policy as code

Post Syndicated from Roland Odorfer original https://aws.amazon.com/blogs/security/governance-at-scale-enforce-permissions-and-compliance-by-using-policy-as-code/

AWS Identity and Access Management (IAM) policies are at the core of access control on AWS. They enable the bundling of permissions, helping to provide effective and modular access control for AWS services. Service control policies (SCPs) complement IAM policies by helping organizations enforce permission guardrails at scale across their AWS accounts.

The use of access control policies isn’t limited to AWS resources. Customer applications running on AWS infrastructure can also use policies to help control user access. This often involves implementing custom authorization logic in the program code itself, which can complicate audits and policy changes.

To address this, AWS developed Amazon Verified Permissions, which helps implement fine-grained authorizations and permissions management for customer applications. This service uses Cedar, an open-source policy language, to define permissions separately from application code.

In addition to access control, you can also use policies to help monitor your organization’s individual governance rules for security, operations and compliance. One example of such a rule is the regular rotation of cryptographic keys to help reduce the impact in the event of a key leak.

However, manually checking and enforcing such rules is complex and doesn’t scale, particularly in fast-growing IT organizations. Therefore, organizations should aim for an automated implementation of such rules. In this blog post, I will show you how to use policy as code to help you govern your AWS landscape.

Policy as code

Similar to infrastructure as code (IaC), policy as code is an approach in which you treat policies like regular program code. You define policies in the form of structured text files (policy documents), which policy engines can automatically evaluate.

The main advantage of this approach is the ability to automate key governance tasks, such as policy deployment, enforcement, and auditing. By storing policy documents in a central repository, you can use versioning, simplify audits, and track policy changes. Furthermore, you can subject new policies to automated testing through integration into a continuous integration and continuous delivery (CI/CD) pipeline. Policy as code thus forms one of the key pillars of a modern automated IT governance strategy.

The following sections describe how you can combine different AWS services and functions to integrate policy as code into existing IT governance processes.

Access control – AWS resources

Every request to AWS control plane resources (specifically, AWS APIs)—whether through the AWS Management Console, AWS Command Line Interface (AWS CLI), or SDK — is authenticated and authorized by IAM. To determine whether to approve or deny a specific request, IAM evaluates both the applicable policies associated with the requesting principal (human user or workload) and the respective request context. These policies come in the form of JSON documents and follow a specific schema that allows for automated evaluation.

IAM supports a range of different policy types that you can use to help protect your AWS resources and implement a least privilege approach. For an overview of the individual policy types and their purpose, see Policies and permissions in IAM. For some practical guidance on how and when to use them, see IAM policy types: How and when to use them. To learn more about the IAM policy evaluation process and the order in which IAM reviews individual policy types, see Policy evaluation logic.

Traditionally, IAM relied on role-based access control (RBAC) for authorization. With RBAC, principals are assigned predefined roles that grant only the minimum permissions needed to perform their duties (also known as a least privilege approach). RBAC can seem intuitive initially, but it can become cumbersome at scale. Every new resource that you add to AWS requires the IAM administrator to manually update each role’s permissions – a tedious process that can hamper agility in dynamic environments.

In contrast, attribute-based access control (ABAC) bases permissions on the attributes assigned to users and resources. IAM administrators define a policy that allows access when certain tags match. ABAC is especially advantageous for dynamic, fast-growing organizations that have outgrown the RBAC model. To learn more about how to implement ABAC in an AWS environment, see Define permissions to access AWS resources based on tags.

For a list of AWS services that IAM supports and whether each service supports ABAC, see AWS services that work with IAM.
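
As an illustration of the ABAC approach, the following sketch creates an identity policy that allows EC2 start and stop actions only when the project tag on the instance matches the project tag on the principal. The policy name and the chosen actions are assumptions for this example.

import json
import boto3

iam = boto3.client("iam")

# Access is allowed only when the resource's "project" tag equals the principal's "project" tag
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StartStopInstancesInOwnProject",
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/project": "${aws:PrincipalTag/project}"
                }
            },
        }
    ],
}

iam.create_policy(
    PolicyName="abac-project-match",  # illustrative name
    PolicyDocument=json.dumps(abac_policy),
)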

Access control – Customer applications

Customer applications that run on AWS resources often require an authorization mechanism that can control access to the application itself and its individual functions in a fine-grained manner.

Many customer applications come with custom authorization mechanisms in the application code itself, making it challenging to implement policy changes. This approach can also hinder monitoring and auditing because the implementation of authorization logic often differs between applications, and there is no uniform standard.

To address this challenge, AWS developed Amazon Verified Permissions and the associated open-source policy language Cedar. Amazon Verified Permissions replaces the custom authorization logic in the application code with a simple IsAuthorized API call, so that you can control and monitor authorization logic centrally by using Cedar-based policies. To learn how to integrate Amazon Verified Permissions into your applications and define custom access control policies with Cedar, see How to use Amazon Verified Permissions for authorization.
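
The following is a minimal sketch of what such an IsAuthorized call could look like with Boto3. The policy store ID and the entity and action names are placeholders that would come from your own Cedar schema and policies.

import boto3

avp = boto3.client("verifiedpermissions", region_name="us-east-1")

# Ask Verified Permissions whether user "alice" may view a specific record
response = avp.is_authorized(
    policyStoreId="PSEXAMPLEabcdefg111111",                                # placeholder
    principal={"entityType": "MyApp::User", "entityId": "alice"},          # placeholder entity
    action={"actionType": "MyApp::Action", "actionId": "viewRecord"},      # placeholder action
    resource={"entityType": "MyApp::Record", "entityId": "record-123"},    # placeholder resource
)

if response["decision"] == "ALLOW":
    print("Access granted")
else:
    print("Access denied")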

Compliance

In addition to access control, you can also use policies to help monitor and enforce your organization’s individual governance rules for security, operations and compliance. AWS Config and AWS Security Hub play a central role in compliance because they enable the setup of multi-account environments that follow best practices (known as landing zones). AWS Config continuously tracks resource configurations and changes, while Security Hub aggregates and prioritizes security findings. With these services, you can create controls that enable automated audits and conformity checks. Alternatively, you can also choose from ready-to-use controls that cover individual compliance objectives such as encryption at rest, or entire frameworks, such as PCI-DSS and NIST 800-53.

AWS Control Tower builds on top of AWS Config and Security Hub to help simplify governance and compliance for multi-account environments. AWS Control Tower incorporates additional controls with the existing ones from AWS Config and Security Hub, presenting them together through a unified interface. These controls apply at different resource life cycle stages, as shown in Figure 1, and you define them through policies.

Figure 1: Resource life cycle


The controls can be categorized according to their behavior:

  • Proactive controls scan IaC templates before deployment to help identify noncompliance issues early.
  • Preventative controls restrict actions within an AWS environment to help prevent noncompliant actions. For example, these controls can help prevent deployment of large Amazon Elastic Compute Cloud (Amazon EC2) instances or restrict the available AWS Regions for some users.
  • Detective controls monitor deployed resources to help identify noncompliant resources that proactive and preventative controls might have missed. They also detect when deployed resources are changed or drift out of compliance over time.

Categorizing controls this way allows for a more comprehensive compliance framework that encompasses the entire resource life cycle. The stage at which each control applies determines how it may help enforce policies and governance rules.

With AWS Control Tower, you can enable hundreds of preconfigured security, compliance, and operational controls through the console with a single click, without needing to write code. You can also implement your own custom controls beyond what AWS Control Tower provides out of the box. The process for implementing custom controls varies depending on the type of control. In the following sections, I will explain how to set up custom controls for each type.

Proactive controls

Proactive controls are mechanisms that scan resources and their configuration to confirm that they adhere to compliance requirements before they are deployed. AWS provides a range of tools and services that you can use, both in isolation and in combination with each other, to implement proactive controls. The following diagram provides an overview of the available mechanisms and an example of their integration into a CI/CD pipeline for AWS Cloud Development Kit (CDK) projects.

Figure 2: CI/CD pipeline in AWS CDK projects


As shown in Figure 2, you can use the following mechanisms as proactive controls:

  1. You can validate artifacts such as IaC templates locally on your machine by using the AWS CloudFormation Guard CLI, which facilitates a shift-left testing strategy. The advantage of this approach is the relatively early testing in the deployment cycle. This supports rapid iterative development and thus reduces waiting times.

    Alternatively, you can use the CfnGuardValidator plugin for AWS CDK, which integrates CloudFormation Guard rules into the AWS CDK CLI. This streamlines local development by applying policies and best practices directly within the CDK project.

  2. To centrally enforce validation checks, integrate the CfnGuardValidator plugin into a CDK CI/CD pipeline.
  3. You can also invoke the CloudFormation Guard CLI from within AWS CodeBuild buildspecs to embed CloudFormation Guard scans in a CI/CD pipeline.
  4. With CloudFormation hooks, you can impose policies on resources before CloudFormation deploys them.

AWS CloudFormation Guard uses a policy-as-code approach to evaluate IaC documents such as AWS CloudFormation templates and Terraform configuration files. The tool defines validation rules in the Guard language to check that these JSON or YAML documents align with best practices and organizational policies around provisioning cloud resources. By codifying rules and scanning infrastructure definitions programmatically, CloudFormation Guard automates policy enforcement and helps promote consistency and security across infrastructure deployments.

In the following example, you will use CloudFormation Guard to validate the name of an Amazon Simple Storage Service (Amazon S3) bucket in a CloudFormation template through a simple Guard rule:

To validate the S3 bucket

  1. Install CloudFormation Guard locally. For instructions, see Setting up AWS CloudFormation Guard.
  2. Create a YAML file named template.yaml with the following content and replace <DOC-EXAMPLE-BUCKET> with a bucket name of your choice (this file is a CloudFormation template, which creates an S3 bucket):
    Resources:
      S3Bucket:
        Type: 'AWS::S3::Bucket'
        Properties:
          BucketName: '<DOC-EXAMPLE-BUCKET>'

  3. Create a text file named rules.guard with the following content:
    rule checkBucketName {
        Resources.S3Bucket.Properties.BucketName == '<DOC-EXAMPLE-BUCKET>'
    }

  4. To validate your CloudFormation template against your Guard rules, run the following command in your local terminal:
    cfn-guard validate --rules rules.guard --data template.yaml

  5. If CloudFormation Guard successfully validates the template, the validate command produces an exit status of 0 ($? in bash). Otherwise, it returns a status report listing the rules that failed. You can test this yourself by changing the bucket name.

To accelerate the writing of Guard rules, use the CloudFormation Guard rulegen command, which takes a CloudFormation template file as an input and autogenerates Guard rules that match the properties of the template resources. To learn more about the structure of CloudFormation Guard rules and how to write them, see Writing AWS CloudFormation Guard rules.

The AWS Guard Rules Registry provides ready-to-use CloudFormation Guard rule files to accelerate your compliance journey, so that you don’t have to write them yourself.

Through the CDK plugin interface for policy validation, the CfnGuardValidator plugin integrates CloudFormation Guard rules into the AWS CDK and validates generated CloudFormation templates automatically during its synthesis step. For more details, see the plugin documentation and Accelerating development with AWS CDK plugin – CfnGuardValidator.

CloudFormation Guard alone can’t necessarily prevent the provisioning of noncompliant resources. This is because CloudFormation Guard can’t detect when templates or other documents change after validation. Therefore, I recommend that you combine CloudFormation Guard with a more authoritative mechanism.

One such mechanism is CloudFormation hooks, which you can use to validate AWS resources before you deploy them. You can configure hooks to cancel the deployment process with an alert if CloudFormation templates aren’t compliant, or to just initiate an alert and complete the process. To learn more, see the CloudFormation hooks documentation and related blog posts.

CloudFormation hooks provide a way to authoritatively enforce rules for resources deployed through CloudFormation. However, they don’t control resource creation that occurs outside of CloudFormation, such as through the console, CLI, SDK, or API. Terraform is one example that provisions resources directly through the AWS API rather than through CloudFormation templates. Because of this, I recommend that you implement additional detective controls by using AWS Config. AWS Config can continuously check resource configurations after deployment, regardless of the provisioning method. Using AWS Config rules complements the preventative capabilities of CloudFormation hooks.

Preventative controls

Preventative controls can help maintain compliance by applying guardrails that disallow policy-violating actions. AWS Control Tower integrates with AWS Organizations to implement preventative controls with SCPs. By using SCPs, you can restrict IAM permissions granted in a given organization or organizational unit (OU). One example of this is the selective activation of certain AWS Regions to meet data residency requirements.

SCPs are particularly valuable for managing IAM permissions across large environments with multiple AWS accounts. Organizations with many accounts might find it challenging to monitor and control IAM permissions. SCPs help address this challenge by applying centralized permission guardrails automatically to the accounts of an organization or organizational unit (OU). As new accounts are added, the SCPs are enforced without the need for extra configuration.

You can define SCPs through CloudFormation or CDK templates and deploy them through a CI/CD pipeline, similar to other AWS resources. Because misconfigured SCPs can negatively affect an organization’s operations, it’s vital that you test and simulate the effects of new policies in a sandbox environment before broader deployment. For an example of how to implement a pipeline for SCP testing, see the aws-service-control-policies-deployment GitHub repository.

To learn more about SCPs and how to implement them, see Service control policies (SCPs) and Best Practices for AWS Organizations Service Control Policies in a Multi-Account Environment.
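
To make the SCP discussion concrete, the following sketch creates and attaches a data residency guardrail that denies requests outside two approved Regions. The allowed Regions, the exempted global services, and the target OU ID are assumptions for illustration; as noted above, test a policy like this in a sandbox OU before broader deployment.

import json
import boto3

org = boto3.client("organizations")

# Deny requests outside the approved Regions, except for a few global services
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideAllowedRegions",
            "Effect": "Deny",
            "NotAction": [
                "iam:*",
                "organizations:*",
                "sts:*",
                "cloudfront:*",
                "route53:*",
                "support:*",
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ["eu-central-1", "eu-west-1"]}
            },
        }
    ],
}

policy = org.create_policy(
    Name="deny-regions-outside-eu",                       # illustrative name
    Description="Deny requests outside approved EU Regions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid-exampleouid",              # placeholder OU ID
)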

Detective controls

Detective controls help detect noncompliance with existing resources. You can implement detective controls by using AWS Config rules, with both managed rules (provided by AWS) and custom rules available. You can implement custom rules either by using the domain-specific language Guard or Lambda functions. To learn more about the Guard option, see Evaluate custom configurations using AWS Config Custom Policy rules and the open source sample repository. For guidance on creating custom rules using Lambda functions, see AWS Config Rule Development Kit library: Build and operate rules at scale and Deploying Custom AWS Config Rules in an AWS Organization Environment.

To simplify audits for compliance frameworks such as PCI-DSS, HIPAA, and SOC2, AWS Config also offers conformance packs that bundle rules and remediation actions. To learn more about conformance packs, see Conformance Packs and Introducing AWS Config Conformance Packs.

When a resource’s configuration shifts to a noncompliant state that preventive controls didn’t avert, detective controls can help remedy the noncompliant state by implementing predefined actions, such as alerting an operator or reconfiguring the resource. You can implement these controls with AWS Config, which integrates with AWS Systems Manager Automation to help enable the remediation of noncompliant resources.

Security Hub can help centralize the detection of noncompliant resources across multiple AWS accounts. Using AWS Config and third-party tools for detection, Security Hub sends findings of noncompliance to Amazon EventBridge, which can then send notifications or launch automated remediations. You can also use the security controls and standards in Security Hub to monitor the configuration of your AWS infrastructure. This complements the conformance packs in AWS Config.
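
As a small example of this event-driven pattern, the following sketch creates an EventBridge rule that matches imported Security Hub findings and forwards them to an SNS topic. The rule name and topic ARN are placeholders; a Lambda remediation function could be used as the target instead.

import json
import boto3

events = boto3.client("events", region_name="us-east-1")

# Rule that matches findings Security Hub imports from AWS Config and other sources
events.put_rule(
    Name="securityhub-findings-to-ops",
    EventPattern=json.dumps(
        {
            "source": ["aws.securityhub"],
            "detail-type": ["Security Hub Findings - Imported"],
        }
    ),
    State="ENABLED",
)

# Forward matching findings to an operations notification topic
events.put_targets(
    Rule="securityhub-findings-to-ops",
    Targets=[
        {
            "Id": "notify-ops",
            "Arn": "arn:aws:sns:us-east-1:111122223333:ops-notifications",  # placeholder
        }
    ],
)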

Conclusion

Many large and fast-growing organizations are faced with the challenge that manual IT governance processes are difficult to scale and can hinder growth. Policy-as-code services help to manage permissions and resource configurations at scale by automating key IT governance processes and, at the same time, increasing the quality and transparency of those processes. This helps to reconcile large environments with key governance objectives such as compliance.

In this post, you learned how to use policy as code to enhance IT governance. A first step is to activate AWS Control Tower, which provides preconfigured guardrails (SCPs) for each AWS account within an organization. These guardrails help enforce baseline compliance across infrastructure. You can then layer on additional controls to further strengthen governance in line with your needs. As a second step, you can select AWS Config conformance packs and Security Hub standards to complement the controls that AWS Control Tower offers. Finally, you can secure applications built on AWS by using Amazon Verified Permissions and Cedar for fine-grained authorization.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Roland Odorfer


Roland is a Solutions Architect at AWS, based in Berlin, Germany. He works with German industry and manufacturing customers, helping them architect secure and scalable solutions. Roland is interested in distributed systems and security. He enjoys helping customers use the cloud to solve complex challenges.

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Post Syndicated from Sivaprasad Mahamkali original https://aws.amazon.com/blogs/big-data/modernize-your-etl-platform-with-aws-glue-studio-a-case-study-from-bms/

This post is co-written with Ramesh Daddala, Jitendra Kumar Dash and Pavan Kumar Bijja from Bristol Myers Squibb.

Bristol Myers Squibb (BMS) is a global biopharmaceutical company whose mission is to discover, develop, and deliver innovative medicines that help patients prevail over serious diseases. BMS is consistently innovating, achieving significant clinical and regulatory successes. In collaboration with AWS, BMS identified a business need to migrate and modernize their custom extract, transform, and load (ETL) platform to a native AWS solution to reduce complexities, resources, and investment to upgrade when new Spark, Python, or AWS Glue versions are released. In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine. AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor ETL jobs in AWS Glue. Offering this service reduced BMS’s operational maintenance and cost, and offered flexibility to business users to perform ETL jobs with ease.

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. Although this framework met their ETL objectives, it was difficult to maintain and upgrade. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). Each time the newer version of Apache Spark (and corresponding AWS Glue version) was released, it required significant operational support and time-consuming manual changes to upgrade existing ETL jobs. Manually upgrading, testing, and deploying over 5,000 jobs every few quarters was time consuming, error prone, costly, and not sustainable. Because another release for the EDLS framework was pending, BMS decided to assess alternate managed solutions to reduce their operational and upgrade challenges.

In this post, we share how BMS plans to modernize its ETL platform using AWS Glue Studio, building on the success of the proof of concept.

Solution overview

This solution addresses BMS’s EDLS requirements, informed by the proof of concept: overcome the challenges of a custom-built ETL framework that required frequent maintenance and component upgrades (with extensive testing cycles), avoid that complexity, and reduce the overall cost of the underlying infrastructure. BMS had the following goals:

  • Develop ETL jobs using visual workflows provided by the AWS Glue Studio visual editor. The AWS Glue Studio visual editor is a low-code environment that allows you to compose data transformation workflows, seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine, and inspect the schema and data results in each step of the job.
  • Migrate over 5,000 existing ETL jobs using native AWS Glue Studio in an automated and scalable manner.

EDLS job steps and metadata

Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework. Each job step performs one of the following ETL functions:

  • File ingest – File ingestion enables you to ingest or list files from multiple file sources, such as Amazon Simple Storage Service (Amazon S3), SFTP, and more. The metadata holds configurations for the file ingestion step to connect to Amazon S3 or SFTP endpoints and ingest files to the target location. It retrieves the specified files and the available metadata to show on the UI.
  • Data quality check – The data quality module enables you to perform quality checks on large volumes of data and generate reports that describe and validate the data quality. The data quality step uses an EDLS-ingested source object from Amazon S3 and runs one or more data conformance checks that are configured by the tenant.
  • Data transform join – This submodule of the data transform module can perform joins between datasets using a custom SQL statement based on the metadata configuration.
  • Database ingest – The database ingestion step is one of the important service components in EDLS, which enables you to obtain and import the desired data from a database and export it to a specific file in the location of your choice.
  • Data transform – The data transform module performs various data transformations against the source data using JSON-driven rules. Each data transform capability has its own JSON rule, and based on the specific JSON rule you provide, EDLS performs the data transformation on the files available in the Amazon S3 location.
  • Data persistence – The data persistence module is one of the important service components in EDLS, which enables you to obtain the desired data from the source and persist it to an Amazon Relational Database Service (Amazon RDS) database.

The metadata corresponding to each job step includes ingest sources, transformation rules, data quality checks, and data destinations stored in an RDS instance.

Migration utility

The solution involves building a Python utility that reads EDLS metadata from the RDS database and translating each of the job steps into an equivalent AWS Glue Studio visual editor JSON node representation.

AWS Glue Studio provides two types of transforms:

  • AWS Glue-native transforms – These are available to all users and are managed by AWS Glue.
  • Custom visual transforms – This new functionality allows you to upload custom-built transforms used in AWS Glue Studio. Custom visual transforms expand the managed transforms, enabling you to search and use transforms from the AWS Glue Studio interface.

The following is a high-level diagram depicting the sequence flow of migrating a BMS EDLS job to an AWS Glue Studio visual editor job.

Migrating BMS EDLS jobs to AWS Glue Studio includes the following steps:

  1. The Python utility reads existing metadata from the EDLS metadata database.
  2. For each job step type, based on the job metadata, the Python utility selects either the native AWS Glue transform, if available, or a custom-built visual transform (when the native functionality is missing).
  3. The Python utility parses the dependency information from metadata and builds a JSON object representing a visual workflow represented as a Directed Acyclic Graph (DAG).
  4. The JSON object is sent to the AWS Glue API, creating the AWS Glue ETL job. These jobs are visually represented in the AWS Glue Studio visual editor using a series of sources, transforms (native and custom), and targets.

Sample ETL job generation using AWS Glue Studio

The following flow diagram depicts a sample ETL job that incrementally ingests the source RDBMS data in AWS Glue based on modified timestamps using a custom SQL and merges it into the target data on Amazon S3.

The preceding ETL flow can be represented using the AWS Glue Studio visual editor through a combination of native and custom visual transforms.

Custom visual transform for incremental ingestion

After the POC, BMS and AWS identified that custom transforms would be needed to run a subset of jobs from the current EDLS service for which native AWS Glue Studio functionality isn’t a natural fit. The BMS team’s requirement was to ingest data from various databases without depending on the existence of transaction logs or a specific schema, so AWS Database Migration Service (AWS DMS) wasn’t an option for them. AWS Glue Studio provides the native SQL query visual transform, where a custom SQL query can be used to transform the source data. However, in order to query the source database table based on a modified timestamp column to retrieve new and modified records since the last ETL run, the previous timestamp column state needs to be persisted so it can be used in the current ETL run. This needs to be a recurring process and can also be abstracted across various RDBMS sources, including Oracle, MySQL, Microsoft SQL Server, SAP HANA, and more.

AWS Glue provides a job bookmark feature to track the data that has already been processed during a previous ETL run. An AWS Glue job bookmark supports one or more columns as the bookmark keys to determine new and processed data, and it requires that the keys are sequentially increasing or decreasing without gaps. Although this works for many incremental load use cases, the requirement is to ingest data from different sources without depending on any specific schema, so we didn’t use an AWS Glue job bookmark in this use case.

The SQL-based incremental ingestion pull can be developed in a generic way using a custom visual transform using a sample incremental ingestion job from a MySQL database. The incremental data is merged into the target Amazon S3 location in Apache Hudi format using an upsert write operation.

In the following example, we’re using the MySQL data source node to define the connection, but the DynamicFrame of the data source itself is not used. The custom transform node (DB incremental ingestion) acts as the source for reading the data incrementally, using the custom SQL query and the previously persisted timestamp from the last ingestion.

The transform accepts as input parameters the preconfigured AWS Glue connection name, database type, table name, and custom SQL (parameterized timestamp field).

The following is the sample visual transform Python code:

import boto3
from awsglue import DynamicFrame
from datetime import datetime

region_name = "us-east-1"

dyna_client = boto3.client('dynamodb', region_name=region_name)
# Fallback timestamp used when no previous run is recorded in DynamoDB
HISTORIC_DATE = datetime(1970, 1, 1).strftime("%Y-%m-%d %H:%M:%S")
DYNAMODB_TABLE = "edls_run_stats"

def db_incremental(self, transformation_node, con_name, con_type, table_name, sql_query):
    """Read new and modified records from the source database since the last ingestion run."""
    logger = self.glue_ctx.get_logger()

    # Retrieve the timestamp persisted by the previous ETL run
    last_updt_tmst = get_table_last_updt_tmst(logger, DYNAMODB_TABLE, transformation_node)

    logger.info(f"Last updated timestamp from the DynamoDB-> {last_updt_tmst}")

    # Substitute the persisted timestamp into the parameterized custom SQL query
    sql_query = sql_query.format(**{"lastmdfdtmst": last_updt_tmst})

    connection_options_source = {
        "useConnectionProperties": "true",
        "connectionName": con_name,
        "dbtable": table_name,
        "sampleQuery": sql_query
    }

    # Read incrementally from the source using the preconfigured AWS Glue connection
    df = self.glue_ctx.create_dynamic_frame.from_options(connection_type=con_type, connection_options=connection_options_source)

    return df

# Register the custom transform on DynamicFrame so AWS Glue Studio can invoke it
DynamicFrame.db_incremental = db_incremental

def get_table_last_updt_tmst(logger, table_name, transformation_node):
    # Look up the last ingested timestamp for this transformation node
    response = dyna_client.get_item(TableName=table_name,
                                    Key={'transformation_node': {'S': transformation_node}}
                                    )
    if 'Item' in response and 'last_updt_tmst' in response['Item']:
        return response['Item']['last_updt_tmst']['S']
    else:
        return HISTORIC_DATE

To merge the source data into the Amazon S3 target, a data lake framework like Apache Hudi or Apache Iceberg can be used, which is natively supported in AWS Glue 3.0 and later.

You can also use Amazon EventBridge to detect the final AWS Glue job state change and update the Amazon DynamoDB table’s last ingested timestamp accordingly.
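
The following is one possible sketch of that EventBridge-driven update, implemented as a Lambda function that handles the Glue Job State Change event. The assumption that the Glue job name maps directly to the transformation node key is for illustration only.

from datetime import datetime, timezone

import boto3

dynamodb = boto3.client("dynamodb")
DYNAMODB_TABLE = "edls_run_stats"


def lambda_handler(event, context):
    """Record the ingestion timestamp after a successful AWS Glue job run."""
    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return

    # Assumes the Glue job name is used as the transformation node key
    transformation_node = detail.get("jobName")
    run_tmst = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

    # Upsert the last ingested timestamp read by the custom transform above
    dynamodb.put_item(
        TableName=DYNAMODB_TABLE,
        Item={
            "transformation_node": {"S": transformation_node},
            "last_updt_tmst": {"S": run_tmst},
        },
    )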

Build the AWS Glue Studio job using the AWS SDK for Python (Boto3) and AWS Glue API

For the sample ETL flow and the corresponding AWS Glue Studio ETL job we showed earlier, the underlying CodeGenConfigurationNode struct (an AWS Glue job definition pulled using the AWS Command Line Interface (AWS CLI) command aws glue get-job --job-name <jobname>) is represented as a JSON object, shown in the following code:

"CodeGenConfigurationNodes": {<br />"node-1679802581077": {<br />"DynamicTransform": {<br />"Name": "DB Incremental Ingestion",<br />"TransformName": "db_incremental",<br />"Inputs": [<br />"node-1679707801419"<br />],<br />"Parameters": [<br />{<br />"Name": "node_name",<br />"Type": "str",<br />"Value": [<br />"job_123_incr_ingst_table1"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "jdbc_url",<br />"Type": "str",<br />"Value": [<br />"jdbc:mysql://database.xxxx.us-west-2.rds.amazonaws.com:3306/db_schema"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "db_creds",<br />"Type": "str",<br />"Value": [<br />"creds"<br />],<br />"IsOptional": false<br />},<br />{<br />"Name": "table_name",<br />"Type": "str",<br />"Value": [<br />"tables"<br />],<br />"IsOptional": false<br />}<br />]<br />}<br />}<br />}<br />}

The JSON object (ETL job DAG) represented in the CodeGenConfigurationNode is generated through a series of native and custom transforms with the respective input parameter arrays. This can be accomplished using Python JSON encoders that serialize the class objects to JSON and subsequently create the AWS Glue Studio visual editor job using the Boto3 library and AWS Glue API.

Inputs required to configure the AWS Glue transforms are sourced from the EDLS jobs metadata database. The Python utility reads the metadata information, parses it, and configures the nodes automatically.

The order and sequencing of the nodes is sourced from the EDLS jobs metadata, with one node becoming the input to one or more downstream nodes building the DAG flow.
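
The following sketch shows how the utility could submit the generated DAG to AWS Glue with Boto3, assuming the CreateJob API accepts the CodeGenConfigurationNodes structure shown earlier. The job name, role ARN, script location, and worker settings are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Dictionary with the same structure as the CodeGenConfigurationNodes JSON above,
# built by the Python utility from the EDLS metadata
code_gen_nodes = {...}

glue.create_job(
    Name="job_123_incr_ingst_table1",                        # placeholder job name
    Role="arn:aws:iam::111122223333:role/GlueJobRole",       # placeholder role ARN
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-script-bucket/scripts/job_123.py",  # placeholder
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    NumberOfWorkers=10,
    WorkerType="G.1X",
    CodeGenConfigurationNodes=code_gen_nodes,
)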

Benefits of the solution

The migration path will help BMS achieve their core objectives of decomposing their existing custom ETL framework to modular, visually configurable, less complex, and easily manageable pipelines using visual ETL components. The utility aids the migration of the legacy ETL pipelines to native AWS Glue Studio jobs in an automated and scalable manner.

With consistent out-of-the box visual ETL transforms in the AWS Glue Studio interface, BMS will be able to build sophisticated data pipelines without having to write code.

The custom visual transforms will extend AWS Glue Studio capabilities and fulfill some of the BMS ETL requirements where the native transforms are missing that functionality. Custom transforms will help define, reuse, and share business-specific ETL logic among all the teams. The solution increases the consistency between teams and keeps the ETL pipelines up to date by minimizing duplicate effort and code.

With minor modifications, the migration utility can be reused to automate migration of pipelines during future AWS Glue version upgrades.

Conclusion

The successful outcome of this proof of concept has shown that migrating over 5,000 jobs from BMS’s custom application to native AWS services can deliver significant productivity gains and cost savings. By moving to AWS, BMS will be able to reduce the effort required to support AWS Glue, improve DevOps delivery, and save an estimated 58% on AWS Glue spend.

These results are very promising, and BMS is excited to embark on the next phase of the migration. We believe that this project will have a positive impact on BMS’s business and help us achieve our strategic goals.


About the authors

Sivaprasad Mahamkali is a Senior Streaming Data Engineer at AWS Professional Services. Siva leads customer engagements related to real-time streaming solutions, data lakes, and analytics using open source and AWS services. Siva enjoys listening to music and loves to spend time with his family.

Dan Gibbar is a Senior Engagement Manager at AWS Professional Services. Dan leads healthcare and life science engagements collaborating with customers and partners to deliver outcomes. Dan enjoys the outdoors, attempting triathlons, music and spending time with family.

Shrinath Parikh is a Senior Cloud Data Architect with AWS. He works with customers around the globe to assist them with their data analytics, data lake, data lakehouse, serverless, governance, and NoSQL use cases. In Shrinath’s off time, he enjoys traveling, spending time with family, and learning/building new tools using cutting-edge technologies.

Ramesh Daddala is an Associate Director at BMS. Ramesh leads enterprise data engineering engagements related to enterprise data lake services (EDLS), collaborating with data partners to deliver and support enterprise data engineering and ML capabilities. Ramesh enjoys the outdoors, traveling, and loves to spend time with family.

Jitendra Kumar Dash is a Senior Cloud Architect at BMS with expertise in hybrid cloud services, Infrastructure Engineering, DevOps, Data Engineering, and Data Analytics solutions. He is passionate about food, sports, and adventure.

Pavan Kumar Bijja is a Senior Data Engineer at BMS. Pavan enables data engineering and analytical services to BMS Commercial domain using enterprise capabilities. Pavan leads enterprise metadata capabilities at BMS. Pavan loves to spend time with his family, playing Badminton and Cricket.

Shovan Kanjilal is a Senior Data Lake Architect working with strategic accounts in AWS Professional Services. Shovan works with customers to design data and machine learning solutions on AWS.

AWS Clean Rooms proof of concept scoping part 1: media measurement

Post Syndicated from Shaila Mathias original https://aws.amazon.com/blogs/big-data/aws-clean-rooms-proof-of-concept-scoping-part-1-media-measurement/

Companies are increasingly seeking ways to complement their data with external business partners’ data to build, maintain, and enrich their holistic view of their business at the consumer level. AWS Clean Rooms helps companies more easily and securely analyze and collaborate on their collective datasets—without sharing or copying each other’s underlying data. With AWS Clean Rooms, you can create a secure data clean room in minutes and collaborate with any other company on Amazon Web Services (AWS) to generate unique insights.

One way to quickly get started with AWS Clean Rooms is with a proof of concept (POC) between you and a priority partner. AWS Clean Rooms supports multiple industries and use cases, and this blog is the first of a series on types of proof of concepts that can be conducted with AWS Clean Rooms.

In this post, we outline planning a POC to measure media effectiveness in a paid advertising campaign. The collaborators are a media owner (“CTV.Co,” a connected TV provider) and a brand advertiser (“Coffee.Co,” a quick service restaurant company) that are analyzing their collective data to understand the impact of an advertising campaign on sales. We chose to start this series with media measurement because “Results & Measurement” was the top ranked use case for data collaboration by customers in a recent survey the AWS Clean Rooms team conducted.

Important to keep in mind

  • AWS Clean Rooms is generally available so any AWS customer can sign in to the AWS Management Console and start using the service today without additional paperwork.
  • With AWS Clean Rooms, you can perform two types of analyses: SQL queries and machine learning. For the purpose of this blog, we will be focusing only on SQL queries. You can learn more about both types of analyses and their cost structures on the AWS Clean Rooms Features and Pricing webpages. The AWS Clean Rooms team can help you estimate the cost of a POC and can be reached at [email protected].
  • While AWS Clean Rooms supports multiparty collaboration, we assume two members in the AWS Clean Rooms POC collaboration in this blog post.

Overview

Setting up a POC helps define an existing problem of a specific use case for using AWS Clean Rooms with your partners. After you’ve determined who you want to collaborate with, we recommend three steps to set up your POC:

  • Defining the business context and success criteria – Determine which partner, which use case should be tested, and what the success criteria are for the AWS Clean Rooms collaboration.
  • Aligning on the technical choices for this test – Make the technical decisions about who sets up the clean room, who analyzes the data, which datasets are used, which join keys to use, and what analysis is run.
  • Outlining the workflow and timing – Create a workback plan, decide on synthetic data testing, and align on production data testing.

In this post, we walk through an example of how a quick service restaurant (QSR) coffee company (Coffee.Co) would set up a POC with a connected TV provider (CTV.Co) to determine the success of an advertising campaign.

Business context and success criteria for the POC

Define the use case to be tested

The first step in setting up the POC is defining the use case being tested with your partner in AWS Clean Rooms. For example, Coffee.Co wants to run a measurement analysis to determine the media exposure on CTV.Co that led to sign up for Coffee.Co’s loyalty program. AWS Clean Rooms allows for Coffee.Co and CTV.Co to collaborate and analyze their collective datasets without copying each other’s underlying data.

Success criteria

It’s important to determine metrics of success and acceptance criteria to move the POC to production upfront. For example, Coffee.Co’s goal is to achieve a sufficient match rate between their data set and CTV.Co’s data set to ensure the efficacy of the measurement analysis. Additionally, Coffee.Co wants the collaboration to be easy for existing Coffee.Co team members to set up, and wants to act on the resulting insights to optimize future media spend toward tactics on CTV.Co that will drive more loyalty members.

Technical choices for the POC

Determine the collaboration creator, AWS account IDs, query runner, payor and results receiver

Each AWS Clean Rooms collaboration is created by a single AWS account inviting other AWS accounts. The collaboration creator specifies which accounts are invited to the collaboration, who can run queries, who pays for the compute, who can receive the results, and the optional query logging and cryptographic computing settings. The creator is also able to remove members from a collaboration. In this POC, Coffee.Co initiates the collaboration by inviting CTV.Co. Additionally, Coffee.Co runs the queries and receives the results, but CTV.Co pays for the compute.

Query logging setting

If logging is enabled in the collaboration, AWS Clean Rooms allows each collaboration member to receive query logs. The collaborator running the queries, Coffee.Co, gets logs for all data tables while the other collaborator, CTV.Co, only sees the logs if their data tables are referenced in the query.

Decide the AWS region

The underlying Amazon Simple Storage Service (Amazon S3) and AWS Glue resources for the data tables used in the collaboration must be in the same AWS Region as the AWS Clean Rooms collaboration. For example, Coffee.Co and CTV.Co agree on the US East (Ohio) Region for their collaboration.

Join keys

To join data sets in an AWS Clean Rooms query, each side of the join must share a common key, and the join key comparison with the equal to operator (=) must evaluate to true. AND or OR logical operators can be used in the inner join for matching on multiple join columns. Keys such as email address, phone number, or UID2 are often considered. Third-party identifiers from LiveRamp, Experian, or Neustar can be used in the join through AWS Clean Rooms-specific workflows with each partner.

If sensitive data is being used as join keys, it’s recommended to use an obfuscation technique, such as hashing, to mitigate the risk of exposing sensitive data if the data is mishandled. Both parties must use a technique that produces the same obfuscated join key values. Cryptographic Computing for Clean Rooms can be used for this purpose.

In this POC, Coffee.Co and CTV.Co are joining on hashed email or hashed mobile. Both collaborators are using the SHA256 hash on their plaintext email and phone number when preparing their data sets for the collaboration.
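
The following sketch shows the kind of join-key preparation both collaborators would run before uploading their datasets. The normalization rules (trimming and lowercasing) are an assumption; both parties must agree on identical normalization for the hashes to match.

import hashlib


def hash_join_key(value: str) -> str:
    """Normalize a plaintext identifier and return its SHA-256 hex digest."""
    normalized = value.strip().lower()   # both parties must normalize identically (assumption)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


print(hash_join_key("Jane.Doe@Example.com"))   # hashed email
print(hash_join_key("+15555550123"))           # hashed mobile number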

Data schema

The exact data schema must be determined by collaborators to support the agreed upon analysis. In this POC, Coffee.Co is running a conversion analysis to measure media exposures on CTV.Co that led to sign-up for Coffee.Co’s loyalty program. Coffee.Co’s schema includes hashed email, hashed mobile, loyalty sign up date, loyalty membership type, and birthday of member. CTV.Co’s schema includes hashed email, hashed mobile, impressions, clicks, timestamp, ad placement, and ad placement type.

Analysis rule applied to each configured table associated to the collaboration

An AWS Clean Rooms configured table is a reference to an existing table in the AWS Glue Data Catalog that’s used in the collaboration. It contains an analysis rule that determines how the data can be queried in AWS Clean Rooms. Configured tables can be associated to one or more collaborations.

AWS Clean Rooms offers three types of analysis rules: aggregation, list, and custom.

  • Aggregation allows you to run queries that generate an aggregate statistic within the privacy guardrails set by each data owner. For example, how large the intersection of two datasets is.
  • List allows you to run queries that extract the row level list of the intersection of multiple data sets. For example, the overlapped records on two datasets.
  • Custom allows you to create custom queries and reusable templates using most industry standard SQL, as well as review and approve queries prior to your collaborator running them. For example, authoring an incremental lift query that’s the only query permitted to run on your data tables. You can also use AWS Clean Rooms Differential Privacy by selecting a custom analysis rule and then configuring your differential privacy parameters.

In this POC, CTV.Co uses the custom analysis rule and authors the conversion query. Coffee.Co adds this custom analysis rule to their data table, configuring the table for association to the collaboration. Coffee.Co is running the query, and can only run queries that CTV.Co authors on the collective datasets in this collaboration.

Planned query

Collaborators should define the query that will be run by the collaborator chosen to run the queries. In this POC, Coffee.Co runs the custom analysis rule query that CTV.Co authored to understand who signed up for their loyalty program after being exposed to an ad on CTV.Co. Coffee.Co can specify their desired time window parameter to analyze when the membership sign-up took place within a specific date range, because that parameter has been enabled in the custom analysis rule query.
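
As a sketch of how Coffee.Co might run the approved query programmatically, the following Boto3 call assumes the query was shared as an analysis template with a date-range parameter. The membership identifier, template ARN, parameter names, and S3 output location are all placeholders, and the request shape is an assumption; check the AWS Clean Rooms API reference before using it.

import boto3

cleanrooms = boto3.client("cleanrooms", region_name="us-east-2")  # US East (Ohio), per the POC

# Run the approved analysis template with the chosen sign-up date range (placeholders throughout)
cleanrooms.start_protected_query(
    type="SQL",
    membershipIdentifier="membership-id-placeholder",
    sqlParameters={
        "analysisTemplateArn": "arn:aws:cleanrooms:us-east-2:111122223333:membership/EXAMPLE/analysistemplate/EXAMPLE",
        "parameters": {
            "signup_start_date": "2024-01-01",
            "signup_end_date": "2024-02-28",
        },
    },
    resultConfiguration={
        "outputConfiguration": {
            "s3": {
                "resultFormat": "CSV",
                "bucket": "coffeeco-cleanrooms-results",   # placeholder bucket
                "keyPrefix": "loyalty-conversion-poc/",
            }
        }
    },
)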

Workflow and timeline

To determine the workflow and timeline for setting up the POC, the collaborators should set dates for the following activities.

  1. Coffee.Co and CTV.Co align on business context, success criteria, technical details, and prepare their data tables.
    • Example deadline: January 10.
  2. [Optional] Collaborators work to generate representative synthetic datasets for non-production testing prior to production data testing.
    • Example deadline: January 15
  3. [Optional] Each collaborator uses synthetic datasets to create an AWS Clean Rooms collaboration between two of their owned AWS non-production accounts and finalizes analysis rules and queries they want to run in production.
    • Example deadline: January 30
  4. [Optional] Coffee.Co and CTV.Co create an AWS Clean Rooms collaboration between non-production accounts and tests the analysis rules and queries with the synthetic datasets.
    • Example deadline: February 15
  5. Coffee.Co and CTV.Co create a production AWS Clean Rooms collaboration and run the POC queries on production data.
    • Example deadline: February 28
  6. Evaluate POC results against success criteria to determine when to move to production.
    • Example deadline: March 15

Conclusion

After you’ve defined the business context and success criteria for the POC, aligned on the technical details, and outlined the workflow and timing, the goal of the POC is to run a successful collaboration using AWS Clean Rooms to validate moving to production. After you’ve validated that the collaboration is ready to move to production, AWS can help you identify and implement automation mechanisms to programmatically run AWS Clean Rooms for your production use cases. Watch this video to learn more about privacy-enhanced collaboration and contact an AWS Representative to learn more about AWS Clean Rooms.

About AWS Clean Rooms

AWS Clean Rooms helps companies and their partners more easily and securely analyze and collaborate on their collective datasets—without sharing or copying one another’s underlying data. With AWS Clean Rooms, customers can create a secure data clean room in minutes, and collaborate with any other company on AWS to generate unique insights about advertising campaigns, investment decisions, and research and development.



About the authors

Shaila Mathias is a Business Development lead for AWS Clean Rooms at Amazon Web Services.

Allison Milone is a Product Marketer for the Advertising & Marketing Industry at Amazon Web Services.

Ryan Malecky is a Senior Solutions Architect at Amazon Web Services. He is focused on helping customers gain insights from their data, especially with AWS Clean Rooms.

IAM Access Analyzer simplifies inspection of unused access in your organization

Post Syndicated from Achraf Moussadek-Kabdani original https://aws.amazon.com/blogs/security/iam-access-analyzer-simplifies-inspection-of-unused-access-in-your-organization/

AWS Identity and Access Management (IAM) Access Analyzer offers tools that help you set, verify, and refine permissions. You can use IAM Access Analyzer external access findings to continuously monitor your AWS Organizations organization and Amazon Web Services (AWS) accounts for public and cross-account access to your resources, and verify that only intended external access is granted. Now, you can use IAM Access Analyzer unused access findings to identify unused access granted to IAM roles and users in your organization.

If you lead a security team, your goal is to manage security for your organization at scale and make sure that your team follows best practices, such as the principle of least privilege. When your developers build on AWS, they create IAM roles for applications and team members to interact with AWS services and resources. They might start with broad permissions while they explore AWS services for their use cases. To identify unused access, you can review the IAM last accessed information for a given IAM role or user and refine permissions gradually. If your company has a multi-account strategy, your roles and policies are created in multiple accounts. You then need visibility across your organization to make sure that teams are working with just the required access.

Now, IAM Access Analyzer simplifies inspection of unused access by reporting unused access findings across your IAM roles and users. IAM Access Analyzer continuously analyzes the accounts in your organization to identify unused access and creates a centralized dashboard with findings. From a delegated administrator account for IAM Access Analyzer, you can use the dashboard to review unused access findings across your organization and prioritize the accounts to inspect based on the volume and type of findings. The findings highlight unused roles, unused access keys for IAM users, and unused passwords for IAM users. For active IAM users and roles, the findings provide visibility into unused services and actions. With the IAM Access Analyzer integration with Amazon EventBridge and AWS Security Hub, you can automate and scale rightsizing of permissions by using event-driven workflows.

In this post, we’ll show you how to set up and use IAM Access Analyzer to identify and review unused access in your organization.

Generate unused access findings

To generate unused access findings, you need to create an analyzer. An analyzer is an IAM Access Analyzer resource that continuously monitors your accounts or organization for a given finding type. You can create an analyzer for external access findings or for unused access findings.

An analyzer for unused access findings is a new analyzer that continuously monitors roles and users, looking for permissions that are granted but not actually used. This analyzer is different from an analyzer for external access findings; you need to create a new analyzer for unused access findings even if you already have an analyzer for external access findings.

You can centrally view unused access findings across your accounts by creating an analyzer at the organization level. If you operate a standalone account, you can get unused access findings by creating an analyzer at the account level. This post focuses on the organization-level analyzer setup and management by a central team.

Pricing

IAM Access Analyzer charges for unused access findings based on the number of IAM roles and users analyzed per analyzer per month. You can still use IAM Access Analyzer external access findings at no additional cost. For more details on pricing, see IAM Access Analyzer pricing.

Create an analyzer for unused access findings

To enable unused access findings for your organization, you need to create your analyzer by using the IAM Access Analyzer console or APIs in your management account or a delegated administrator account. A delegated administrator is a member account of the organization that you can delegate with administrator access for IAM Access Analyzer. A best practice is to use your management account only for tasks that require the management account and use a delegated administrator for other tasks. For steps on how to add a delegated administrator for IAM Access Analyzer, see Delegated administrator for IAM Access Analyzer.

To create an analyzer for unused access findings (console)

  1. From the delegated administrator account, open the IAM Access Analyzer console, and in the left navigation pane, select Analyzer settings.
  2. Choose Create analyzer.
  3. On the Create analyzer page, do the following, as shown in Figure 1:
    1. For Findings type, select Unused access analysis.
    2. Provide a Name for the analyzer.
    3. Select a Tracking period. The tracking period is the threshold beyond which IAM Access Analyzer considers access to be unused. For example, if you select a tracking period of 90 days, IAM Access Analyzer highlights the roles that haven’t been used in the last 90 days.
    4. Set your Selected accounts. For this example, we select Current organization to review unused access across the organization.
    5. Select Create.
       
    Figure 1: Create analyzer page

Now that you’ve created the analyzer, IAM Access Analyzer starts reporting findings for unused access across the IAM users and roles in your organization. IAM Access Analyzer will periodically scan your IAM roles and users to update unused access findings. Additionally, if one of your roles, users or policies is updated or deleted, IAM Access Analyzer automatically updates existing findings or creates new ones. IAM Access Analyzer uses a service-linked role to review last accessed information for all roles, user access keys, and user passwords in your organization. For active IAM roles and users, IAM Access Analyzer uses IAM service and action last accessed information to identify unused permissions.

Note: Although IAM Access Analyzer is a regional service (that is, you enable it for a specific AWS Region), unused access findings are linked to IAM resources that are global (that is, not tied to a Region). To avoid duplicate findings and costs, enable your analyzer for unused access in the single Region where you want to review and operate findings.

IAM Access Analyzer findings dashboard

Your analyzer aggregates findings from across your organization and presents them on a dashboard. The dashboard aggregates, in the selected Region, findings for both external access and unused access—although this post focuses on unused access findings only. You can use the dashboard for unused access findings to centrally review the breakdown of findings by account or finding types to identify areas to prioritize for your inspection (for example, sensitive accounts, type of findings, type of environment, or confidence in refinement).

Unused access findings dashboard – Findings overview

Review the findings overview to identify the total findings for your organization and the breakdown by finding type. Figure 2 shows an example of an organization with 100 active findings. The finding type Unused access keys is present in each of the accounts, with the most findings for unused access. To move toward least privilege and to avoid long-term credentials, the security team should clean up the unused access keys.

Figure 2: Unused access finding dashboard

Unused access findings dashboard – Accounts with most findings

Review the dashboard to identify the accounts with the highest number of findings and the distribution per finding type. In Figure 2, the Audit account has the highest number of findings and might need attention. The account has five unused access keys and six roles with unused permissions. The security team should prioritize this account based on volume of findings and review the findings associated with the account.

Review unused access findings

In this section, we’ll show you how to review findings. We’ll share two examples of unused access findings, including unused access key findings and unused permissions findings.

Finding example: unused access keys

As shown previously in Figure 2, the IAM Access Analyzer dashboard showed that accounts with the most findings were primarily associated with unused access keys. Let’s review a finding linked to unused access keys.

To review the finding for unused access keys

  1. Open the IAM Access Analyzer console, and in the left navigation pane, select Unused access.
  2. Select your analyzer to view the unused access findings.
  3. In the search dropdown list, select the property Findings type, the Equals operator, and the value Unused access key to get only Findings type = Unused access key, as shown in Figure 3.
     
    Figure 3: List of unused access findings

  4. Select one of the findings to get a view of the available access keys for an IAM user, their status, creation date, and last used date. Figure 4 shows an example in which one of the access keys has never been used, and the other was used 137 days ago.
     
    Figure 4: Finding example – Unused IAM user access keys

From here, you can investigate further with the development teams to identify whether the access keys are still needed. If they aren’t needed, you should delete the access keys.
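You can also retrieve these findings programmatically, for example to route them to the teams that own the affected users. The following Boto3 sketch lists findings of type UnusedIAMUserAccessKey; the analyzer ARN is a placeholder, and the filter key and finding type string reflect our reading of the ListFindingsV2 API.

import boto3

access_analyzer = boto3.client("accessanalyzer")

# Placeholder ARN: replace with the ARN of your unused access analyzer
analyzer_arn = "arn:aws:access-analyzer:us-east-1:111122223333:analyzer/unused-access-analyzer"

# List findings for unused IAM user access keys
response = access_analyzer.list_findings_v2(
    analyzerArn=analyzer_arn,
    filter={"findingType": {"eq": ["UnusedIAMUserAccessKey"]}},
)
for finding in response["findings"]:
    print(finding["id"], finding["resource"], finding["status"])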

Finding example: unused permissions

Another goal that your security team might have is to make sure that the IAM roles and users across your organization are following the principle of least privilege. Let’s walk through an example with findings associated with unused permissions.

To review findings for unused permissions

  1. On the list of unused access findings, apply the filter on Findings type = Unused permissions.
  2. Select a finding, as shown in Figure 5. In this example, the IAM role has 148 unused actions on Amazon Relational Database Service (Amazon RDS) and has not used a service action for 200 days. Similarly, the role has unused actions for other services, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.
     
    Figure 5: Finding example – Unused permissions

The security team now has a view of the unused actions for this role and can investigate with the development teams to check if those permissions are still required.

The development team can then refine the permissions granted to the role to remove the unused permissions.

Unused access findings report unused permissions at the service level for all services, and at the action level for 200 services. For the list of supported actions, see IAM action last accessed information services and actions.

Take actions on findings

IAM Access Analyzer categorizes findings as active, resolved, and archived. In this section, we’ll show you how you can act on your findings.

Resolve findings

You can resolve unused access findings by deleting unused IAM roles, IAM users, IAM user credentials, or permissions. After you’ve completed this, IAM Access Analyzer automatically resolves the findings on your behalf.
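For example, after the owning team confirms that an access key reported in a finding is no longer needed, you can remove it with a single IAM call. This Boto3 sketch uses placeholder values for the user name and access key ID.

import boto3

iam = boto3.client("iam")

# Placeholders: replace with the IAM user and access key ID from the finding
iam.delete_access_key(
    UserName="example-app-user",
    AccessKeyId="AKIAIOSFODNN7EXAMPLE",
)

After the key is deleted, IAM Access Analyzer resolves the corresponding finding automatically during its next analysis.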

To speed up the process of removing unused permissions, you can use IAM Access Analyzer policy generation to generate a fine-grained IAM policy based on your access analysis. For more information, see the blog post Use IAM Access Analyzer to generate IAM policies based on access activity found in your organization trail.

Archive findings

You can suppress a finding by archiving it, which moves the finding from the Active tab to the Archived tab in the IAM Access Analyzer console. To archive a finding, open the IAM Access Analyzer console, select a Finding ID, and in the Next steps section, select Archive, as shown in Figure 6.

Figure 6: Archive finding in the AWS management console

You can automate this process by creating archive rules that archive findings based on their attributes. An archive rule is linked to an analyzer, which means that you can have archive rules exclusively for unused access findings.

To illustrate this point, imagine that you have a subset of IAM roles that you don’t expect to use in your tracking period. For example, you might have an IAM role that is used exclusively for break glass access during your disaster recovery processes—you shouldn’t need to use this role frequently, so you can expect some unused access findings. For this example, let’s call the role DisasterRecoveryRole. You can create an archive rule to automatically archive unused access findings associated with roles named DisasterRecoveryRole, as shown in Figure 7.

Figure 7: Example of an archive rule
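You can create the same archive rule programmatically. The following Boto3 sketch attaches the rule to the unused access analyzer and archives findings whose resource contains DisasterRecoveryRole; the rule name is an example, and the filter criterion reflects our reading of the CreateArchiveRule API, so confirm the supported filter keys in the documentation.

import boto3

access_analyzer = boto3.client("accessanalyzer")

# Automatically archive unused access findings for the break glass role
access_analyzer.create_archive_rule(
    analyzerName="unused-access-analyzer",
    ruleName="archive-disaster-recovery-role",
    filter={"resource": {"contains": ["DisasterRecoveryRole"]}},
)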

Automation

IAM Access Analyzer exports findings to both Amazon EventBridge and AWS Security Hub. Security Hub also forwards events to EventBridge.

Using an EventBridge rule, you can match the incoming events associated with IAM Access Analyzer unused access findings and send them to targets for processing. For example, you can notify the account owners so that they can investigate and remediate unused IAM roles, user credentials, or permissions.
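As a sketch of what such a workflow could look like, the following Boto3 example creates an EventBridge rule that matches IAM Access Analyzer events and forwards them to an Amazon SNS topic for notification. The detail-type string for unused access findings and the topic ARN are assumptions used for illustration; check the documented event format before relying on them.

import json
import boto3

events = boto3.client("events")

# Match IAM Access Analyzer findings events
# (the detail-type value below is an assumption; verify it against the documented event format)
events.put_rule(
    Name="unused-access-findings-rule",
    EventPattern=json.dumps({
        "source": ["aws.access-analyzer"],
        "detail-type": ["Unused Access Finding for IAM entities"],
    }),
)

# Forward matched events to a notification topic (placeholder ARN)
events.put_targets(
    Rule="unused-access-findings-rule",
    Targets=[{
        "Id": "notify-account-owners",
        "Arn": "arn:aws:sns:us-east-1:111122223333:unused-access-notifications",
    }],
)

Note that the target, in this case the SNS topic, must allow EventBridge to publish to it.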

For more information, see Monitoring AWS Identity and Access Management Access Analyzer with Amazon EventBridge.

Conclusion

With IAM Access Analyzer, you can centrally identify, review, and refine unused access across your organization. As summarized in Figure 8, you can use the dashboard to review findings and prioritize which accounts to review based on the volume of findings. The findings highlight unused roles, unused access keys for IAM users, and unused passwords for IAM users. For active IAM roles and users, the findings provide visibility into unused services and actions. By reviewing and refining unused access, you can improve your security posture and get closer to the principle of least privilege at scale.

Figure 8: Process to address unused access findings

The new IAM Access Analyzer unused access findings and dashboard are available in AWS Regions, excluding the AWS GovCloud (US) Regions and AWS China Regions. To learn more about how to use IAM Access Analyzer to detect unused access, see the IAM Access Analyzer documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Achraf Moussadek-Kabdani

Achraf is a Senior Security Specialist at AWS. He works with global financial services customers to assess and improve their security posture. He is both a builder and advisor, supporting his customers to meet their security objectives while making security a business enabler.

Yevgeniy Ilyin

Yevgeniy is a Solutions Architect at AWS. He has over 20 years of experience working at all levels of software development and solutions architecture and has used programming languages from COBOL and Assembler to .NET, Java, and Python. He develops and codes cloud-native solutions with a focus on big data, analytics, and data engineering.

Mathangi Ramesh

Mathangi is the product manager for IAM. She enjoys talking to customers and working with data to solve problems. Outside of work, Mathangi is a fitness enthusiast and a Bharatanatyam dancer. She holds an MBA degree from Carnegie Mellon University.

Use CodeWhisperer to identify issues and use suggestions to improve code security in your IDE

Post Syndicated from Peter Grainger original https://aws.amazon.com/blogs/security/use-codewhisperer-to-identify-issues-and-use-suggestions-to-improve-code-security-in-your-ide/

I’ve always loved building things, but when I first began as a software developer, my least favorite part of the job was thinking about security. The security of those first lines of code just didn’t seem too important. Only after struggling through security reviews at the end of a project, did I realize that a security focus at the start can save time and money, and prevent a lot of frustration.

This focus on security at the earliest phases of development is known in the DevOps community as DevSecOps. By adopting this approach, you can identify and improve security issues early, avoiding costly rework and reducing vulnerabilities in live systems. By using the security scanning capabilities of Amazon CodeWhisperer, you can identify potential security issues in your integrated development environment (IDE) as you code. After you identify these potential issues, CodeWhisperer can offer suggestions on how you can refactor to improve the security of your code early enough to help avoid the frustration of a last-minute change to your code.

In this post, I will show you how to get started with the code scanning feature of CodeWhisperer by using the AWS Toolkit for JetBrains extension in PyCharm to identify a potentially weak hashing algorithm in your IDE, and then use CodeWhisperer suggestions to quickly cycle through possible ways to improve the security of your code.

Overview of CodeWhisperer

CodeWhisperer understands comments written in natural language (in English) and can generate multiple code suggestions in real time to help improve developer productivity. The code suggestions are based on a large language model (LLM) trained on Amazon and publicly available code with identified security vulnerabilities removed during the training process. For more details, see Amazon CodeWhisperer FAQs.

Security scans are available in VS Code and JetBrains for Java, Python, JavaScript, C#, TypeScript, CloudFormation, Terraform, and AWS Cloud Development Kit (AWS CDK) with both Python and TypeScript. AWS CodeGuru Security uses a detection engine and a machine learning model that combines logistic regression and neural networks to find relationships and understand paths through code. CodeGuru Security can detect common security issues, log injection, secrets, and insecure use of AWS APIs and SDKs. The detection engine uses a Detector Library that has descriptions, examples, and additional information to help you understand why CodeWhisperer highlighted your code and whether you need to take action. You can start a scan manually through either the AWS Toolkit for Visual Studio Code or AWS Toolkit for JetBrains. To learn more, see How Amazon CodeGuru Security helps you effectively balance security and velocity.

CodeWhisperer code scan sequence

To illustrate how PyCharm, Amazon CodeWhisperer, and Amazon CodeGuru interact, Figure 1 shows a high-level view of the interactions between PyCharm and services within AWS. For more information about this interaction, see the Amazon CodeWhisperer documentation.

Figure 1: Sequence diagram of the security scan workflow

Communication from PyCharm to CodeWhisperer is HTTPS authenticated by using a bearer token in the authorization header of each request. As shown in Figure 1, when you manually start a security scan from PyCharm, the sequence is as follows:

  1. PyCharm sends a request to CodeWhisperer for a presigned Amazon Simple Storage Service (Amazon S3) upload URL, which initiates a request for an upload URL from CodeGuru. CodeWhisperer returns the URL to PyCharm.
  2. PyCharm archives the code in open PyCharm tabs along with linked third-party libraries into a gzip file and uploads this file directly to the S3 upload URL. The S3 bucket where the code is stored is encrypted at rest with strict access controls.
  3. PyCharm initiates the scan with CodeWhisperer, which creates a scan job with CodeGuru. CodeWhisperer returns the scan job ID that CodeGuru created to PyCharm.
  4. CodeGuru downloads the code from Amazon S3 and starts the code scan.
  5. PyCharm requests the status of the scan job from CodeWhisperer, which gets the scan status from CodeGuru. If the status is pending, PyCharm keeps polling CodeWhisperer for the status until the scan job is complete.
  6. When CodeWhisperer responds that the status of the scan job is complete, PyCharm requests the details of the security findings. The findings include the file path, line numbers, and details about the finding.
  7. The finding details are displayed in the PyCharm code editor window and in the CodeWhisperer Security Issues window.

Walkthrough

For this walkthrough, you will start by configuring PyCharm to use AWS Toolkit for JetBrains. Then you will create an AWS Builder ID to authenticate the extension with AWS. Next, you will scan Python code that CodeWhisperer will identify as a potentially weak hashing algorithm, and learn how to find more details. Finally, you will learn how to use CodeWhisperer to improve the security of your code by using suggestions.

Prerequisites

To follow along with this walkthrough, make sure that you have a JetBrains IDE, such as PyCharm, installed.

Install and authenticate the AWS Toolkit for JetBrains

This section provides step-by-step instructions on how to install and authenticate your JetBrains IDE. If you’ve already configured JetBrains or you’re using a different IDE, skip to the section Identify a potentially weak hashing algorithm by using CodeWhisperer security scans.

In this step, you will install the latest version of AWS Toolkit for JetBrains, create a new PyCharm project, sign up for an AWS Builder ID, and then use this ID to authenticate the toolkit with AWS. To authenticate with AWS, you need either an AWS Builder ID, AWS IAM Identity Center user details, or AWS IAM credentials. Creating an AWS Builder ID is the fastest way to get started and doesn’t require an AWS account, so that’s the approach I’ll walk you through here.

To install the AWS Toolkit for JetBrains

  1. Open the PyCharm IDE, and in the left navigation pane, choose Plugins.
  2. In the search box, enter AWS Toolkit.
  3. For the result — AWS Toolkit — choose Install.

Figure 2 shows the plugins search dialog and search results for the AWS Toolkit extension.

Figure 2: PyCharm plugins browser

To create a new project

  1. Open the PyCharm IDE.
  2. From the menu bar, choose File > New Project, and then choose Create.

To authenticate CodeWhisperer with AWS

  1. In the navigation pane, choose the AWS icon.
  2. In the AWS Toolkit section, choose the Developer Tools tab.
  3. Under CodeWhisperer, double-click the Start icon (play icon).
    Figure 3: Start CodeWhisperer

  4. In the AWS Toolkit: Add Connection section, select Use a personal email to sign up and sign in with AWS Builder ID, and then choose Connect.
    Figure 4: AWS Toolkit Add Connection

  5. For the Sign in with AWS Builder ID dialog box, choose Open and Copy Code.
  6. In the opened browser window, in the Authorize request section, in the Code field, paste the code that you copied in the previous step, and then choose Submit and continue.
    Figure 5: Authorize request page

  7. On the Create your AWS Builder ID page, do the following:
    1. For Email address, enter a valid current email address.
    2. Choose Next.
    3. For Your name, enter your full name.
    4. Choose Next.
      Figure 6: Create your AWS Builder ID

  8. Check your inbox for an email sent from [email protected] titled Verify your AWS Builder ID email address, and copy the verification code that’s in the email.
  9. In your browser, on the Email verification page, for Verification code, paste the verification code, and then choose Verify.
    Figure 7: Email verification

  10. On the Choose your password page, enter a Password and Confirm password, and then choose Create AWS Builder ID.
  11. In the Allow AWS Toolkit for JetBrains to access your data? section, choose Allow.
    Figure 8: Allow AWS Toolkit for JetBrains to access your data

  12. To confirm that the authentication was successful, in the PyCharm IDE navigation pane, select the AWS icon. In the AWS Toolkit window, make sure that Connected with AWS Builder ID is displayed.

Identify a potentially weak hashing algorithm by using CodeWhisperer security scans

The next step is to create a file that uses the SHA-224 hashing algorithm. CodeWhisperer considers this algorithm to be potentially weak and references Common Weakness Enumeration (CWE)-328. In this step, you use this weak hashing algorithm instead of the recommended SHA-256 algorithm so that you can see how CodeWhisperer flags this potential issue.

To create the file with the weak hashing algorithm (SHA-224)

  1. Create a new file in your PyCharm project named app.py
  2. Copy the following code snippet and paste it in the app.py file. In this code snippet, PBKDF2 is used with SHA-224, instead of the recommended SHA-256 algorithm.
    import hashlib
    import os
    
    salt = os.urandom(8)
    password = 'secret'.encode()
    # Noncompliant: potentially weak algorithm used.
    derivedkey = hashlib.pbkdf2_hmac('sha224', password, salt, 100000)
    derivedkey.hex()

To initiate a security scan

  • In the AWS Toolkit section of PyCharm, on the Developer Tools tab, double-click the play icon next to Run Security Scan. This opens a new tab called CodeWhisperer Security Issues that shows the scan was initiated successfully, as shown in Figure 9.
    Figure 9: AWS Toolkit window with security scan in progress

Interpret the CodeWhisperer security scan results

You can now interpret the results of the security scan.

To interpret the CodeWhisperer results

  1. When the security scan completes, CodeWhisperer highlights one of the rows in the main code editor window. To see a description of the identified issue, hover over the highlighted code. In our example, the issue that is displayed is CWE-327/328, as shown in Figure 10.
    Figure 10: Code highlighted with issue CWE-327,328 – Insecure hashing

  2. The issue description indicates that the algorithm used in the highlighted line might be weak. The first argument of the pbkdf2_hmac function shown in Figure 10 is the algorithm SHA-224, so we can assume this is the highlighted issue.

CodeWhisperer has highlighted SHA-224 as a potential issue. However, to understand whether or not you need to make changes to improve the security of your code, you must do further investigation. A good starting point for your investigation is the CodeGuru Detector Library, which powers the scanning capabilities of CodeWhisperer. The entry in the Detector Library for insecure hashing provides example code and links to additional information.

This additional information reveals that the SHA-224 output is truncated and is 32 bits shorter than SHA-256. Because the output is truncated, SHA-224 is more susceptible to collision attacks than SHA-256. SHA-224 has 112-bit security compared to the 128-bit security of SHA-256. A collision attack is a way to find another input that yields an identical hash created by the original input. The CodeWhisperer issue description for insecure hashing in Figure 10 describes this as a potential issue and is the reason that CodeWhisperer flagged the code. However, if the size of the hash result is important for your use case, SHA-224 might be the correct solution, and if so, you can ignore this warning. But if you don’t have a specific reason to use SHA-224 over other algorithms, you should consider the alternative suggestions that CodeWhisperer offers, which I describe in the next section.

Use CodeWhisperer suggestions to help remediate security issues

CodeWhisperer automatically generates suggestions in real time as you type based on your existing code and comments. Suggestions range from completing a single line of code to generating complete functions. However, because CodeWhisperer uses an LLM that is trained on vast amounts of data, you might receive multiple different suggestions. These suggestions might change over time, even when you give CodeWhisperer the same context. Therefore, you must use your judgement to decide if a suggestion is the correct solution.

To replace the algorithm

  1. In the previous step, you found that the first argument of the pbkdf2_hmac function contains the potentially vulnerable algorithm SHA-224. To initiate a suggestion for a different algorithm, delete the arguments from the function. The suggestion from CodeWhisperer was to change the algorithm from SHA-224 to SHA-256. However, because of the nature of LLMs, you could get a different suggested algorithm.
  2. To apply this suggestion and update your code, press Tab. Figure 11 shows what the suggestion looks like in the PyCharm IDE.
    Figure 11: CodeWhisperer auto-suggestions
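After you accept the suggestion, app.py should look similar to the following. SHA-256 is shown here because that's the suggestion we received; as noted earlier, your suggestion might differ.

import hashlib
import os

salt = os.urandom(8)
password = 'secret'.encode()
# Compliant: SHA-256 is used as the PBKDF2 hash algorithm
derivedkey = hashlib.pbkdf2_hmac('sha256', password, salt, 100000)
derivedkey.hex()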

Validate CodeWhisperer suggestions by rescanning the code

Although code with identified security vulnerabilities was removed from the data used to train the CodeWhisperer machine learning model, it's still possible that some suggestions will contain security vulnerabilities. Therefore, make sure that you fully understand the CodeWhisperer suggestions before you accept them and use them in your code. You are responsible for the code that you produce. In our example, other algorithms to consider are those from the SHA-3 family, such as SHA3-256. This family of algorithms is built on the sponge construction rather than the Merkle-Damgård structure that the SHA-1 and SHA-2 families are built on. As a result, the SHA-3 family offers greater resistance to certain attacks but can be slower to compute in some configurations and on some hardware. In this case, you have multiple options for replacing SHA-224 with a stronger algorithm. Before you decide which algorithm to use, test the performance on your target hardware. Whether you use the solution that CodeWhisperer proposes or an alternative, you should validate changes in the code by running the security scans again.
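If you want to compare candidate algorithms before you commit to one, a quick local check of digest size and relative speed can help. The following sketch uses only the Python standard library and arbitrary example input; treat the numbers it prints as rough indicators and rerun it on your target hardware.

import hashlib
import timeit

data = b"example payload"

# Compare digest sizes: SHA-224 produces a truncated 224-bit digest,
# while SHA-256 and SHA3-256 both produce 256-bit digests
for name in ("sha224", "sha256", "sha3_256"):
    digest = hashlib.new(name, data).digest()
    print(f"{name}: {len(digest) * 8}-bit digest")

# Rough relative timing for the two main candidates
for name in ("sha256", "sha3_256"):
    seconds = timeit.timeit(lambda: hashlib.new(name, data).digest(), number=100_000)
    print(f"{name}: {seconds:.3f} seconds for 100,000 hashes")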

To validate the CodeWhisperer suggestions

  • Choose Run Security Scan to rerun the scan. When the scan is complete, the CodeWhisperer Security Issues panel of PyCharm shows a notification that the rescan was completed successfully and no issues were found.
    Figure 12: Final security scan results

Conclusion

In this blog post, you learned how to set up PyCharm with CodeWhisperer, how to scan code for potential vulnerabilities with security scans, and how to view the details of these potential issues and understand the implications. To improve the security of your code, you reviewed and accepted CodeWhisperer suggestions, and ran the security scan again, validating the suggestion that CodeWhisperer made. Although many potential security vulnerabilities are removed during training of the CodeWhisperer machine learning model, you should validate these suggestions. CodeWhisperer is a great tool to help you speed up software development, but you are responsible for accepting or rejecting suggestions.

The example in this post showed how to identify a potentially insecure hash and improve the security of the algorithm. But CodeWhisperer security scans can detect much more, such as the Open Web Application Security Project (OWASP) top ten web application security risks, CWE top 25 most dangerous software weaknesses, log injection, secrets, and insecure use of AWS APIs and SDKs. The detector engine behind these scans uses the searchable Detector Library with descriptions, examples, and references for additional information.

In addition to using CodeWhisperer suggestions, you can also integrate security scanning into your CI/CD pipeline. By combining CodeWhisperer and automated release pipeline checks, you can detect potential vulnerabilities early with validation throughout the delivery process. Catching potential issues earlier can help you resolve them quickly and reduce the chance of frustrating delays late in the delivery process.

Prioritizing security throughout the development lifecycle can help you build robust and secure applications. By using tools such as CodeWhisperer and adopting DevSecOps practices, you can foster a security-conscious culture on your development team and help deliver safer software to your users.

If you want to explore code scanning on your own, CodeWhisperer is now generally available, and the individual tier is free for individual use. With CodeWhisperer, you can enhance the security of your code and minimize potential vulnerabilities before they become significant problems.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon CodeWhisperer re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Peter Grainger

Peter is a Technical Account Manager at AWS. He is based in Newcastle, England, and has over 14 years of experience in IT. Peter helps AWS customers build highly reliable and cost-effective systems and achieve operational excellence while running workloads on AWS. In his free time, he enjoys the outdoors and traveling.

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/build-and-manage-your-modern-data-stack-using-dbt-and-aws-glue-through-dbt-glue-the-new-trusted-dbt-adapter/

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt focuses on the transform layer of extract, load, transform (ELT) or extract, transform, load (ETL) processes across data warehouses and databases through specific engine adapters to achieve extract and load functionality. It enables data engineers, data scientists, and analytics engineers to define the business logic with SQL select statements and eliminates the need to write boilerplate data manipulation language (DML) and data definition language (DDL) expressions. dbt lets data engineers quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, continuous integration and continuous delivery (CI/CD), and documentation.

dbt is predominantly used by data warehouses (such as Amazon Redshift) customers who are looking to keep their data transform logic separate from storage and engine. We have seen a strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities.

In 2022, AWS published a dbt adapter called dbt-glue—the open source, battle-tested dbt AWS Glue adapter that allows data engineers to use dbt for cloud-based data lakes along with data warehouses and databases, paying for just the compute they need. The dbt-glue adapter democratized access to data lakes for dbt users, and enabled many users to effortlessly run their transformation workloads on the cloud with the serverless data integration capability of AWS Glue. Since the launch of the adapter, AWS has continued investing in dbt-glue to cover more requirements.

Today, we are pleased to announce that the dbt-glue adapter is now a trusted adapter based on our strategic collaboration with dbt Labs. Trusted adapters are adapters that are not maintained by dbt Labs, but that dbt Labs is comfortable recommending to users for use in production.

The key capabilities of the dbt-glue adapter are as follows:

  • Runs SQL as Spark SQL on AWS Glue interactive sessions
  • Manages table definitions on the AWS Glue Data Catalog
  • Supports open table formats such as Apache Hudi, Delta Lake, and Apache Iceberg
  • Supports AWS Lake Formation permissions for fine-grained access control

In addition to those capabilities, the dbt-glue adapter is designed to optimize resource utilization with several techniques on top of AWS Glue interactive sessions.

This post demonstrates how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter.

Common use cases

One common use case for using dbt-glue is if a central analytics team at a large corporation is responsible for monitoring operational efficiency. They ingest application logs into raw Parquet tables in an Amazon Simple Storage Service (Amazon S3) data lake. Additionally, they extract organized data from operational systems that captures the company's organizational structure and the costs of diverse operational components, and store it in the raw zone using Iceberg tables to maintain the original schema and facilitate easy access to the data. The team uses dbt-glue to build a transformed gold model optimized for business intelligence (BI). The gold model joins the technical logs with billing data and organizes the metrics per business unit. The gold model uses Iceberg's ability to support data warehouse-style modeling needed for performant BI analytics in a data lake. The combination of Iceberg and dbt-glue allows the team to efficiently build a data model that's ready to be consumed.

Another common use case is when an analytics team in a company that has an S3 data lake creates a new data product in order to enrich its existing data from its data lake with medical data. Let’s say that this company is located in Europe and the data product must comply with the GDPR. For this, the company uses Iceberg to meet needs such as the right to be forgotten and the deletion of data. The company uses dbt to model its data product on its existing data lake due to its compatibility with AWS Glue and Iceberg and the simplicity that the dbt-glue adapter brings to the use of this storage format.

How dbt and dbt-glue work

The following are key dbt features:

  • Project – A dbt project enforces a top-level structure on the staging, models, permissions, and adapters. A project can be checked into a GitHub repo for version control.
  • SQL – dbt relies on SQL select statements for defining data transformation logic. Instead of raw SQL, dbt offers templatized SQL (using Jinja) that allows code modularity. Instead of having to copy/paste SQL in multiple places, data engineers can define modular transforms and call those from other places within the project. Having a modular pipeline helps data engineers collaborate on the same project.
  • Models – dbt models are primarily written as a SELECT statement and saved as a .sql file. Data engineers define dbt models for their data representations. To learn more, refer to About dbt models.
  • Materializations – Materializations are strategies for persisting dbt models in a warehouse. There are five types of materializations built into dbt: table, view, incremental, ephemeral, and materialized view. To learn more, refer to Materializations and Incremental models.
  • Data lineage – dbt tracks data lineage, allowing you to understand the origin of data and how it flows through different transformations. dbt also supports impact analysis, which helps identify the downstream effects of changes.

The high-level data flow is as follows:

  1. Data engineers ingest data from data sources to raw tables and define table definitions for the raw tables.
  2. Data engineers write dbt models with templatized SQL.
  3. The dbt adapter converts dbt models to SQL statements compatible in a data warehouse.
  4. The data warehouse runs the SQL statements to create intermediate tables or final tables, views, or materialized views.

The following diagram illustrates the architecture.

dbt-glue works with the following steps:

  1. The dbt-glue adapter converts dbt models to SQL statements compatible in Spark SQL.
  2. AWS Glue interactive sessions run the SQL statements to create intermediate tables or final tables, views, or materialized views.
  3. dbt-glue supports csv, parquet, hudi, delta, and iceberg as file formats.
  4. On the dbt-glue adapter, table or incremental are commonly used for materializations at the destination. There are three strategies for incremental materialization. The merge strategy requires hudi, delta, or iceberg. With the other two strategies, append and insert_overwrite, you can use csv, parquet, hudi, delta, or iceberg.

The following diagram illustrates this architecture.

Example use case

In this post, we use the data from the New York City Taxi Records dataset. This dataset is available in the Registry of Open Data on AWS (RODA), which is a repository containing public datasets from AWS resources. The raw Parquet table named records in this dataset stores the trip records.

The objective is to create the following three tables, which contain metrics based on the raw table:

  • silver_avg_metrics – Basic metrics based on NYC Taxi Open Data for the year 2016
  • gold_passengers_metrics – Metrics per passenger based on the silver metrics table
  • gold_cost_metrics – Metrics per cost based on the silver metrics table

The final goal is to create two well-designed gold tables that store already aggregated results in Iceberg format for ad hoc queries through Amazon Athena.

Prerequisites

This walkthrough requires the following prerequisites:

  • An AWS Identity and Access Management (IAM) role with all the mandatory permissions to run an AWS Glue interactive session and the dbt-glue adapter
  • An AWS Glue database and table to store the metadata related to the NYC taxi records dataset
  • An S3 bucket to use as output and store the processed data
  • An Athena configuration (a workgroup and S3 bucket to store the output) to explore the dataset
  • An AWS Lambda function (created as an AWS CloudFormation custom resource) that updates all the partitions in the AWS Glue table

With these prerequisites, we simulate the situation that data engineers have already ingested data from data sources to raw tables, and defined table definitions for the raw tables.

For ease of use, we prepared a CloudFormation template. This template deploys all the required infrastructure. To create these resources, choose Launch Stack in the us-east-1 Region, and follow the instructions.

Install dbt, the dbt CLI, and the dbt adaptor

The dbt CLI is a command line interface for running dbt projects. It’s free to use and available as an open source project. Install dbt and the dbt CLI with the following code:

$ pip3 install --no-cache-dir dbt-core

For more information, refer to How to install dbt, What is dbt?, and Viewpoint.

Install the dbt adapter with the following code:

$ pip3 install --no-cache-dir dbt-glue

Create a dbt project

Complete the following steps to create a dbt project:

  1. Run the dbt init command to create and initialize a new empty dbt project:
    $ dbt init

  2. For the project name, enter dbt_glue_demo.
  3. For the database, choose glue.

Now the empty project has been created. The directory structure is shown as follows:

$ cd dbt_glue_demo 
$ tree .
.
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds
├── snapshots
└── tests

Create a source

The next step is to create a source table definition. We add models/source_tables.yml with the following contents:

version: 2

sources:
  - name: data_source
    schema: nyctaxi

    tables:
      - name: records

This source definition corresponds to the AWS Glue table nyctaxi.records, which we created in the CloudFormation stack.

Create models

In this step, we create a dbt model that represents the average values for trip duration, passenger count, trip distance, and total amount of charges. Complete the following steps:

  1. Create the models/silver/ directory.
  2. Create the file models/silver/silver_avg_metrics.sql with the following contents:
    WITH source_avg as ( 
        SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
        , avg(passenger_count) as avg_passenger_count 
        , avg(trip_distance) as avg_trip_distance 
        , avg(total_amount) as avg_total_amount
        , year
        , month 
        , type
        FROM {{ source('data_source', 'records') }} 
        WHERE year = "2016"
        AND dropoff_datetime is not null 
        GROUP BY year, month, type
    ) 
    SELECT *
    FROM source_avg

  3. Create the file models/silver/schema.yml with the following contents:
    version: 2
    
    models:
      - name: silver_avg_metrics
        description: This table has basic metrics based on NYC Taxi Open Data for the year 2016
    
        columns:
          - name: avg_duration
            description: The average duration of a NYC Taxi trip
    
          - name: avg_passenger_count
            description: The average number of passengers per NYC Taxi trip
    
          - name: avg_trip_distance
            description: The average NYC Taxi trip distance
    
          - name: avg_total_amount
            description: The average amount of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi 

  4. Create the models/gold/ directory.
  5. Create the file models/gold/gold_cost_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_trip_distance) as avg_cost_per_distance
    , (avg_total_amount/avg_duration) as avg_cost_per_minute
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  6. Create the file models/gold/gold_passengers_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_passenger_count) as avg_cost_per_passenger
    , (avg_duration/avg_passenger_count) as avg_duration_per_passenger
    , (avg_trip_distance/avg_passenger_count) as avg_trip_distance_per_passenger
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  7. Create the file models/gold/schema.yml with the following contents:
    version: 2
    
    models:
      - name: gold_cost_metrics
        description: This table has metrics per cost based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_distance
            description: The average cost per distance of a NYC Taxi trip
    
          - name: avg_cost_per_minute
            description: The average cost per minute of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip
    
          - name: type
            description: The type of the NYC Taxi
    
      - name: gold_passengers_metrics
        description: This table has metrics per passenger based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_passenger
            description: The average cost per passenger for a NYC Taxi trip
    
          - name: avg_duration_per_passenger
            description: The average trip duration per passenger for a NYC Taxi trip
    
          - name: avg_trip_distance_per_passenger
            description: The average trip distance per passenger for a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi

  8. Remove the models/example/ folder, because it’s just an example created in the dbt init command.

Configure the dbt project

dbt_project.yml is a key configuration file for dbt projects. It contains the following code:

models:
  dbt_glue_demo:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

We configure dbt_project.yml to replace the preceding code with the following:

models:
  dbt_glue_demo:
    silver:
      +materialized: table

This is because we want to materialize the models under silver as Parquet tables.

Configure a dbt profile

A dbt profile is a configuration that specifies how to connect to a particular database. The profiles are defined in the profiles.yml file within a dbt project.

Complete the following steps to configure a dbt profile:

  1. Create the profiles directory.
  2. Create the file profiles/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"

  3. Create the profiles/iceberg/ directory.
  4. Create the file profiles/iceberg/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"
          datalake_formats: "iceberg"
          conf: --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse="{{ env_var('DBT_S3_LOCATION') }}warehouse/" --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

The last two lines are added for setting Iceberg configurations on AWS Glue interactive sessions.

Run the dbt project

Now it’s time to run the dbt project. Complete the following steps:

  1. To run the dbt project, you should be in the project folder:
    $ cd dbt_glue_demo

  2. The project requires you to set environment variables in order to run on the AWS account:
    $ export DBT_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query "Account" --output text):role/GlueInteractiveSessionRole"
    $ export DBT_S3_LOCATION="s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1"

  3. Make sure the profile is set up correctly from the command line:
    $ dbt debug --profiles-dir profiles
    ...
    05:34:22 Connection test: [OK connection ok]
    05:34:22 All checks passed!

If you see any failures, check if you provided the correct IAM role ARN and S3 location in Step 2.

  4. Run the models with the following code:
    $ dbt run -m silver --profiles-dir profiles
    $ dbt run -m gold --profiles-dir profiles/iceberg/

Now the tables are successfully created in the AWS Glue Data Catalog, and the data is materialized in the Amazon S3 location.

You can verify those tables by opening the AWS Glue console, choosing Databases in the navigation pane, and opening dbt_glue_demo_nyc_metrics.
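You can also confirm the tables from the AWS SDK instead of the console. The following Boto3 sketch lists the tables that the dbt runs materialized in the demo database:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List the tables that the dbt runs created in the Data Catalog
response = glue.get_tables(DatabaseName="dbt_glue_demo_nyc_metrics")
for table in response["TableList"]:
    print(table["Name"])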

Query materialized tables through Athena

Let’s query the target table using Athena to verify the materialized tables. Complete the following steps:

  1. On the Athena console, switch the workgroup to athena-dbt-glue-aws-blog.
  2. If the workgroup athena-dbt-glue-aws-blog settings dialog box appears, choose Acknowledge.
  3. Use the following query to explore the metrics created by the dbt project:
    SELECT cm.avg_cost_per_minute
        , cm.avg_cost_per_distance
        , pm.avg_cost_per_passenger
        , cm.year
        , cm.month
        , cm.type
    FROM "dbt_glue_demo_nyc_metrics"."gold_passengers_metrics" pm
    LEFT JOIN "dbt_glue_demo_nyc_metrics"."gold_cost_metrics" cm
        ON cm.type = pm.type
        AND cm.year = pm.year
        AND cm.month = pm.month
    WHERE cm.type = 'yellow'
        AND cm.year = '2016'
        AND cm.month = '6'

The following screenshot shows the results of this query.
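If you prefer to run this verification outside the console, the following Boto3 sketch submits the same query to Athena in the workgroup created by the CloudFormation stack, waits for it to finish, and prints the first page of results:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT cm.avg_cost_per_minute, cm.avg_cost_per_distance, pm.avg_cost_per_passenger,
       cm.year, cm.month, cm.type
FROM "dbt_glue_demo_nyc_metrics"."gold_passengers_metrics" pm
LEFT JOIN "dbt_glue_demo_nyc_metrics"."gold_cost_metrics" cm
    ON cm.type = pm.type AND cm.year = pm.year AND cm.month = pm.month
WHERE cm.type = 'yellow' AND cm.year = '2016' AND cm.month = '6'
"""

# Submit the query to the workgroup created by the CloudFormation template
execution_id = athena.start_query_execution(
    QueryString=query,
    WorkGroup="athena-dbt-glue-aws-blog",
)["QueryExecutionId"]

# Wait for the query to complete
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print the header row and result rows
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])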

Review dbt documentation

Complete the following steps to review your documentation:

  1. Generate the following documentation for the project:
    $ dbt docs generate --profiles-dir profiles/iceberg
    11:41:51  Running with dbt=1.7.1
    11:41:51  Registered adapter: glue=1.7.1
    11:41:51  Unable to do partial parsing because profile has changed
    11:41:52  Found 3 models, 1 source, 0 exposures, 0 metrics, 478 macros, 0 groups, 0 semantic models
    11:41:52  
    11:41:53  Concurrency: 1 threads (target='dev')
    11:41:53  
    11:41:53  Building catalog
    11:43:32  Catalog written to /Users/username/Documents/workspace/dbt_glue_demo/target/catalog.json

  2. Run the following command to open the documentation on your browser:
    $ dbt docs serve --profiles-dir profiles/iceberg

  3. In the navigation pane, choose gold_cost_metrics under dbt_glue_demo/models/gold.

You can see the detailed view of the model gold_cost_metrics, as shown in the following screenshot.

  4. To see the lineage graph, choose the circle icon at the bottom right.

Clean up

To clean up your environment, complete the following steps:

  1. Delete the database created by dbt:
    $ aws glue delete-database --name dbt_glue_demo_nyc_metrics

  2. Delete all generated data:
    $ aws s3 rm s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive
    $ aws s3 rm s3://aws-athena-dbt-glue-query-results-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive

  3. Delete the CloudFormation stack:
    $ aws cloudformation delete-stack --stack-name dbt-demo

Conclusion

This post demonstrated how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter. You learned the end-to-end operations and data flow for data engineers to build and manage a data stack using dbt and the dbt-glue adapter. To report issues or request a feature enhancement, feel free to open an issue on GitHub.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team at Amazon Web Services. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Benjamin Menuet is a Senior Data Architect on the AWS Professional Services team at Amazon Web Services. He helps customers develop data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some iconic races like the UTMB.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Kinshuk Pahare is a Principal Product Manager on the AWS Glue team at Amazon Web Services.

Jason Ganz is the manager of the Developer Experience (DX) team at dbt Labs.

Amazon MSK now provides up to 29% more throughput and up to 24% lower costs with AWS Graviton3 support

Post Syndicated from Sai Maddali original https://aws.amazon.com/blogs/big-data/amazon-msk-now-provides-up-to-29-more-throughput-and-up-to-24-lower-costs-with-aws-graviton3-support/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data.

Today, we’re excited to bring the benefits of Graviton3 to Kafka workloads, with Amazon MSK now offering M7g instances for new MSK provisioned clusters. AWS Graviton processors are custom Arm-based processors built by AWS to deliver the best price-performance for your cloud workloads. For example, when running an MSK provisioned cluster using M7g.4xlarge instances, you can achieve up to 27% reduction in CPU usage and up to 29% higher write and read throughput compared to M5.4xlarge instances. These performance improvements, along with M7g’s lower prices provide up to 24% in compute cost savings over M5 instances.

In February 2023, AWS launched new Graviton3-based M7g instances. M7g instances are equipped with DDR5 memory, which provides up to 50% higher memory bandwidth than the DDR4 memory used in previous generations. M7g instances also deliver up to 25% higher storage throughput and up to 88% increase in network throughput compared to similar sized M5 instances to deliver price-performance benefits for Kafka workloads. You can read more about M7g features in New Graviton3-Based General Purpose (m7g) and Memory-Optimized (r7g) Amazon EC2 Instances.

Here are the specs for the M7g instances on MSK:

Name vCPUs Memory Network Bandwidth Storage Bandwidth
M7g.large 2 8 GiB up to 12.5 Gbps up to 10 Gbps
M7g.xlarge 4 16 GiB up to 12.5 Gbps up to 10 Gbps
M7g.2xlarge 8 32 GiB up to 15 Gbps up to 10 Gbps
M7g.4xlarge 16 64 GiB up to 15 Gbps up to 10 Gbps
M7g.8xlarge 32 128 GiB 15 Gbps 10 Gbps
M7g.12xlarge 48 192 GiB 22.5 Gbps 15 Gbps
M7g.16xlarge 64 256 GiB 30 Gbps 20 Gbps

M7g instances on Amazon MSK

Organizations are adopting Amazon MSK to capture and analyze data in real time, run machine learning (ML) workflows, and build event-driven architectures. Amazon MSK enables you to reduce operational overhead and run your applications with higher availability and durability. It also offers continued price-performance improvements with capabilities such as Tiered Storage. With compute making up a large portion of Kafka costs, customers wanted a way to optimize those costs further and saw Graviton instances as the quickest path. Amazon MSK has fully tested and validated M7g instances on all Kafka versions starting with version 2.8.2, so you can run critical workloads on them and benefit from Graviton3 cost savings.

You can get started by provisioning new clusters with the Graviton3-based M7g instances as the broker type using the AWS Management Console, APIs via the AWS SDK, and the AWS Command Line Interface (AWS CLI). M7g instances support all Amazon MSK and Kafka features, making it straightforward for you to run all your existing Kafka workloads with minimal changes. Amazon MSK supports Graviton3-based M7g instances from large through 16xlarge sizes to run all Kafka workloads.
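As a minimal Boto3 sketch, the following creates a provisioned cluster with M7g brokers. The subnet IDs, security group ID, Kafka version, and storage size are placeholders that you would replace with values for your own environment.

import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

# Placeholders: supply your own subnets, security group, and Kafka version (2.8.2 or later)
response = kafka.create_cluster(
    ClusterName="demo-m7g-cluster",
    KafkaVersion="3.5.1",
    NumberOfBrokerNodes=3,
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m7g.4xlarge",
        "ClientSubnets": [
            "subnet-0123456789abcdef0",
            "subnet-0123456789abcdef1",
            "subnet-0123456789abcdef2",
        ],
        "SecurityGroups": ["sg-0123456789abcdef0"],
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 1000}},
    },
)
print(response["ClusterArn"])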

Let’s take the M7g instances on MSK provisioned clusters for a test drive and see how it compares with Amazon MSK M5 instances.

M7g instances in action

Customers run a wide variety of workloads on Amazon MSK; some are latency sensitive, and some are throughput bound. In this post, we focus on M7g performance impact on throughput-bound workloads. M7g comes with an increase in network and storage throughput, providing a higher throughput per broker compared to an M5-based cluster.

To understand the implications, let’s look at how Kafka uses available throughput for writing or reading data. Every broker in the MSK cluster comes with a bounded storage and network throughput entitlement. Predominantly, writes in Kafka consume both storage and network throughput, whereas reads consume mostly network throughput. This is because a Kafka consumer is typically reading real-time data from a page cache and occasionally goes to disk to process old data. Therefore, the overall throughput gains also change based on the workload’s write to read throughput ratios.

Let’s look at the throughput gains based on an example. Our setup includes an MSK cluster with M7g.4xlarge instances and another with M5.4xlarge instances, with three nodes in three different Availability Zones. We also enabled TLS encryption, AWS Identity and Access Management (IAM) authentication, and a replication factor of 3 across both M7g and M5 MSK clusters. We also applied Amazon MSK best practices for broker configurations, including num.network.threads = 8 and num.io.threads = 16. On the client side for writes, we optimized the batch size with appropriate linger.ms and batch.size configurations. For the workload, we assumed 6 topics each with 64 partitions (384 per broker). For ingestion, we generated load with an average message size of 512 bytes and with one consumer group per topic. The amount of load sent to the clusters was identical.

As we ingest more data into the MSK cluster, the M7g.4xlarge instance supports higher throughput per broker, as shown in the following graph. After an hour of consistent writes, M7g.4xlarge brokers support up to 54 MB/s of write throughput vs. 40 MB/s with M5-based brokers, which represents a 29% increase.

We also see another important observation: M7g-based brokers consume much fewer CPU resources than M5s, even though they support 29% higher throughput. As seen in the following chart, CPU utilization of an M7g-based broker is on average 40%, whereas on an M5-based broker, it’s 47%.

As covered previously, customers may see different performance improvements based on the number of consumer groups, batch sizes, and instance size. We recommend referring to MSK Sizing and Pricing to calculate the M7g performance gains for your use case, or creating a cluster based on M7g instances and benchmarking the gains on your own.

Lower costs, lower operational burden, and higher resiliency

Since its launch, Amazon MSK has made it cost-effective to run your Kafka workloads, while still improving overall resiliency. Since day 1, you have been able to run brokers in multiple Availability Zones without worrying about additional networking costs. In October 2022, we launched Tiered Storage, which provides virtually unlimited storage at up to 50% lower costs. When you use Tiered Storage, you not only save on overall storage cost but also improve the overall availability and elasticity of your cluster.

Continuing down this path, we are now reducing compute costs for customers while still providing performance improvements. With M7g instances, Amazon MSK provides up to 24% savings on compute costs compared to similar sized M5 instances. When you move to Amazon MSK, you can not only lower your operational overhead using features such as Amazon MSK Connect, Amazon MSK Replicator, and automatic Kafka version upgrades, but also improve overall resiliency and reduce your infrastructure costs.

Pricing and Regions

M7g instances on Amazon MSK are available today in the US (Ohio, N. Virginia, N. California, Oregon), Asia Pacific (Hyderabad, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), and EU (Ireland, London, Spain, Stockholm) Regions.

Refer to Amazon MSK pricing to learn more about pricing for Graviton3-based instances with Amazon MSK.

Summary

In this post, we discussed the performance gains achieved while using Graviton-based M7g instances. These instances can provide significant improvement in read and write throughput compared to similar sized M5 instances for Amazon MSK workloads. To get started, create a new cluster with M7g brokers using the AWS Management Console, and read our documentation for more information.


About the Authors

Sai Maddali is a Senior Manager Product Management at AWS who leads the product team for Amazon MSK. He is passionate about understanding customer needs, and using technology to deliver services that empowers customers to build innovative applications. Besides work, he enjoys traveling, cooking, and running.

Umesh is a Streaming Solutions Architect at AWS. He works with AWS customers to design and build real time data processing systems. He has 13 years of working experience in software engineering including architecting, designing, and developing data analytics systems.

Lanre Afod is a Solutions Architect focused on Global Financial Services at AWS, passionate about helping customers deploy secure, scalable, highly available, and resilient architectures within the AWS Cloud.

Introducing new central configuration capabilities in AWS Security Hub

Post Syndicated from Nicholas Jaeger original https://aws.amazon.com/blogs/security/introducing-new-central-configuration-capabilities-in-aws-security-hub/

As cloud environments—and security risks associated with them—become more complex, it becomes increasingly critical to understand your cloud security posture so that you can quickly and efficiently mitigate security gaps. AWS Security Hub offers close to 300 automated controls that continuously check whether the configuration of your cloud resources aligns with the best practices identified by Amazon Web Services (AWS) security experts and with industry standards. Furthermore, you can manage your cloud security posture at scale by using a single action to enable Security Hub across your organization with the default settings, and by aggregating findings across your organization accounts and Regions to a single account and Region of your choice.

With the release of the new central configuration feature of Security Hub, the setup and management of control and policy configurations is simplified and centralized to the same account you have already been using to aggregate findings. In this blog post, we will explain the benefits of the new feature and describe how you can quickly onboard to it.

Central configuration overview

With the release of the new central configuration capabilities in Security Hub, you are now able to use your delegated administrator (DA) account (an AWS Organizations account designated to manage Security Hub throughout your organization) to centrally manage Security Hub controls and standards and to view your Security Hub configuration throughout your organization from a single place.

To facilitate this functionality, central configuration allows you to set up policies that specify whether or not Security Hub should be enabled and which standards and controls should be turned on. You can then choose to associate your policies with your entire organization or with specific accounts or organizational units (OUs), with your policies applying automatically across linked Regions. Policies applied to specific OUs (or to the entire organization) are inherited by child accounts. This not only applies to existing accounts, but also to new accounts added to those OUs (or to the entire organization) after you created the policy. Furthermore, when you add a new linked Region to Security Hub, your existing policies will be applied to that Region immediately.

This allows you to stop maintaining manual lists of accounts and Regions to which you’d like to apply your custom configurations; instead, you can maintain several policies for your organization, with each one being associated to a different set of accounts in your organization. As a result, by using the central configuration capabilities, you can significantly reduce the time spent on configuring Security Hub and switch your focus to remediating its findings.

After applying your policies, Security Hub also provides you with a view of your organization that shows the policy status per OU and account while also preventing drift. This means that after you set up your organization by using central configuration, account owners will not be able to deviate from your chosen settings—your policies will serve as the source of truth for your organizational configuration, and you can use them to understand how Security Hub is configured for your organization.

The use of the new central configuration feature is now the recommended approach to configuring Security Hub, and its standards and controls, across some or all AWS accounts in your AWS Organizations structure.

Prerequisites

To get started with central configuration, you need to complete three prerequisites:

  1. Enable AWS Config in the accounts and Regions where you plan to enable Security Hub. (For more information on how to optimize AWS Config configuration for Security Hub usage, see this blog post.)
  2. Turn on Security Hub in your AWS Organizations management account at least in one Region where you plan to use Security Hub.
  3. Use your Organizations management account to delegate an administrator account for Security Hub.

If you are new to Security Hub, simply navigate to it in the AWS Management Console from your organization management account, and the console will walk you through setting the last two prerequisites listed here. If you already use Security Hub, these can be configured from the Settings page in Security Hub. In both cases, upon completing these three prerequisites, you can proceed with the central configuration setup from the account you set as the DA.

Recommended setup

To begin the setup, open the Security Hub console from your AWS Organizations management account or from your Security Hub delegated administrator account. In the left navigation menu, choose Configuration to open the new Configuration page, shown in Figure 1. Choose Start central configuration.

Figure 1: The new Configuration page, where you can see your current organizational configuration and start using the new capabilities

Figure 1: The new Configuration page, where you can see your current organizational configuration and start using the new capabilities

If you signed in to Security Hub using the AWS Organizations management account, you will be brought to step 1, Designate delegated administrator, where you will be able to designate a new delegated administrator or confirm your existing selection before continuing the setup. If you signed in to Security Hub using your existing delegated administrator account, you will be brought directly to step 2, Centralize organization, which is shown in Figure 2. In step 2, you are first asked to choose your home Region, which is the AWS Region you will use to create your configuration policies. By default, the current Region is selected as your home Region, unless you already use cross-Region finding aggregation — in which case, your existing aggregation Region is pre-selected as your home Region.

You are then prompted to select your linked Regions, which are the Regions you will configure by using central configuration. Regions that were already linked as part of your cross-Region aggregation settings will be pre-selected. You will also be able to add additional Regions or choose to include all AWS Regions, including future Regions. If your selection includes opt-in Regions, note that Security Hub will not be enabled in them until you enable those Regions directly.

Figure 2: The Centralize organization page

Figure 2: The Centralize organization page

Step 3, Configure organization, is shown in Figure 3. You will see a recommendation that you use the AWS recommended Security Hub configuration policy (SHCP) across your entire organization. This includes enabling the AWS Foundational Security Best Practices (FSBP) v1.0.0 standard and enabling new and existing FSBP controls in accounts in your AWS Organizations structure. This is the recommended configuration for most customers, because the AWS FSBP have been carefully curated by AWS security experts and represent trusted security practices for customers to build on.

Alternatively, if you already have a custom configuration in Security Hub and would like to import it into the new capabilities, choose Customize my Security Hub configuration and then choose Pre-populate configuration.

Figure 3: Step 3 – creating your first policy

Figure 3: Step 3 – creating your first policy

Step 4, Review and apply, is where you can review the policy you just created. Until you complete this step, your organization’s configuration will not be changed. This step will override previous account configurations and create and apply your new policy. After you choose Create policy and apply, you will be taken to the new Configuration page, which was previously shown in Figure 1. The user interface will now be updated to include three tabs — Organization, Policies, and Invitation accounts — where you can do the following:

  • On the Organization tab, which serves as a single pane of glass for your organization configuration in Security Hub, you can see the policy status for each account and OU and verify that your desired configuration is in effect.
  • On the Policies tab, you can view your policies, update them, and create new ones.
  • On the Invitation accounts tab, you can view and update findings for invitation accounts, which do not belong to your AWS Organizations structure. These accounts cannot be configured using the new central configuration capabilities.

The organization chart you now see shows which of your accounts have already been affected by the policy you just created and which are still pending. Normally, an account will show as pending only for a few minutes after you create new policies or update existing ones. However, an account can stay in pending status for up to 24 hours. During this time, Security Hub will try to configure the account with your chosen policy settings.

If Security Hub determines that a policy cannot be successfully propagated to an account, it will show its status as failed (see Figure 4). This is most likely to happen when you missed completing the prerequisites in the account where the failure is showing. For example, if AWS Config is not yet enabled in an account, the policy will have a failed status. When you hover your pointer over the word “Failed”, Security Hub will show an error message with details about the issue. After you fix the error, you can try again to apply the policy by selecting the failed account and choosing the Re-apply policy button.

Figure 4: The Organization tab on the Configuration page shows all your organization accounts, if they are being managed by a policy, and the policy status for each account and OU

Figure 4: The Organization tab on the Configuration page shows all your organization accounts, if they are being managed by a policy, and the policy status for each account and OU

Flexibility in onboarding to central configuration

As mentioned earlier, central configuration makes it significantly easier for you to centrally manage Security Hub and its controls and standards. This feature also gives you the granularity to choose the specific accounts to which your chosen settings will be applied. Even though we recommend using central configuration to configure all your accounts, one advantage of the feature is that you can initially create a test configuration and then apply it across your organization. This is especially useful when you have already configured Security Hub using previously available methods and you would like to check that you have successfully imported your existing configuration.

When you onboard to central configuration, accounts in the organization are self-managed by default, which means that they still maintain their previous configuration until you apply a policy to them, to one of their parent OUs, or to the entire organization. This gives you the option to create a test policy when you onboard, apply it only to a test account or OU, and check that you achieved your desired outcome before applying it to other accounts in the organization.

Configure and deploy different policies per OU

Although we recommend that you use the policy recommended by Security Hub whenever possible, every customer has a different environment and some customization might be required. Central configuration does not require you to use the recommended policy, and you can instead create your own custom policies that specify how Security Hub is used across organization accounts and Regions. You can create one configuration policy for your entire organization, or multiple policies to customize Security Hub settings in different accounts.

In addition, you might need to implement different policies per OU. For example, you might need to do that when you have a finance account or OU in which you want to use Payment Card Industry Data Security Standard (PCI DSS) v3.2.1. In this case, you can go to the Policies tab, choose Create policy, specify the configuration you’d like to have, and apply it to those specific OUs or accounts, as shown in Figure 5. Note that each policy must be complete — which means that it must contain the full configuration settings you would like to apply to the chosen set of accounts or OUs. In particular, an account cannot inherit part of its settings from a policy associated with a parent OU, and the other part from its own policy. The benefit of this requirement is that each policy serves as the source of truth for the configuration of the accounts it is applied to. For more information on this behavior or on how to create new policies, see the Security Hub documentation.

Figure 5: Creation of a new policy with the FSBP and the PCI DSS standards

Figure 5: Creation of a new policy with the FSBP and the PCI DSS standards
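If you prefer to manage configuration policies as code, you can call the central configuration APIs from the delegated administrator account in your home Region. The following boto3 sketch creates a policy similar to the one shown in Figure 5 and associates it with an OU; the standard ARNs, OU ID, and exact parameter shapes are assumptions to validate against the Security Hub API reference for your Region.

import boto3

securityhub = boto3.client("securityhub")  # run from the delegated administrator, home Region

# Create a configuration policy that enables Security Hub with FSBP and PCI DSS v3.2.1
policy = securityhub.create_configuration_policy(
    Name="FinanceOUPolicy",
    Description="FSBP plus PCI DSS for finance accounts",
    ConfigurationPolicy={
        "SecurityHub": {
            "ServiceEnabled": True,
            "EnabledStandardIdentifiers": [
                "arn:aws:securityhub:us-east-1::standards/aws-foundational-security-best-practices/v/1.0.0",
                "arn:aws:securityhub:us-east-1::standards/pci-dss/v/3.2.1",
            ],
        }
    },
)

# Associate the policy with a specific OU (placeholder ID)
securityhub.start_configuration_policy_association(
    ConfigurationPolicyIdentifier=policy["Arn"],
    Target={"OrganizationalUnitId": "ou-examplerootid111-exampleouid111"},
)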

You might find it necessary to exempt accounts from being centrally configured. You have the option to set an account or OU to self-managed status. Then only the account owner can configure the settings for that account. This is useful if your organization has teams that need to be able to set their own security coverage. Unless you disassociate self-managed accounts from your Security Hub organization, you will still see findings from self-managed accounts, giving you organization-wide visibility into your security posture. However, you won’t be able to view the configuration of those accounts, because they are not centrally managed.

Understand and manage where controls are applied

In addition to being able to centrally create and view your policies, you can use the control details page to define, review, and apply how policies are configured at a control level. To access the control details page, go to the left navigation menu in Security Hub, choose Controls, and then choose any individual control.

The control details page allows you to review the findings of a control in accounts where it is already enabled. Then, if you decide that these findings are not relevant to specific accounts and OUs, or if you decide that you want to use the control in additional accounts where it is not currently enabled, you can choose Configure, view the policies to which the control currently applies, and update the configuration accordingly as shown in Figure 6.

Figure 6: Configuring a control from the control details page

Figure 6: Configuring a control from the control details page

Organizational visibility

As you might already have noticed in the earlier screenshot of the Organization view (Figure 4), the new central configuration capability gives you a new view of the policies applied (and by extension, the controls and standards deployed) to each account and OU. If you need to customize this configuration, you can modify an existing policy or create a new policy to quickly apply to all or a subset of your accounts. At a glance, you can also see which accounts are self-managed or don’t have Security Hub turned on.

Conclusion

Security Hub central configuration helps you to seamlessly configure Security Hub and its controls and standards across your accounts and Regions so that your organization’s accounts have the level of security controls coverage that you want. AWS recommends that you use this feature when configuring, deploying, and managing controls in Security Hub across your organization’s accounts and Regions. Central configuration is now available in all commercial AWS Regions. Try it out today by visiting the new Configuration page in Security Hub from your DA. You can benefit from the Security Hub 30-day free trial even if you use central configuration, and the trial offer will be automatically applied to organization accounts in which you didn’t use Security Hub before.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Nicholas Jaeger

Nicholas Jaeger

Nicholas is a Principal Security Solutions Architect at AWS. His background includes software engineering, teaching, solutions architecture, and AWS security. Today, he focuses on helping companies and organizations operate as securely as possible on AWS. Nicholas also hosts AWS Security Activation Days to provide customers with prescriptive guidance while using AWS security services to increase visibility and reduce risk.

Gal Ordo

Gal Ordo

Gal is a Senior Product Manager for AWS Security Hub at AWS. He has more than a decade of experience in cybersecurity, having focused on IoT, network, and cloud security throughout his career. He is passionate about making sure that customers can continue to scale and grow their environments without compromising on security. Outside of work, Gal enjoys video games, reading, and exploring new places.

Use IAM Identity Center APIs to audit and manage application assignments

Post Syndicated from Laura Reith original https://aws.amazon.com/blogs/security/use-iam-identity-center-apis-to-audit-and-manage-application-assignments/

You can now use AWS IAM Identity Center application assignment APIs to programmatically manage and audit user and group access to AWS managed applications. Previously, you had to use the IAM Identity Center console to manually assign users and groups to an application. Now, you can automate this task so that you scale more effectively as your organization grows.

In this post, we will show you how to use IAM Identity Center APIs to programmatically manage and audit user and group access to applications. The procedures that we share apply to both organization instances and account instances of IAM Identity Center.

Automate management of user and group assignment to applications

IAM Identity Center is where you create, or connect, your workforce users one time and centrally manage their access to multiple AWS accounts and applications. You configure AWS managed applications to work with IAM Identity Center directly from within the relevant application console, and then manage which users or groups need permissions to the application.

You can already use the account assignment APIs to automate multi-account access and audit access assigned to your users using IAM Identity Center permission sets. Today, we expanded this capability with the new application assignment APIs. You can use these new APIs to programmatically control application assignments and develop automated workflows for auditing them.

AWS managed applications access user and group information directly from IAM Identity Center. One example of an AWS managed application is Amazon Redshift. When you configure Amazon Redshift as an AWS managed application with IAM Identity Center, and a user from your organization accesses the database, their group memberships defined in IAM Identity Center can map to Amazon Redshift database roles that grant them specific permissions. This makes it simpler for you to manage users because you don’t have to set database-object permissions for each individual. For more information, see The benefits of Redshift integration with AWS IAM Identity Center.

After you configure the integration between IAM Identity Center and Amazon Redshift, you can automate the assignment or removal of users and groups by using the DeleteApplicationAssignment and CreateApplicationAssignment APIs, as shown in Figure 1.

Figure 1: Use the CreateApplicationAssignment API to assign users and groups to Amazon Redshift

Figure 1: Use the CreateApplicationAssignment API to assign users and groups to Amazon Redshift

In this section, you will learn how to use Identity Center APIs to assign a group to your Amazon Redshift application. You will also learn how to delete the group assignment.

Prerequisites

To follow along with this walkthrough, make sure that you’ve completed the following prerequisites:

  • Enable IAM Identity Center, and use the Identity Store to manage your identity data. If you use an external identity provider, then you should handle the user creation and deletion processes in those systems.
  • Configure Amazon Redshift to use IAM Identity Center as its identity source. When you configure Amazon Redshift to use IAM Identity Center as its identity source, the application requires explicit assignment by default. This means that you must explicitly assign users to the application in the Identity Center console or APIs.
  • Install and configure AWS Command Line Interface (AWS CLI) version 2. For this example, you will use AWS CLI v2 to call the IAM Identity Center application assignment APIs. For more information, see Installing the AWS CLI and Configuring the AWS CLI.

Step 1: Get your Identity Center instance information

The first step is to run the following command to get the Amazon Resource Name (ARN) and Identity Store ID for the instance that you’re working with:

aws sso-admin list-instances

The output should look similar to the following:

{
  "Instances": [
      {
          "InstanceArn": "arn:aws:sso:::instance/ssoins-****************",
          "IdentityStoreId": "d-**********",
          "OwnerAccountId": "************",
          "Name": "MyInstanceName",
          "CreatedDate": "2023-10-08T16:45:19.839000-04:00",
          "State": {
              "Name": "ACTIVE"
          },
          "Status": "ACTIVE"
      }
  ],
  "NextToken": <<TOKEN>>
}

Take note of the IdentityStoreId and the InstanceArn — you will use both in the following steps.

Step 2: Create user and group in your Identity Store

The next step is to create a user and group in your Identity Store.

Note: If you already have a group in your Identity Center instance, get its GroupId and then proceed to Step 3. To get your GroupId, run the following command:

aws identitystore get-group-id --identity-store-id "d-**********" --alternate-identifier '{"UniqueAttribute": {"AttributePath": "DisplayName", "AttributeValue": "GroupName"}}'

Create a new user by using the IdentityStoreId that you noted in the previous step.

aws identitystore create-user --identity-store-id "d-**********" --user-name "MyUser" --emails Value="[email protected]",Type="Work",Primary=true --display-name "My User" --name FamilyName="User",GivenName="My"

The output should look similar to the following:

{
    "UserId": "********-****-****-****-************",
    "IdentityStoreId": "d--********** "
}

Create a group in your Identity Store:

aws identitystore create-group --identity-store-id d-********** --display-name engineering

In the output, make note of the GroupId — you will need it later when you create the application assignment in Step 4:

{
    "GroupId": "********-****-****-****-************",
    "IdentityStoreId": "d-**********"
}

Run the following command to add the user to the group:

aws identitystore create-group-membership --identity-store-id d-********** --group-id ********-****-****-****-************ --member-id UserId=********-****-****-****-************

The result will look similar to the following:

{
    "MembershipId": "********-****-****-****-************",
    "IdentityStoreId": "d-**********"
}

Step 3: Get your Amazon Redshift application ARN

The next step is to determine the application ARN. To get the ARN, run the following command.

aws sso-admin list-applications --instance-arn "arn:aws:sso:::instance/ssoins-****************"

If you have more than one application in your environment, use the filter flag to specify the application account or the application provider. To learn more about the filter option, see the ListApplications API documentation.

In this case, we have only one application: Amazon Redshift. The response should look similar to the following. Take note of the ApplicationArn — you will need it in the next step.

{
    "ApplicationArn": "arn:aws:sso:::instance/ssoins-****************/apl-***************",
    "ApplicationProviderArn": "arn:aws:sso::aws:applicationProvider/Redshift",
    "Name": "Amazon Redshift",
    "InstanceArn": "arn:aws:sso:::instance/ssoins-****************",
    "Status": "DISABLED",
    "PortalOptions": {
        "Visible": true,
        "Visibility": "ENABLED",
        "SignInOptions": {
            "Origin": "IDENTITY_CENTER"
        }
    },
    "AssignmentConfig": {
        "AssignmentRequired": true
    },
    "Description": "Amazon Redshift",
    "CreatedDate": "2023-10-09T10:48:44.496000-07:00"
}

Step 4: Add your group to the Amazon Redshift application

Now you can add your new group to the Amazon Redshift application managed by IAM Identity Center. The principal-id is the GroupId that you created in Step 2.

aws sso-admin create-application-assignment --application-arn "arn:aws:sso:::instance/ssoins-****************/apl-***************" --principal-id "********-****-****-****-************" --principal-type "GROUP"

The group now has access to Amazon Redshift, but with the default permissions in Amazon Redshift. To grant access to databases, you can create roles that control the permissions available on a set of tables or views.

To create these roles in Amazon Redshift, you need to connect to your cluster and run SQL commands. To connect to your cluster, use one of the following options:

Figure 2 shows a connection to Amazon Redshift through the query editor v2.

Figure 2: Query editor v2

Figure 2: Query editor v2

By default, all users have CREATE and USAGE permissions on the PUBLIC schema of a database. To disallow users from creating objects in the PUBLIC schema of a database, use the REVOKE command to remove that permission. For more information, see Default database user permissions.

As the Amazon Redshift database administrator, you can create roles where the role name contains the identity provider namespace prefix and the group or user name. To do this, use the following syntax:

CREATE ROLE <identitycenternamespace:rolename>;

The rolename needs to match the group name in IAM Identity Center. Amazon Redshift automatically maps the IAM Identity Center group or user to the role created previously. To expand the permissions of a user, use the GRANT command.

The identitycenternamespace prefix is assigned when you create the integration between Amazon Redshift and IAM Identity Center. It represents your organization’s name and is added as a prefix to your IAM Identity Center managed users and roles in the Redshift database.

Your syntax should look like the following:

CREATE ROLE <AWSIdentityCenter:MyGroup>;

Step 5: Remove application assignment

If you decide that the new group no longer needs access to the Amazon Redshift application but should remain within the IAM Identity Center instance, run the following command:

aws sso-admin delete-application-assignment --application-arn "arn:aws:sso:::instance/ssoins-****************/apl-***************" --principal-id "********-****-****-****-************" --principal-type "GROUP"

Note: Removing an application assignment for a group doesn’t remove the group from your Identity Center instance.

When you remove or add user assignments, we recommend that you review the application’s documentation because you might need to take additional steps to completely onboard or offboard a given user or group. For example, when you remove a user or group assignment, you must also remove the corresponding roles in Amazon Redshift. You can do this by using the DROP ROLE command. For more information, see Managing database security.

Audit user and group access to applications

Let’s consider how you can use the new APIs to help you audit application assignments. In the preceding example, you used the AWS CLI to create and delete assignments to Amazon Redshift. Now, we will show you how to use the new ListApplicationAssignments API to list the groups that are currently assigned to your Amazon Redshift application.

aws sso-admin list-application-assignments --application-arn arn:aws:sso::****************:application/ssoins-****************/apl-****************

The output should look similar to the following — in this case, you have a single group assigned to the application.

{
    "ApplicationAssignments": [
        {
        "ApplicationArn": "arn:aws:sso::****************:application/ssoins-****************/apl-****************",
        "PrincipalId": "********-****-****-****-************",
        "PrincipalType": "GROUP"
        }
    ]
}

To see the group membership, use the PrincipalId information to query Identity Store and get information on the users assigned to the group with a combination of the ListGroupMemberships and DescribeGroupMembership APIs.
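For example, a small boto3 sketch along the following lines resolves each group assigned to an application into its member user names. The application ARN and Identity Store ID are placeholders, and pagination is omitted for brevity.

import boto3

sso_admin = boto3.client("sso-admin")
identitystore = boto3.client("identitystore")

application_arn = "arn:aws:sso::111122223333:application/ssoins-example/apl-example"  # placeholder
identity_store_id = "d-example1234"  # placeholder

assignments = sso_admin.list_application_assignments(ApplicationArn=application_arn)
for assignment in assignments["ApplicationAssignments"]:
    if assignment["PrincipalType"] != "GROUP":
        continue
    memberships = identitystore.list_group_memberships(
        IdentityStoreId=identity_store_id, GroupId=assignment["PrincipalId"]
    )
    for membership in memberships["GroupMemberships"]:
        user = identitystore.describe_user(
            IdentityStoreId=identity_store_id, UserId=membership["MemberId"]["UserId"]
        )
        print(assignment["PrincipalId"], user["UserName"])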

If you have several applications that IAM Identity Center manages, you can also create a script to automatically audit those applications. You can run this script periodically in an AWS Lambda function in your environment to maintain oversight of the members that are added to each application.

To get the script for this use case, see the multiple-instance-management-iam-identity-center GitHub repository. The repository includes instructions to deploy the script using Lambda within the AWS Organizations delegated administrator account. After deployment, you can invoke the Lambda function to get .csv files of every IAM Identity Center instance in your organization, the applications assigned to each instance, and the users that have access to those applications.

Conclusion

In this post, you learned how to use the IAM Identity Center application assignment APIs to assign users to Amazon Redshift and remove them from the application when they are no longer part of the organization. You also learned to list which applications are deployed in each account, and which users are assigned to each of those applications.

To learn more about IAM Identity Center, see the AWS IAM Identity Center user guide. To test the application assignment APIs, see the SSO-admin API reference guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS IAM Identity Center re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Laura Reith

Laura is an Identity Solutions Architect at AWS, where she thrives on helping customers overcome security and identity challenges. In her free time, she enjoys wreck diving and traveling around the world.

Steve Pascoe

Steve Pascoe

Steve is a Senior Technical Product Manager with the AWS Identity team. He delights in empowering customers with creative and unique solutions to everyday problems. Outside of that, he likes to build things with his family through Lego, woodworking, and recently, 3D printing.


Sowjanya Rajavaram

Sowjanya is a Sr Solution Architect who specializes in Identity and Security in AWS. Her entire career has been focused on helping customers of all sizes solve their Identity and Access Management problems. She enjoys traveling and experiencing new cultures and food.

How to use the BatchGetSecretsValue API to improve your client-side applications with AWS Secrets Manager

Post Syndicated from Brendan Paul original https://aws.amazon.com/blogs/security/how-to-use-the-batchgetsecretsvalue-api-to-improve-your-client-side-applications-with-aws-secrets-manager/

AWS Secrets Manager is a service that helps you manage, retrieve, and rotate database credentials, application credentials, OAuth tokens, API keys, and other secrets throughout their lifecycles. You can use Secrets Manager to help remove hard-coded credentials in application source code. Storing the credentials in Secrets Manager helps avoid unintended or inadvertent access by anyone who can inspect your application’s source code, configuration, or components. You can replace hard-coded credentials with a runtime call to the Secrets Manager service to retrieve credentials dynamically when you need them.

In this blog post, we introduce a new Secrets Manager API call, BatchGetSecretValue, and walk you through how you can use it to retrieve multiple Secrets Manager secrets.

New API — BatchGetSecretValue

Previously, if you had an application that used Secrets Manager and needed to retrieve multiple secrets, you had to write custom code to first identify the list of needed secrets by making a ListSecrets call, and then call GetSecretValue on each individual secret. Now, you don’t need to run ListSecrets and loop. The new BatchGetSecretValue API reduces code complexity when retrieving secrets, reduces latency by running bulk retrievals, and reduces the risk of reaching Secrets Manager service quotas.
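As a minimal illustration, the following boto3 sketch retrieves several secrets by name in a single call; the secret names are placeholders, and error handling is kept deliberately simple.

import boto3

secretsmanager = boto3.client("secretsmanager")

# Retrieve multiple secrets by name or ARN in one request (placeholder names)
response = secretsmanager.batch_get_secret_value(
    SecretIdList=["MyTestSecret1", "MyTestSecret2", "MyTestSecret3"]
)

for secret in response["SecretValues"]:
    print(secret["Name"])           # secret metadata
    value = secret["SecretString"]  # the secret payload (SecretBinary for binary secrets)

# Per-secret failures (for example, access denied) are reported here
# instead of failing the whole call
for error in response.get("Errors", []):
    print(error["SecretId"], error["ErrorCode"], error["Message"])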

Security considerations

Though you can use this feature to retrieve multiple secrets in one API call, the access controls for Secrets Manager secrets remain unchanged. This means AWS Identity and Access Management (IAM) principals need the same permissions as if they were to retrieve each of the secrets individually. If secrets are retrieved using filters, principals must have both permissions for list-secrets and get-secret-value on secrets that are applicable. This helps protect secret metadata from inadvertently being exposed. Resource policies on secrets serve as another access control mechanism, and AWS principals must be explicitly granted permissions to access individual secrets if they’re accessing secrets from a different AWS account (see Cross-account access for more information). Later in this post, we provide some examples of how you can restrict permissions of this API call through an IAM policy or a resource policy.

Solution overview

In the following sections, you will configure an AWS Lambda function to use the BatchGetSecretValue API to retrieve multiple secrets at once. You will also implement attribute-based access control (ABAC) for Secrets Manager secrets and demonstrate the access control mechanisms of Secrets Manager. By following along with this example, you will incur costs for the Secrets Manager secrets that you create and the Lambda function invocations that are made. See the Secrets Manager Pricing and Lambda Pricing pages for more details.

Prerequisites

To follow along with this walk-through, you need:

  1. Five resources that require an application secret to interact with, such as databases or a third-party API key.
  2. Access to an IAM principal that can:
    • Create Secrets Manager secrets through the AWS Command Line Interface (AWS CLI) or AWS Management Console.
    • Create an IAM role to be used as a Lambda execution role.
    • Create a Lambda function.

Step 1: Create secrets

First, create multiple secrets with the same resource tag key-value pair using the AWS CLI. The resource tag will be used for ABAC. These secrets might look different depending on the resources that you decide to use in your environment. You can also manually create these secrets in the Secrets Manager console if you prefer.

Run the following commands in the AWS CLI, replacing the secret-string values with the credentials of the resources that you will be accessing:

  1.  
    aws secretsmanager create-secret --name MyTestSecret1 --description "My first test secret created with the CLI for resource 1." --secret-string "{\"user\":\"username\",\"password\":\"EXAMPLE-PASSWORD-1\"}" --tags "[{\"Key\":\"app\",\"Value\":\"app1\"},{\"Key\":\"environment\",\"Value\":\"production\"}]"
  2.  
    aws secretsmanager create-secret --name MyTestSecret2 --description "My second test secret created with the CLI for resource 2." --secret-string "{\"user\":\"username\",\"password\":\"EXAMPLE-PASSWORD-2\"}" --tags "[{\"Key\":\"app\",\"Value\":\"app1\"},{\"Key\":\"environment\",\"Value\":\"production\"}]"
  3.  
    aws secretsmanager create-secret --name MyTestSecret3 --description "My third test secret created with the CLI for resource 3." --secret-string "{\"user\":\"username\",\"password\":\"EXAMPLE-PASSWORD-3\"}" --tags "[{\"Key\":\"app\",\"Value\":\"app1\"},{\"Key\":\"environment\",\"Value\":\"production\"}]"
  4.  
    aws secretsmanager create-secret --name MyTestSecret4 --description "My fourth test secret created with the CLI for resource 4." --secret-string "{\"user\":\"username\",\"password\":\"EXAMPLE-PASSWORD-4\"}" --tags "[{\"Key\":\"app\",\"Value\":\"app1\"},{\"Key\":\"environment\",\"Value\":\"production\"}]"
  5.  
    aws secretsmanager create-secret --name MyTestSecret5 --description "My fifth test secret created with the CLI for resource 5." --secret-string "{\"user\":\"username\",\"password\":\"EXAMPLE-PASSWORD-5\"}" --tags "[{\"Key\":\"app\",\"Value\":\"app1\"},{\"Key\":\"environment\",\"Value\":\"production\"}]"

Next, create a secret with a different resource tag value for the app key, but the same environment key-value pair. This will allow you to demonstrate that the BatchGetSecretValue call will fail when an IAM principal doesn’t have permissions to retrieve and list the secrets in a given filter.

Create a secret with a different tag, replacing the secret-string values with credentials of the resources that you will be accessing.

  1.  
    aws secretsmanager create-secret --name MyTestSecret6 --description "My test secret created with the CLI." --secret-string "{\"user\":\"username\",\"password\":\"EXAMPLE-PASSWORD-6\"}" --tags "[{\"Key\":\"app\",\"Value\":\"app2\"},{\"Key\":\"environment\",\"Value\":\"production\"}]"

Step 2: Create an execution role for your Lambda function

In this example, create a Lambda execution role that only has permissions to retrieve secrets that are tagged with the app:app1 resource tag.

Create the policy to attach to the role

  1. Navigate to the IAM console.
  2. Select Policies from the navigation pane.
  3. Choose Create policy in the top right corner of the console.
  4. In Specify Permissions, select JSON to switch to the JSON editor view.
  5. Copy and paste the following policy into the JSON text editor.
    {
    	"Version": "2012-10-17",
    	"Statement": [
    		{
    			"Sid": "Statement1",
    			"Effect": "Allow",
    			"Action": [
    				"secretsmanager:ListSecretVersionIds",
    				"secretsmanager:GetSecretValue",
    				"secretsmanager:GetResourcePolicy",
    				"secretsmanager:DescribeSecret"
    			],
    			"Resource": [
    				"*"
    			],
    			"Condition": {
    				"StringNotEquals": {
    					"aws:ResourceTag/app": [
    						"${aws:PrincipalTag/app}"
    					]
    				}
    			}
    		},
    		{
    			"Sid": "Statement2",
    			"Effect": "Allow",
    			"Action": [
    				"secretsmanager:ListSecrets"
    			],
    			"Resource": ["*"]
    		}
    	]
    }

  6. Choose Next.
  7. Enter LambdaABACPolicy for the name.
  8. Choose Create policy.

Create the IAM role and attach the policy

  1. Select Roles from the navigation pane.
  2. Choose Create role.
  3. Under Select Trusted Identity, leave AWS Service selected.
  4. Select the dropdown menu under Service or use case and select Lambda.
  5. Choose Next.
  6. Select the checkbox next to the LambdaABACPolicy policy you just created and choose Next.
  7. Enter a name for the role.
  8. Select Add tags and enter app:app1 as the key value pair for a tag on the role.
  9. Choose Create Role.

Step 3: Create a Lambda function to access secrets

  1. Navigate to the Lambda console.
  2. Choose Create Function.
  3. Enter a name for your function.
  4. Select the Python 3.10 runtime.
  5. Select change default execution role and attach the execution role you just created.
  6. Choose Create Function.
    Figure 1: create a Lambda function to access secrets

    Figure 1: create a Lambda function to access secrets

  7. In the Code tab, copy and paste the following code:
    import json
    import boto3
    
    session = boto3.session.Session()
    # Create a Secrets Manager client
    client = session.client(
            service_name='secretsmanager'
        )
        
    
    def lambda_handler(event, context):
    
        # Retrieve every secret whose tags match both filters in a single call
        application_secrets = client.batch_get_secret_value(Filters=[
            {
                'Key': 'tag-key',
                'Values': [event["TagKey"]]
            },
            {
                'Key': 'tag-value',
                'Values': [event["TagValue"]]
            }
        ])
    
    
        ### RESOURCE 1 CONNECTION ###
        try:
            print("TESTING CONNECTION TO RESOURCE 1")
            resource_1_secret = application_secrets["SecretValues"][0]
            ## IMPLEMENT RESOURCE CONNECTION HERE
    
            print("SUCCESFULLY CONNECTED TO RESOURCE 1")
        
        except Exception as e:
            print("Failed to connect to resource 1")
            return e
    
        ### RESOURCE 2 CONNECTION ###
        try:
            print("TESTING CONNECTION TO RESOURCE 2")
            resource_2_secret = application_secrets["SecretValues"][1]
            ## IMPLEMENT RESOURCE CONNECTION HERE
            
            print("SUCCESFULLY CONNECTED TO RESOURCE 2")
        
        except Exception as e:
            print("Failed to connect to resource 2",)
            return e
    
        
        ### RESOURCE 3 CONNECTION ###
        try:
            print("TESTING CONNECTION TO RESOURCE 3")
            resource_3_secret = application_secrets["SecretValues"][2]
            ## IMPLEMENT RESOURCE CONNECTION HERE
            
            print("SUCCESFULLY CONNECTED TO DB 3")
            
        except Exception as e:
            print("Failed to connect to resource 3")
            return e 
    
        ### RESOURCE 4 CONNECTION ###
        try:
            print("TESTING CONNECTION TO RESOURCE 4")
            resource_4_secret = application_secrets["SecretValues"][3]
            ## IMPLEMENT RESOURCE CONNECTION HERE
            
            print("SUCCESFULLY CONNECTED TO RESOURCE 4")
            
        except Exception as e:
            print("Failed to connect to resource 4")
            return e
    
        ### RESOURCE 5 CONNECTION ###
        try:
            print("TESTING ACCESS TO RESOURCE 5")
            resource_5_secret = application_secrets["SecretValues"][4]
            ## IMPLEMENT RESOURCE CONNECTION HERE
            
            print("SUCCESFULLY CONNECTED TO RESOURCE 5")
            
        except Exception as e:
            print("Failed to connect to resource 5")
            return e
        
        return {
            'statusCode': 200,
            'body': json.dumps('Successfully Completed all Connections!')
        }

  8. You need to configure connections to the resources that you’re using for this example. The code in this example doesn’t create database or resource connections to prioritize flexibility for readers. Add code to connect to your resources after the “## IMPLEMENT RESOURCE CONNECTION HERE” comments.
  9. Choose Deploy.

Step 4: Configure the test event to initiate your Lambda function

  1. Above the code source, choose Test and then Configure test event.
  2. In the Event JSON, replace the JSON with the following:
    {
        "TagKey": "app",
        "TagValue": "app1"
    }

  3. Enter a Name for your event.
  4. Choose Save.

Step 5: Invoke the Lambda function

  1. Invoke the Lambda by choosing Test.

Step 6: Review the function output

  1. Review the response and function logs to see the new feature in action. Your function logs should show successful connections to the five resources that you specified earlier, as shown in Figure 2.
    Figure 2: Review the function output

    Figure 2: Review the function output

Step 7: Test a different input to validate IAM controls

  1. In the Event JSON window, replace the JSON with the following:
    {
        "TagKey": "environment",
        "TagValue": "production"
    }

  2. You should now see an error message from Secrets Manager in the logs similar to the following:
    User: arn:aws:iam::123456789012:user/JohnDoe is not authorized to perform: 
    secretsmanager:GetSecretValue because no resource-based policy allows the secretsmanager:GetSecretValue action

As you can see, you were able to retrieve the appropriate secrets based on the resource tag. You will also note that when the Lambda function tried to retrieve secrets for a resource tag that it didn’t have access to, Secrets Manager denied the request.

How to restrict use of BatchGetSecretValue for certain IAM principals

When dealing with sensitive resources such as secrets, it’s recommended that you adhere to the principle of least privilege. Service control policies, IAM policies, and resource policies can help you do this. Below, we discuss three policies that illustrate this:

Policy 1: IAM ABAC policy for Secrets Manager

This policy denies requests to get a secret if the principal doesn’t share the same project tag as the secret that the principal is trying to retrieve. Note that the effectiveness of this policy is dependent on correctly applied resource tags and principal tags. If you want to take a deeper dive into ABAC with Secrets Manager, see Scale your authorization needs for Secrets Manager using ABAC with IAM Identity Center.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Deny",
      "Action": [
        "secretsmanager:GetSecretValue",
	“secretsmanager:BatchGetSecretValue”
      ],
      "Resource": [
        "*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:ResourceTag/project": [
            "${aws:PrincipalTag/project}"
          ]
        }
      }
    }
  ]
}

Policy 2: Deny BatchGetSecretValue calls unless from a privileged role

This policy example denies the ability to use the BatchGetSecretValue unless it’s run by a privileged workload role.

"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "Statement1",
			"Effect": "Deny",
			"Action": [
				"secretsmanager:BatchGetSecretValue",
			],
			"Resource": [
				"arn:aws:secretsmanager:us-west-2:12345678910:secret:testsecret"
			],
			"Condition": {
				"StringNotLike": {
					"aws:PrincipalArn": [
						"arn:aws:iam::123456789011:role/prod-workload-role"
					]
				}
			}
		}]
}

Policy 3: Restrict actions to specified principals

Finally, let’s take a look at an example resource policy from our data perimeters policy examples. This resource policy restricts Secrets Manager actions to the principals that are in the organization that this secret is a part of, except for AWS service accounts.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceIdentityPerimeter",
            "Effect": "Deny",
            "Principal": {
                "AWS": "*"
            },
            "Action": "secretsmanager:*",
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "aws:PrincipalOrgID": "<my-org-id>"
                },
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "false"
                }
            }
        }
    ]
}

Conclusion

In this blog post, we introduced the BatchGetSecretValue API, which you can use to improve operational excellence and performance efficiency, and to reduce costs when using Secrets Manager. We looked at how you can use the API call in a Lambda function to retrieve multiple secrets that have the same resource tag, and we showed an example of an IAM policy to restrict access to this API.

To learn more about Secrets Manager, see the AWS Secrets Manager documentation or the AWS Security Blog.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Brendan Paul

Brendan Paul

Brendan is a Senior Solutions Architect at Amazon Web Services supporting media and entertainment companies. He has a passion for data protection and has been working at AWS since 2019. In 2024, he will start to pursue his Master’s Degree in Data Science at UC Berkeley. In his free time, he enjoys watching sports and running.

Enhance query performance using AWS Glue Data Catalog column-level statistics

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets. When talking with our customers, we learned that one of the challenging aspects of data lake performance is how to optimize these analytics queries to run faster.

Data lake performance optimization is especially important for queries with multiple joins, and that is where cost-based optimizers help the most. In order for CBO to work, column statistics need to be collected and updated based on changes in the data. We’re launching the capability to generate column-level statistics, such as the number of distinct values, number of nulls, maximum, and minimum, for file formats such as Parquet, ORC, JSON, Amazon Ion, CSV, and XML on AWS Glue tables. With this launch, customers now have an integrated end-to-end experience where statistics on Glue tables are collected and stored in the AWS Glue Data Catalog and made available to analytics services for improved query planning and execution.

Using these statistics, cost-based optimizers improve query run plans and boost the performance of queries run in Amazon Athena and Amazon Redshift Spectrum. For example, CBO can use column statistics such as the number of distinct values and the number of nulls to improve row prediction. Row prediction is the number of rows from a table that will be returned by a certain step during the query planning stage. The more accurate the row predictions are, the more efficient the query execution steps are. This leads to faster query execution and potentially reduced cost. Some of the specific optimizations that CBO can employ include join reordering and push-down of aggregations, based on the statistics available for each table and column.

For customers using a data mesh with AWS Lake Formation permissions, tables from different data producers are cataloged in the centralized governance accounts. As they generate statistics on tables in the centralized catalog and share those tables with consumers, queries on those tables in consumer accounts will see query performance improvements automatically. In this post, we’ll demonstrate the capability of AWS Glue Data Catalog to generate column statistics for our sample tables.
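If you prefer to script statistics generation rather than use the console, a sketch along the following lines uses the Glue column statistics task APIs; the database, table, and role names are placeholders, and the sampling parameter shown is an assumption to adjust for your dataset.

import boto3

glue = boto3.client("glue")

# Start a column statistics generation task for one table (names are placeholders)
task = glue.start_column_statistics_task_run(
    DatabaseName="tpcdsdbwithstats",
    TableName="store_sales",
    Role="arn:aws:iam::111122223333:role/GlueColumnStatisticsRole",  # role Glue assumes to read the data
    SampleSize=100.0,  # percentage of rows to sample
)
print("Started statistics task:", task["ColumnStatisticsTaskRunId"])

# Check the progress of the task run
run = glue.get_column_statistics_task_run(
    ColumnStatisticsTaskRunId=task["ColumnStatisticsTaskRunId"]
)
print(run["ColumnStatisticsTaskRun"]["Status"])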

Solution overview

To demonstrate the effectiveness of this capability, we employ the industry-standard TPC-DS 3 TB dataset stored in an Amazon Simple Storage Service (Amazon S3) public bucket. We’ll compare the query performance before and after generating column statistics for the tables by running queries in Amazon Athena and Amazon Redshift Spectrum. We provide the queries that we used in this post, and we encourage you to try out your own queries by following the same workflow, as illustrated in the following sections.

The workflow consists of the following high level steps:

  1. Cataloging the Amazon S3 Bucket: Utilize AWS Glue Crawler to crawl the designated Amazon S3 bucket, extracting metadata, and seamlessly storing it in the AWS Glue data catalog. We’ll query these tables using Amazon Athena and Amazon Redshift Spectrum.
  2. Generating column statistics: Employ the enhanced capabilities of AWS Glue Data Catalog to generate comprehensive column statistics for the crawled data, thereby providing valuable insights into the dataset.
  3. Querying with Amazon Athena and Amazon Redshift Spectrum: Evaluate the impact of column statistics on query performance by utilizing Amazon Athena and Amazon Redshift Spectrum to execute queries on the dataset.

The following diagram illustrates the solution architecture.

Walkthrough

To implement the solution, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Run the AWS Glue crawlers on the public Amazon S3 bucket to catalog the 3 TB TPC-DS dataset.
  3. Run queries on Amazon Athena and Amazon Redshift and note down the query duration.
  4. Generate statistics for the AWS Glue Data Catalog tables.
  5. Run queries on Amazon Athena and Amazon Redshift and compare the query duration with the previous run.
  6. Optional: Schedule AWS Glue column statistics jobs using AWS Lambda and the Amazon EventBridge Scheduler.

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template generates the following resources:

  • An Amazon Virtual Private Cloud (Amazon VPC), a public subnet, private subnets, and route tables.
  • An Amazon Redshift Serverless workgroup and namespace.
  • An AWS Glue crawler to crawl the public Amazon S3 bucket and create tables in the AWS Glue Data Catalog for the TPC-DS dataset.
  • AWS Glue Data Catalog databases and tables.
  • An Amazon S3 bucket to store Athena query results.
  • AWS Identity and Access Management (AWS IAM) users and policies.
  • An AWS Lambda function and an Amazon EventBridge Scheduler schedule to run the AWS Glue column statistics generation.

To launch the AWS CloudFormation stack, complete the following steps:

Note: The AWS Glue Data Catalog tables are generated using the public bucket s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/, hosted in the us-east-1 Region. If you intend to deploy this AWS CloudFormation template in a different Region, you must either copy the data to that Region or share the data within your deployed Region for it to be accessible from Amazon Redshift.

  1. Log in to the AWS Management Console as AWS Identity and Access Management (AWS IAM) administrator.
  2. Choose Launch Stack to deploy an AWS CloudFormation template.
  3. Choose Next.
  4. On the next page, keep all options at their defaults or make appropriate changes based on your requirements, then choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.

Run the AWS Glue Crawlers created by the AWS CloudFormation stack

To run your crawlers, complete the following steps:

  1. On the AWS Glue console, choose Crawlers under Data Catalog in the navigation pane.
  2. Locate and run the two crawlers tpcdsdb-without-stats and tpcdsdb-with-stats. They may take a few minutes to complete.

When the crawlers complete successfully, they create two identical databases, tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats have no statistics, and we use them as a reference. We generate statistics on the tables in tpcdsdbwithstats. Verify that you have these two databases and the underlying tables on the AWS Glue console. The tpcdsdbnostats database will look like the following. At this time, there are no statistics generated on these tables.

Run provided query using Amazon Athena on no-stats tables

To run your query in Amazon Athena on tables without statistics, complete the following steps:

  1. Download the Athena queries from here.
  2. On the Amazon Athena console, choose the provided queries one at a time for the tables in the tpcdsdbnostats database.
  3. Run each query and note down its run time.

Run provided query using Amazon Redshift Spectrum on no-stats tables

To run your query in Amazon Redshift, complete the following steps:

  1. Download the Amazon Redshift queries from here.
  2. On the Redshift query editor v2, open the Redshift Query for tables without stats section of the downloaded file.
  3. Run the queries and note down the query execution time of each query.

Generate statistics on AWS Glue Catalog tables

To generate statistics on AWS Glue Catalog tables, complete the following steps:

  1. Navigate to the AWS Glue console and choose Databases under Data Catalog.
  2. Choose the tpcdsdbwithstats database to list all the available tables.
  3. Select any of these tables (for example, call_center).
  4. Go to the Column statistics – new tab and choose Generate statistics.
  5. Keep the default options: under Choose columns, keep Table (All columns); under Row sampling options, keep All rows; under IAM role, choose AWSGluestats-blog; then choose Generate statistics.

You’ll be able to see the status of the statistics generation run as shown in the following illustration:

After generating statistics on the AWS Glue Data Catalog table, you should be able to see detailed column statistics for that table:

Repeat steps 2–5 to generate statistics for all necessary tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you can follow the Schedule AWS Glue statistics runs section near the end of this post to generate statistics for all tables. Once done, assess the query performance for each query.
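If you prefer to verify the generated statistics programmatically rather than in the console, the following minimal boto3 sketch reads them back through the GetColumnStatisticsForTable API. The database, table, and column names here match the walkthrough; adjust them for your environment.

import boto3

glue = boto3.client("glue")

# Read back the statistics generated for two sample columns of the call_center table
response = glue.get_column_statistics_for_table(
    DatabaseName="tpcdsdbwithstats",
    TableName="call_center",
    ColumnNames=["cc_call_center_sk", "cc_employees"],
)

for stats in response["ColumnStatisticsList"]:
    # StatisticsData holds the type-specific statistics (distinct values, nulls, min, max, and so on)
    print(stats["ColumnName"], stats["StatisticsData"]["Type"])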

Run provided query using Athena Console on stats tables

  1. On the Amazon Athena console, open the Athena Query for tables with stats section of the downloaded file.
  2. Run the queries and note down the query execution time of each query.

In our sample run of the queries on the tables, we observed the query execution times shown in the following table. We saw clear improvement in query performance, ranging from 13 to 55%.

Athena query time improvement

TPC-DS 3 TB query | Without Glue stats (sec) | With Glue stats (sec) | Performance improvement (%)
Query 2 | 33.62 | 15.17 | 55%
Query 4 | 132.11 | 72.94 | 45%
Query 14 | 134.77 | 91.48 | 32%
Query 28 | 55.99 | 39.36 | 30%
Query 38 | 29.32 | 25.58 | 13%

Run the provided query using Amazon Redshift Spectrum on statistics tables

  1. On the Amazon Redshift query editor v2, open the Redshift Query for tables with stats section of the downloaded file.
  2. Run the queries and note down the query execution time of each query.

In our sample run of the queries on the tables, we observed the query execution times shown in the following table. We saw clear improvement in query performance, ranging from 13 to 89%.

Amazon Redshift Spectrum query time improvement

TPC-DS 3 TB query | Without Glue stats (sec) | With Glue stats (sec) | Performance improvement (%)
Query 40 | 124.156 | 13.12 | 89%
Query 60 | 29.52 | 16.97 | 42%
Query 66 | 18.914 | 16.39 | 13%
Query 95 | 308.806 | 200 | 35%
Query 99 | 20.064 | 16 | 20%

Schedule AWS Glue statistics Runs

In this segment of the post, we’ll guide you through the steps of scheduling AWS Glue column statistics runs using AWS Lambda and the Amazon EventBridge Scheduler. To streamline this process, an AWS Lambda function and an Amazon EventBridge Scheduler schedule were created as part of the CloudFormation stack deployment.

  1. AWS Lambda function setup:

To begin, we utilize an AWS Lambda function to trigger the execution of the AWS Glue column statistics job. The AWS Lambda function invokes the start_column_statistics_task_run API through the boto3 (AWS SDK for Python) library. This sets the groundwork for automating the column statistics update.
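The following is a minimal sketch of such a handler, assuming the database name, IAM role, and a comma-separated table list are provided as environment variables; the function deployed by the CloudFormation stack may be structured differently.

import os
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    database = os.environ["DATABASE_NAME"]         # for example, tpcdsdbwithstats
    role = os.environ["STATS_ROLE"]                # for example, AWSGluestats-blog
    tables = os.environ["TABLE_NAMES"].split(",")  # comma-separated list of tables

    task_run_ids = []
    for table in tables:
        # Starts an asynchronous column statistics task for every column of the table
        response = glue.start_column_statistics_task_run(
            DatabaseName=database,
            TableName=table,
            Role=role,
        )
        task_run_ids.append(response["ColumnStatisticsTaskRunId"])

    return {"taskRunIds": task_run_ids}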

Let’s explore the AWS Lambda function:

    • Go to the AWS Lambda console.
    • Select Functions and locate the GlueTableStatisticsFunctionv1.
    • For a clearer understanding of the AWS Lambda function, we recommend reviewing the code in the Code section and examining the environment variables under Configuration.
  2. Amazon EventBridge Scheduler configuration

The next step involves scheduling the AWS Lambda function invocation using the Amazon EventBridge Scheduler. The scheduler is configured to trigger the AWS Lambda function daily at a specific time – in this case, 08:00 PM. This ensures that the AWS Glue column statistics job runs on a regular and predictable basis.

Now, let’s explore how you can update the schedule. You can change the schedule expression from the Amazon EventBridge Scheduler console.
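If you’d rather update it programmatically, the following boto3 sketch changes the schedule expression while preserving the schedule’s existing target and time window. This assumes the schedule was created with Amazon EventBridge Scheduler as described above, and the schedule name is a placeholder; use the name created by the CloudFormation stack.

import boto3

scheduler = boto3.client("scheduler")
schedule_name = "<your-glue-stats-schedule>"  # placeholder for the schedule created by the stack

# Fetch the current definition so the target and flexible time window are preserved
current = scheduler.get_schedule(Name=schedule_name)

scheduler.update_schedule(
    Name=schedule_name,
    ScheduleExpression="cron(0 20 * * ? *)",  # daily at 08:00 PM UTC
    FlexibleTimeWindow=current["FlexibleTimeWindow"],
    Target=current["Target"],
)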

Cleaning up

To avoid unwanted charges to your AWS account, delete the AWS resources:

  1. Sign into the AWS CloudFormation console as the AWS IAM administrator used for creating the AWS CloudFormation stack.
  2. Delete the AWS CloudFormation stack you created.

Conclusion

In this post, we showed you how you can use the AWS Glue Data Catalog to generate column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings. Refer to the documentation for support of AWS Glue Data Catalog statistics across the various AWS analytics services.

If you have questions or suggestions, submit them in the comments section.


About the Authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Navnit Shukla serves as an AWS Specialist Solution Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled Data Wrangling on AWS. He can be reached via LinkedIn.

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/

For any modern data-driven company, having smooth data integration pipelines is crucial. These pipelines pull data from various sources, transform it, and load it into destination systems for analytics and reporting. When running properly, they provide timely and trustworthy information. However, without vigilance, varying data volumes, characteristics, and application behavior can cause data pipelines to become inefficient and problematic. Performance can slow down, or pipelines can become unreliable. Undetected errors result in bad data and impact downstream analysis. That’s why robust monitoring and troubleshooting for data pipelines is essential across the following four areas:

  • Reliability
  • Performance
  • Throughput
  • Resource utilization

Together, these four aspects of monitoring provide end-to-end visibility and control over a data pipeline and its operations.

Today we are pleased to announce a new class of Amazon CloudWatch metrics reported with your pipelines built on top of AWS Glue for Apache Spark jobs. The new metrics provide aggregate and fine-grained insights into the health and operations of your job runs and the data being processed. In addition to providing insightful dashboards, the metrics provide classification of errors, which helps with root cause analysis of performance bottlenecks and error diagnosis. With this analysis, you can evaluate and apply the recommended fixes and best practices for architecting your jobs and pipelines. As a result, you gain the benefit of higher availability, better performance, and lower cost for your AWS Glue for Apache Spark workload.

This post demonstrates how the new enhanced metrics help you monitor and debug AWS Glue jobs.

Enable the new metrics

The new metrics can be configured through the job parameter enable-observability-metrics.

The new metrics are enabled by default on the AWS Glue Studio console. To configure the metrics on the AWS Glue Studio console, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Your jobs, choose your job.
  3. On the Job details tab, expand Advanced properties.
  4. Under Job observability metrics, select Enable the creation of additional observability CloudWatch metrics when this job runs.

To enable the new metrics in the AWS Glue CreateJob and StartJobRun APIs, set the following parameter in the DefaultArguments property:

  • Key: --enable-observability-metrics
  • Value: true

To enable the new metrics in the AWS Command Line Interface (AWS CLI), set the same job parameters in the --default-arguments argument.
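If you use the AWS SDK for Python (Boto3) instead, a minimal sketch of enabling the metrics for a single run through the StartJobRun API might look like the following; the job name is a placeholder. For the CreateJob route, set the same key and value in the job’s DefaultArguments.

import boto3

glue = boto3.client("glue")

# Override the observability metrics flag for this run only
run = glue.start_job_run(
    JobName="<your-glue-job-name>",
    Arguments={"--enable-observability-metrics": "true"},
)
print(run["JobRunId"])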

Use case

A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. The following is a visual representation of an example job where the number of workers is 10.

When the example job ran, the workerUtilization metrics showed the following trend.

Note that workerUtilization showed values between 0.20 (20%) and 0.40 (40%) for the entire duration. This typically happens when job capacity is over-provisioned and many Spark executors are idle, resulting in unnecessary cost. To improve resource utilization efficiency, it’s a good idea to enable AWS Glue Auto Scaling. The following screenshot shows the same workerUtilization metrics graph when AWS Glue Auto Scaling is enabled for the same job.

workerUtilization showed 1.0 in the beginning because of AWS Glue Auto Scaling and it trended between 0.75 (75%) and 1.0 (100%) based on the workload requirements.

Query and visualize metrics in CloudWatch

Complete the following steps to query and visualize metrics on the CloudWatch console (a programmatic sketch for discovering the metrics follows the list):

  1. On the CloudWatch console, choose All metrics in the navigation pane.
  2. Under Custom namespaces, choose Glue.
  3. Choose Observability Metrics (or Observability Metrics Per Source, or Observability Metrics Per Sink).
  4. Search for and select the specific metric name, job name, job run ID, and observability group.
  5. On the Graphed metrics tab, configure your preferred statistic, period, and so on.
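If you prefer to discover the available metrics and their dimensions programmatically before graphing them, here is a minimal boto3 sketch; the metric name is the one used later in this post.

import boto3

cloudwatch = boto3.client("cloudwatch")

# List the observability metrics reported under the Glue namespace for a given metric name
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="Glue", MetricName="glue.driver.workerUtilization"):
    for metric in page["Metrics"]:
        print(metric["MetricName"], metric["Dimensions"])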

Query metrics using the AWS CLI

Complete the following steps for querying using the AWS CLI (for this example, we query the worker utilization metric):

  1. Create a metric definition JSON file (provide your AWS Glue job name and job run ID):
    $ cat multiplequeries.json
    [
      {
        "Id": "avgWorkerUtil_0",
        "MetricStat" : {
          "Metric" : {
            "Namespace": "Glue",
            "MetricName": "glue.driver.workerUtilization",
            "Dimensions": [
              {
                  "Name": "JobName",
                  "Value": "<your-Glue-job-name-A>"
              },
              {
                "Name": "JobRunId",
                "Value": "<your-Glue-job-run-id-A>"
              },
              {
                "Name": "Type",
                "Value": "gauge"
              },
              {
                "Name": "ObservabilityGroup",
                "Value": "resource_utilization"
              }
            ]
          },
          "Period": 1800,
          "Stat": "Minimum",
          "Unit": "None"
        }
      },
      {
          "Id": "avgWorkerUtil_1",
          "MetricStat" : {
          "Metric" : {
            "Namespace": "Glue",
            "MetricName": "glue.driver.workerUtilization",
            "Dimensions": [
               {
                 "Name": "JobName",
                 "Value": "<your-Glue-job-name-B>"
               },
               {
                 "Name": "JobRunId",
                 "Value": "<your-Glue-job-run-id-B>"
               },
               {
                 "Name": "Type",
                 "Value": "gauge"
               },
               {
                 "Name": "ObservabilityGroup",
                 "Value": "resource_utilization"
               }
            ]
          },
          "Period": 1800,
          "Stat": "Minimum",
          "Unit": "None"
        }
      }
    ]

  2. Run the get-metric-data command:
    $ aws cloudwatch get-metric-data --metric-data-queries file://multiplequeries.json \
         --start-time '2023-10-28T18:20' \
         --end-time '2023-10-28T19:10'  \
         --region us-east-1
    {
        "MetricDataResults": [
          {
             "Id": "avgWorkerUtil_0",
             "Label": "<your label A>",
             "Timestamps": [
                   "2023-10-28T18:20:00+00:00"
             ], 
             "Values": [
                   0.06718750000000001
             ],
             "StatusCode": "Complete"
          },
          {
             "Id": "avgWorkerUtil_1",
             "Label": "<your label B>",
             "Timestamps": [
                  "2023-10-28T18:20:00+00:00"
              ],
              "Values": [
                  0.5959183673469387
              ],
              "StatusCode": "Complete"
           }
        ],
        "Messages": []
    }

Create a CloudWatch alarm

You can create static threshold-based alarms for the different metrics. For instructions, refer to Create a CloudWatch alarm based on a static threshold.

For example, for skewness, you can set an alarm for skewness.stage with a threshold of 1.0 and skewness.job with a threshold of 0.5. These thresholds are just recommendations; you can adjust them based on your specific use case (for example, some jobs are expected to be skewed, and that’s not an issue worth alarming on). Our recommendation is to evaluate the metric values of your job runs for some time before identifying anomalous values and configuring the alarm thresholds.
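As an illustration, the following boto3 sketch creates an alarm on the per-job skewness metric with a threshold of 0.5. The metric name and dimension values below are assumptions based on the metric query earlier in this post; copy the exact dimensions for your job from the CloudWatch All metrics view, and replace the job name and SNS topic ARN placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="glue-job-skewness-high",
    Namespace="Glue",
    MetricName="glue.driver.skewness.job",     # assumed name of the job-level skewness metric
    Dimensions=[
        {"Name": "JobName", "Value": "<your-Glue-job-name>"},
        {"Name": "JobRunId", "Value": "ALL"},                        # assumed aggregate dimension value
        {"Name": "Type", "Value": "gauge"},
        {"Name": "ObservabilityGroup", "Value": "job_performance"},  # assumed observability group
    ],
    Statistic="Maximum",
    Period=1800,
    EvaluationPeriods=1,
    Threshold=0.5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["<your-SNS-topic-ARN>"],
)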

Other enhanced metrics

For a full list of other enhanced metrics available with AWS Glue jobs, refer to Monitoring with AWS Glue Observability metrics. These metrics allow you to capture the operational insights of your jobs, such as resource utilization (memory and disk), normalized error classes such as compilation and syntax, user or service errors, and throughput for each source or sink (records, files, partitions, and bytes read or written).

Job observability dashboards

You can further simplify observability for your AWS Glue jobs by building dashboards on these metrics: Amazon Managed Grafana enables real-time monitoring, and Amazon QuickSight enables visualization and analysis of trends.

Conclusion

This post demonstrated how the new enhanced CloudWatch metrics help you monitor and debug AWS Glue jobs. With these enhanced metrics, you can more easily identify and troubleshoot issues in real time. This results in AWS Glue jobs that experience higher uptime, faster processing, and reduced expenditures. The end benefit for you is more effective and optimized AWS Glue for Apache Spark workloads. The metrics are available in all AWS Glue supported Regions. Check it out!


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Shenoda Guirguis is a Senior Software Development Engineer on the AWS Glue team. His passion is in building scalable and distributed Data Infrastructure/Processing Systems. When he gets a chance, Shenoda enjoys reading and playing soccer.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with interactive and simple to use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, databases and data-warehouses on cloud.

Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-studio-serverless-spark-ui-for-better-monitoring-and-troubleshooting/

In AWS, hundreds of thousands of customers use AWS Glue, a serverless data integration service, to discover, combine, and prepare data for analytics and machine learning. When you have complex datasets and demanding Apache Spark workloads, you may experience performance bottlenecks or errors during Spark job runs. Troubleshooting these issues can be difficult and delay getting jobs working in production. Customers often use the Apache Spark Web UI, a popular debugging tool that is part of open source Apache Spark, to help fix problems and optimize job performance. AWS Glue supports Spark UI in two different ways, but you need to set it up yourself. This requires time and effort spent managing networking and EC2 instances, or trial and error with Docker containers.

Today, we are pleased to announce serverless Spark UI built into the AWS Glue console. You can now use Spark UI easily as it’s a built-in component of the AWS Glue console, enabling you to access it with a single click when examining the details of any given job run. There’s no infrastructure setup or teardown required. AWS Glue serverless Spark UI is a fully-managed serverless offering and generally starts up in a matter of seconds. Serverless Spark UI makes it significantly faster and easier to get jobs working in production because you have ready access to low level details for your job runs.

This post describes how the AWS Glue serverless Spark UI helps you to monitor and troubleshoot your AWS Glue job runs.

Getting started with serverless Spark UI

You can access the serverless Spark UI for a given AWS Glue job run by navigating from your Job’s page in AWS Glue console.

  1. On the AWS Glue console, choose ETL jobs.
  2. Choose your job.
  3. Choose the Runs tab.
  4. Select the job run you want to investigate, then choose Spark UI.

The Spark UI will display in the lower pane, as shown in the following screen capture:

Alternatively, you can get to the serverless Spark UI for a specific job run by navigating from Job run monitoring in AWS Glue.

  1. On the AWS Glue console, choose job run monitoring under ETL jobs.
  2. Select your job run, and choose View run details.

Scroll down to the bottom to view the Spark UI for the job run.

Prerequisites

Complete the following prerequisite steps:

  1. Enable Spark UI event logs for your job runs. This is enabled by default on the AWS Glue console. Once enabled, Spark event log files are created during the job run and stored in your S3 bucket. The serverless Spark UI parses a Spark event log file generated in your S3 bucket to visualize detailed information for both running and completed job runs. A progress bar shows the percentage to completion, with a typical parsing time of less than a minute.
  2. When logs are parsed, you can use the built-in Spark UI to debug, troubleshoot, and optimize your jobs.

For more information about Apache Spark UI, refer to Web UI in Apache Spark.

Monitor and Troubleshoot with Serverless Spark UI

A typical workload for AWS Glue for Apache Spark jobs is loading data from relational databases to S3-based data lakes. This section demonstrates how to monitor and troubleshoot an example job run for this workload with the serverless Spark UI. The sample job reads data from a MySQL database and writes to S3 in Parquet format. The source table has approximately 70 million records.

The following screen capture shows a sample visual job authored in AWS Glue Studio visual editor. In this example, the source MySQL table has already been registered in the AWS Glue Data Catalog in advance. It can be registered through AWS Glue crawler or AWS Glue catalog API. For more information, refer to Data Catalog and crawlers in AWS Glue.

Now it’s time to run the job! The first job run finished in 30 minutes and 10 seconds as shown:

Let’s use Spark UI to optimize the performance of this job run. Open the Spark UI tab on the Job runs page. When you drill down to Stages and view the Duration column, you will notice that Stage Id=0 took 27.41 minutes to run, and the stage had only one Spark task in the Tasks:Succeeded/Total column. That means there was no parallelism when loading data from the source MySQL database.

To optimize the data load, introduce the parameters hashfield and hashpartitions to the source table definition. For more information, refer to Reading from JDBC tables in parallel. In the Glue Data Catalog table, add two properties under Table properties: hashfield=emp_no and hashpartitions=18.

This means that new job runs will parallelize the data load from the source MySQL table.
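If you prefer to set these properties programmatically instead of through the console, the following boto3 sketch copies the current table definition, adds the two properties, and writes it back with UpdateTable. The database and table names are placeholders.

import boto3

glue = boto3.client("glue")

database = "<your-database>"
table_name = "<your-mysql-table>"

# Fetch the current table definition from the Data Catalog
table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

# UpdateTable accepts only a subset of the fields returned by GetTable
allowed = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
table_input = {k: v for k, v in table.items() if k in allowed}

# Add the JDBC parallelism properties
params = table_input.get("Parameters", {})
params.update({"hashfield": "emp_no", "hashpartitions": "18"})
table_input["Parameters"] = params

glue.update_table(DatabaseName=database, TableInput=table_input)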

Let’s try running the same job again! This time, the job run finished in 9 minutes and 9 seconds. It saved 21 minutes from the previous job run.

As a best practice, view the Spark UI and compare the job runs before and after the optimization. Drilling down to Completed stages, you will notice that there was one stage with 18 tasks instead of one task.

In the first job run, AWS Glue automatically shuffled data across multiple executors before writing to destination because there were too few tasks. On the other hand, in the second job run, there was only one stage because there was no need to do extra shuffling, and there were 18 tasks for loading data in parallel from source MySQL database.

Considerations

Keep in mind the following considerations:

  • Serverless Spark UI is supported in AWS Glue 3.0 and later.
  • Serverless Spark UI is available for jobs that ran after November 20, 2023, due to a change in how AWS Glue emits and stores Spark logs.
  • Serverless Spark UI can visualize Spark event logs that are up to 1 GB in size.
  • There is no retention limit, because serverless Spark UI scans the Spark event log files in your S3 bucket.
  • Serverless Spark UI is not available for Spark event logs stored in an S3 bucket that can only be accessed by your VPC.

Conclusion

This post described how the AWS Glue serverless Spark UI helps you monitor and troubleshoot your AWS Glue jobs. By providing instant access to the Spark UI directly within the AWS Management Console, you can now inspect the low-level details of job runs to identify and resolve issues. With the serverless Spark UI, there is no infrastructure to manage—the UI spins up automatically for each job run and tears down when no longer needed. This streamlined experience saves you time and effort compared to manually launching Spark UIs yourself.

Give the serverless Spark UI a try today. We think you’ll find it invaluable for optimizing performance and quickly troubleshooting errors. We look forward to hearing your feedback as we continue improving the AWS Glue console experience.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Alexandra Tello is a Senior Front End Engineer with the AWS Glue team in New York City. She is a passionate advocate for usability and accessibility. In her free time, she’s an espresso enthusiast and enjoys building mechanical keyboards.

Matt Sampson is a Software Development Manager on the AWS Glue team. He loves working with his other Glue team members to make services that our customers benefit from. Outside of work, he can be found fishing and maybe singing karaoke.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytic services. In his spare time, he enjoys skiing and gardening.

How to use multiple instances of AWS IAM Identity Center

Post Syndicated from Laura Reith original https://aws.amazon.com/blogs/security/how-to-use-multiple-instances-of-aws-iam-identity-center/

Recently, AWS launched a new feature that allows deployment of account instances of AWS IAM Identity Center. With this launch, you can now have two types of IAM Identity Center instances: organization instances and account instances. An organization instance is the IAM Identity Center instance that’s enabled in the management account of your organization created with AWS Organizations. This instance is used to manage access to AWS accounts and applications across your entire organization. Organization instances are the best practice when deploying IAM Identity Center. Many customers have requested a way to enable AWS applications using test or sandbox identities. The new account instances are intended to support sandboxed deployments of AWS managed applications such as Amazon CodeCatalyst and are only usable from within the account and AWS Region in which they were created. They can exist in a standalone account or in a member account within AWS Organizations.

In this blog post, we show you when to use each instance type, how to control the deployment of account instances, and how you can monitor, manage, and audit these instances at scale using the enhanced IAM Identity Center APIs.

IAM Identity Center instance types

IAM Identity Center now offers two deployment types, the traditional organization instance and an account instance, shown in Figure 1. In this section, we show you the differences between the two.
 

Figure 1: IAM Identity Center instance types

Organization instance of IAM Identity Center

An organization instance of IAM Identity Center is the fully featured version that’s available with AWS Organizations. This type of instance helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications in your organization. The recommended use of an organization instance of Identity Center is for workforce authentication and authorization on AWS for organizations of any size and type.

Using the organization instance of IAM Identity Center, your identity center administrator can create and manage user identities in the Identity Center directory, or connect your existing identity source, including Microsoft Active Directory, Okta, Ping Identity, JumpCloud, Google Workspace, and Azure Active Directory (Entra ID). There is only one organization instance of IAM Identity Center at the organization level. If you have enabled IAM Identity Center before November 15, 2023, you have an organization instance.

Account instances of IAM Identity Center

Account instances of IAM Identity Center provide a subset of the features of the organization instance. Specifically, account instances support user and group assignments initially only to Amazon CodeCatalyst. They are bound to a single AWS account, and you can deploy them in either member accounts of an organization or in standalone AWS accounts. You can only deploy one account instance per AWS account regardless of Region.

You can use account instances of IAM Identity Center to provide access to supported Identity Center enabled applications if the application is in the same account and Region.

Account instances of Identity Center don’t support permission sets or assignments to customer managed applications. If you enabled Identity Center before November 15, 2023, then you must enable account instance creation from your management account. To learn more, see Enable account instances in the AWS Management Console documentation. If you haven’t yet enabled Identity Center, then account instances are now available to you.

When should I use account instances of IAM Identity Center?

Account instances are intended for use in specific situations where organization instances are unavailable or impractical, including:

  • You want to run a temporary trial of a supported AWS managed application to determine if it suits your business needs. See Additional Considerations.
  • You are unable to deploy IAM Identity Center across your organization, but still want to experiment with one or more AWS managed applications. See Additional Considerations.
  • You have an organization instance of IAM Identity Center, but you want to deploy a supported AWS managed application to an isolated set of users that are distinct from those in your organization instance.

Additional considerations

When working with multiple instances of IAM Identity Center, you want to keep a number of things in mind:

  • Each instance of IAM Identity Center is separate and distinct from other Identity Center instances. That is, users and assignments are managed separately in each instance without a means to keep them in sync.
  • Migration between instances isn’t possible. This means that migrating an application between instances requires setting up that application from scratch in the new instance.
  • Account instances have the same considerations when changing your identity source as an organization instance. In general, you want to set up with the right identity source before adding assignments.
  • Automating user assignment to applications through the IAM Identity Center public APIs also requires using the application’s APIs to ensure that those users and groups have the right permissions within the application. For example, if you assign groups to CodeCatalyst using Identity Center, you still have to assign the groups to the CodeCatalyst space from the Amazon CodeCatalyst page in the AWS Management Console. See the Setting up a space that supports identity federation documentation.
  • By default, account instances require newly added users to register a multi-factor authentication (MFA) device when they first sign in. This can be altered in the AWS Management Console for Identity Center for a specific instance.

Controlling IAM Identity Center instance deployments

If you’ve enabled IAM Identity Center prior to November 15, 2023 then account instance creation is off by default. If you want to allow account instance creation, you must enable this feature from the Identity Center console in your organization’s management account. This includes scenarios where you’re using IAM Identity Center centrally and want to allow deployment and management of account instances. See Enable account instances in the AWS Management Console documentation.

If you enable IAM Identity Center after November 15, 2023, or if you haven’t enabled Identity Center at all, you can control the creation of account instances of Identity Center through a service control policy (SCP). We recommend applying the following sample policy to restrict the use of account instances to all but a select set of AWS accounts. The sample SCP that follows helps you deny creation of account instances of Identity Center in accounts in the organization unless the account ID matches the one you specified in the policy. Replace <ALLOWED-ACCOUNT-ID> with the ID of the account that is allowed to create account instances of Identity Center:

{
    "Version": "2012-10-17",
    "Statement" : [
        {
            "Sid": "DenyCreateAccountInstances",
            "Effect": "Deny",
            "Action": [
                "sso:CreateInstance"
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": [
                    "aws:PrincipalAccount": ["<ALLOWED-ACCOUNT-ID>"]
                ]
            }
        }
    ]
}

To learn more about SCPs, see the AWS Organizations User Guide on service control policies.

Monitoring instance activity with AWS CloudTrail

If your organization has an existing log ingestion pipeline solution to collect logs and generate reports through AWS CloudTrail, then the CloudTrail operations supported by IAM Identity Center will automatically be present in your pipeline, including account instance actions such as sso:CreateInstance.

To create a monitoring solution for IAM Identity Center events in your organization, you should set up monitoring through AWS CloudTrail. CloudTrail is a service that records events from AWS services to facilitate monitoring activity from those services in your accounts. You can create a CloudTrail trail that captures events across all accounts and all Regions in your organization and persists them to Amazon Simple Storage Service (Amazon S3).

After creating a trail for your organization, you can use it in several ways. You can send events to Amazon CloudWatch Logs and set up monitoring and alarms for Identity Center events, which enables immediate notification of supported IAM Identity Center CloudTrail operations. With multiple instances of Identity Center deployed within your organization, you can also enable notification of instance activity, including new instance creation, deletion, application registration, user authentication, or other supported actions.

If you want to take action on IAM Identity Center events, you can create a solution to process events using additional services such as Amazon Simple Notification Service, Amazon Simple Queue Service, and the CloudTrail Processing Library. With this solution, you can set your own business logic and rules as appropriate.

Additionally, you might want to consider AWS CloudTrail Lake, which provides a powerful data store that allows you to query CloudTrail events without needing to manage a complex data loading pipeline. You can quickly create a data store for new events, which will immediately start gathering data that can be queried within minutes. To analyze historical data, you can copy your organization trail to CloudTrail Lake.

The following is an example of a simple query that shows you a list of the Identity Center instances created and deleted, the account where they were created, and the user that created them. Replace <Event_data_store_ID> with your store ID.

SELECT 
    userIdentity.arn AS userARN, eventName, userIdentity.accountId 
FROM 
    <Event_data_store_ID> 
WHERE
    userIdentity.arn IS NOT NULL
    AND eventName IN ('CreateInstance', 'DeleteInstance')
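You can also run the same query programmatically. The following boto3 sketch starts the query and polls until it finishes; replace <Event_data_store_ID> with your event data store ID as above.

import time
import boto3

cloudtrail = boto3.client("cloudtrail")

query = """
SELECT userIdentity.arn AS userARN, eventName, userIdentity.accountId
FROM <Event_data_store_ID>
WHERE userIdentity.arn IS NOT NULL
  AND eventName IN ('CreateInstance', 'DeleteInstance')
"""

query_id = cloudtrail.start_query(QueryStatement=query)["QueryId"]

# Poll until the query reaches a terminal state
while True:
    results = cloudtrail.get_query_results(QueryId=query_id)
    if results["QueryStatus"] in ("FINISHED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(5)

# Print the first page of result rows (use NextToken to page through larger results)
for row in results.get("QueryResultRows", []):
    print(row)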

You can save your query result to an S3 bucket and download a copy of the results in CSV format. To learn more, follow the steps in Download your CloudTrail Lake saved query results. Figure 2 shows the CloudTrail Lake query results.

Figure 2: AWS CloudTrail Lake query results

If you want to automate the sourcing, aggregation, normalization, and data management of security data across your organization using the Open Cybersecurity Schema Framework (OCSF) standard, you will benefit from using Amazon Security Lake. This service helps make your organization’s security data broadly accessible to your preferred security analytics solutions to power use cases such as threat detection, investigation, and incident response. Learn more in What is Amazon Security Lake?

Instance management and discovery within an organization

You can create account instances of IAM Identity Center in a standalone account or in an account that belongs to your organization. Creation can happen through an API call (CreateInstance), from the Identity Center console in a member account, or from the setup experience of a supported AWS managed application. Learn more about Supported AWS managed applications.

If you decide to apply the DenyCreateAccountInstances SCP shown earlier to accounts in your organization, you will no longer be able to create account instances of IAM Identity Center in those accounts. However, you should also consider that when you invite a standalone AWS account to join your organization, the account might have an existing account instance of Identity Center.

To identify existing instances, who’s using them, and what they’re using them for, you can audit your organization to search for new instances. The following script shows how to discover all IAM Identity Center instances in your organization and export a .csv summary to an S3 bucket. This script is designed to run in the account where Identity Center was enabled. Instructions on how to use this script are available in the GitHub repository referenced later in this post.

. . .
. . .
accounts_and_instances_dict={}
duplicated_users ={}

main_session = boto3.session.Session()
sso_admin_client = main_session.client('sso-admin')
identity_store_client = main_session.client('identitystore')
organizations_client = main_session.client('organizations')
s3_client = boto3.client('s3')
logger = logging.getLogger()
logger.setLevel(logging.INFO)

#create function to list all Identity Center instances in your organization
def lambda_handler(event, context):
    application_assignment = []
    user_dict={}
    
    current_account = os.environ['CurrentAccountId']
 
    logger.info("Current account %s", current_account)
    
    paginator = organizations_client.get_paginator('list_accounts')
    page_iterator = paginator.paginate()
    for page in page_iterator:
        for account in page['Accounts']:
            get_credentials(account['Id'],current_account)
            #get all instances per account - returns dictionary of instance id and instances ARN per account
            accounts_and_instances_dict = get_accounts_and_instances(account['Id'], current_account)
                    
def get_accounts_and_instances(account_id, current_account):
    global accounts_and_instances_dict
    
    instance_paginator = sso_admin_client.get_paginator('list_instances')
    instance_page_iterator = instance_paginator.paginate()
    for page in instance_page_iterator:
        for instance in page['Instances']:
            #send back all instances and identity centers
            if account_id == current_account:
                accounts_and_instances_dict = {current_account:[instance['IdentityStoreId'],instance['InstanceArn']]}
            elif instance['OwnerAccountId'] != current_account: 
                accounts_and_instances_dict[account_id]= ([instance['IdentityStoreId'],instance['InstanceArn']])
    return accounts_and_instances_dict
  . . .  
  . . .
  . . .

The following table shows the resulting IAM Identity Center instance summary report with all of the accounts in your organization and their corresponding Identity Center instances.

AccountId IdentityCenterInstance
111122223333 d-111122223333
111122224444 d-111122223333
111122221111 d-111111111111

Duplicate user detection across multiple instances

A consideration of having multiple IAM Identity Center instances is the possibility of having the same person existing in two or more instances. In this situation, each instance creates a unique identifier for the same person and the identifier associates application-related data to the user. Create a user management process for incoming and outgoing users that is similar to the process you use at the organization level. For example, if a user leaves your organization, you need to revoke access in all Identity Center instances where that user exists.

The code that follows can be added to the previous script to help detect where duplicates might exist so you can take appropriate action. If you find a lot of duplication across account instances, you should consider adopting an organization instance to reduce your management overhead.

...
#determine if the member in IdentityStore have duplicate
def get_users(identityStoreId, user_dict): 
    global duplicated_users
    paginator = identity_store_client.get_paginator('list_users')
    page_iterator = paginator.paginate(IdentityStoreId=identityStoreId)
    for page in page_iterator:
        for user in page['Users']:
            if ( 'Emails' not in user ):
                print("user has no email")
            else:
                for email in user['Emails']:
                    if email['Value'] not in user_dict:
                        user_dict[email['Value']] = identityStoreId
                    else:
                        print("Duplicate user found " + user['UserName'])
                        user_dict[email['Value']] = user_dict[email['Value']] + "," + identityStoreId
                        duplicated_users[email['Value']] = user_dict[email['Value']]
    return user_dict 
... 

The following table shows the resulting report with duplicated users in your organization and their corresponding IAM Identity Center instances.

User_email IdentityStoreId
[email protected] d-111122223333, d-111111111111
[email protected] d-111122223333, d-111111111111, d-222222222222
[email protected] d-111111111111, d-222222222222

The full script for all of the above use cases is available in the multiple-instance-management-iam-identity-center GitHub repository. The repository includes instructions to deploy the script using AWS Lambda within the management account. After deployment, you can invoke the Lambda function to get .csv files of every IAM Identity center instance in your organization, the applications assigned to each instance, and the users that have access to those applications. With this function, you also get a report of users that exist in more than one local instance.

Conclusion

In this post, you learned the differences between an IAM Identity Center organization instance and an account instance, considerations for when to use an account instance, and how to use Identity Center APIs to automate discovery of Identity Center account instances in your organization.

To learn more about IAM Identity Center, see the AWS IAM Identity Center user guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS IAM Identity Center re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Laura Reith

Laura is an Identity Solutions Architect at AWS, where she thrives on helping customers overcome security and identity challenges. In her free time, she enjoys wreck diving and traveling around the world.

Steve Pascoe

Steve is a Senior Technical Product Manager with the AWS Identity team. He delights in empowering customers with creative and unique solutions to everyday problems. Outside of that, he likes to build things with his family through Lego, woodworking, and recently, 3D printing.

Sowjanya Rajavaram

Sowjanya is a Sr Solutions Architect who specializes in Identity and Security in AWS. Her entire career has been focused on helping customers of all sizes solve their identity and access management challenges. She enjoys traveling and experiencing new cultures and food.

Download AWS Security Hub CSV report

Post Syndicated from Pablo Pagani original https://aws.amazon.com/blogs/security/download-aws-security-hub-csv-report/

AWS Security Hub provides a comprehensive view of your security posture in Amazon Web Services (AWS) and helps you check your environment against security standards and best practices. In this post, I show you a solution to export Security Hub findings to a .csv file weekly and send an email notification to download the file from Amazon Simple Storage Service (Amazon S3). By using this solution, you can share the report with others without providing access to your AWS account. You can also use it to generate assessment reports and prioritize and build a remediation roadmap.

When you enable Security Hub, it collects and consolidates findings from AWS security services that you’re using, such as threat detection findings from Amazon GuardDuty, vulnerability scans from Amazon Inspector, S3 bucket policy findings from Amazon Macie, publicly accessible and cross-account resources from AWS Identity and Access Management Access Analyzer, and resources missing AWS WAF coverage from AWS Firewall Manager. Security Hub also consolidates findings from integrated AWS Partner Network (APN) security solutions.

Cloud security processes can differ from traditional on-premises security in that security is often decentralized in the cloud. With traditional on-premises security operations, security alerts are typically routed to centralized security teams operating out of security operations centers (SOCs). With cloud security operations, it’s often the application builders or DevOps engineers who are best situated to triage, investigate, and remediate security alerts.

This solution uses the Security Hub API, AWS Lambda, Amazon S3, and Amazon Simple Notification Service (Amazon SNS). Findings are aggregated into a .csv file to help identify common security issues that might require remediation action.

Solution overview

This solution assumes that Security Hub is enabled in your AWS account. If it isn’t enabled, set up the service so that you can start seeing a comprehensive view of security findings across your AWS accounts.

How the solution works

  1. An Amazon EventBridge time-based event invokes a Lambda function for processing.
  2. The Lambda function gets finding results from the Security Hub API and writes them into a .csv file.
  3. The Lambda function uploads the file into Amazon S3 and generates a presigned URL with a 24-hour duration, or the duration of the temporary credentials used by Lambda, whichever ends first (a minimal sketch of this flow follows Figure 1).
  4. Amazon SNS sends an email notification to the address provided during deployment. This email address can be updated afterwards through the Amazon SNS console.
  5. The email includes a link to download the file.

Figure 1: Solution overview, deployed through AWS CloudFormation
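The following is a minimal sketch of the core flow described in steps 2 and 3 above, assuming a hypothetical bucket name and a reduced set of report columns; the Lambda function deployed by the CloudFormation template covers more fields and adds error handling.

import csv
import io
import boto3

securityhub = boto3.client("securityhub")
s3 = boto3.client("s3")

bucket = "<your-report-bucket>"          # hypothetical bucket name
key = "security-hub-findings.csv"

# Flatten a few fields from each active finding into CSV rows
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Id", "Title", "Severity", "ResourceId", "AccountId"])

paginator = securityhub.get_paginator("get_findings")
for page in paginator.paginate(
    Filters={"RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}]}
):
    for finding in page["Findings"]:
        writer.writerow([
            finding["Id"],
            finding.get("Title", ""),
            finding.get("Severity", {}).get("Label", ""),
            finding["Resources"][0]["Id"],
            finding["AwsAccountId"],
        ])

s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))

# Presigned URL valid for up to 24 hours (bounded by the credentials used to sign it)
url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=86400
)
print(url)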

Fields included in the report:

Note: You can extend the report by modifying the Lambda function to add fields as needed.

Solution resources

The solution provided with this blog post consists of an AWS CloudFormation template named security-hub-full-report-email.json that deploys the following resources:

  1. An Amazon SNS topic named SecurityHubRecurringFullReport and an email subscription to the topic.
    Figure 2: SNS topic created by the solution

  2. The email address that subscribes to the topic is captured through a CloudFormation template input parameter. The subscriber is notified by email to confirm the subscription. After confirmation, the subscription to the SNS topic is created. Additional subscriptions can be added as needed to include additional emails or distribution lists.
    Figure 3: SNS email subscription

  3. The SendSecurityHubFullReportEmail Lambda function queries the Security Hub API to get findings and writes them into a .csv file in Amazon S3. A presigned URL for the file is generated, and an email message with the link is published to the SNS topic described above.
    Figure 4: Lambda function created by the solution

  4. An IAM role for the Lambda function to be able to create logs in CloudWatch, get findings from Security Hub, publish messages to SNS, and put objects into an S3 bucket.
    Figure 5: Permissions policy for the Lambda function

  5. An EventBridge scheduled rule named SecurityHubFullReportEmailSchedule that invokes the Lambda function that generates the findings report. The default schedule is every Monday at 8:00 AM UTC. This schedule can be overridden by using a CloudFormation input parameter. Learn more about creating cron expressions.
    Figure 6: Example of the EventBridge schedule created by the solution

Deploy the solution

Use the following steps to deploy this solution in a single AWS account. If you have a Security Hub administrator account or are using Security Hub cross-Region aggregation, the report will get the findings from the linked AWS accounts and Regions.

To deploy the solution

  1. Download the CloudFormation template security-hub-full-report-email.json from our GitHub repository.
  2. Copy the template to an S3 bucket within your target AWS account and Region. Copy the object URL for the CloudFormation template .json file.
  3. On the AWS Management Console, go to the CloudFormation console. Choose Create Stack and select With new resources.
    Figure 7: Create stack with new resources

  4. Under Specify template, in the Amazon S3 URL textbox, enter the S3 object URL for the .json file that you uploaded in step 2.
    Figure 8: Specify S3 URL for CloudFormation template

  5. Choose Next. On the next page, do the following:
    1. Stack name: Enter a name for the stack.
    2. Email address: Enter the email address of the subscriber to the Security Hub findings email.
    3. RecurringScheduleCron: Enter the cron expression for scheduling the Security Hub findings email. The default is every Monday at 8:00 AM UTC. Learn more about creating cron expressions.
    4. SecurityHubRegion: Enter the Region where Security Hub is aggregating the findings.
    Figure 9: Enter stack name and parameters

  6. Choose Next.
  7. Keep all defaults in the screens that follow and choose Next.
  8. Check the box I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.

Test the solution

You can send a test email after the deployment is complete. To do this, open the Lambda console and locate the SendSecurityHubFullReportEmail Lambda function. Perform a manual invocation with an event payload to receive an email within a few minutes. You can repeat this procedure as many times as you want.
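If you prefer to trigger the test from code rather than the console, the following boto3 sketch invokes the function with an empty payload; the console shows the exact function name, which may include a stack-generated prefix.

import json
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="<SendSecurityHubFullReportEmail-function-name>",  # copy the exact name from the console
    InvocationType="RequestResponse",
    Payload=json.dumps({}).encode("utf-8"),
)
print(response["StatusCode"])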

Conclusion

In this post, I’ve shown you an approach for rapidly building a solution that sends a weekly report of your AWS account’s security posture findings as evaluated by Security Hub. This solution helps you to be diligent in reviewing outstanding findings and to remediate them in a timely way based on their severity. You can extend the solution in many ways, including:

  • Send the file to an email-enabled ticketing service, such as ServiceNow, or to a security information and event management (SIEM) tool that you use.
  • Add links to internal wikis for workflows such as organizational exceptions to vulnerabilities or other internal processes.
  • Extend the solution by modifying the filters, email content, and delivery frequency.

To learn more about how to set up and customize Security Hub, see these additional blog posts.

If you have feedback about this post, submit comments in the Comments section below. If you have any questions about this post, start a thread on the AWS Security Hub re:Post forum.

Want more AWS Security news? Follow us on Twitter.

Pablo Pagani

Pablo is the Sr. Latam Security Manager for AWS Professional Services based in Buenos Aires, Argentina. He helps customers build a secure journey in AWS. He developed his passion for computers while writing his first lines of code in BASIC using a Talent MSX.