
Validate IAM policies by using IAM Policy Validator for AWS CloudFormation and GitHub Actions

Post Syndicated from Mitch Beaumont original https://aws.amazon.com/blogs/security/validate-iam-policies-by-using-iam-policy-validator-for-aws-cloudformation-and-github-actions/

In this blog post, I’ll show you how to automate the validation of AWS Identity and Access Management (IAM) policies by using a combination of the IAM Policy Validator for AWS CloudFormation (cfn-policy-validator) and GitHub Actions. Policy validation is an approach that is designed to minimize the deployment of unwanted IAM identity-based and resource-based policies to your Amazon Web Services (AWS) environments.

With GitHub Actions, you can automate, customize, and run software development workflows directly within a repository. Workflows are defined using YAML and are stored alongside your code. I’ll discuss the specifics of how you can set up and use GitHub Actions within a repository in the sections that follow.

The cfn-policy-validator tool is a command-line tool that takes an AWS CloudFormation template, finds and parses the IAM policies that are attached to IAM roles, users, groups, and resources, and then runs the policies through IAM Access Analyzer policy checks. Implementing IAM policy validation checks at the time of code check-in helps shift security to the left (closer to the developer) and shortens the time between when developers commit code and when they get feedback on their work.

Let’s walk through an example that checks the policies that are attached to an IAM role in a CloudFormation template. In this example, the cfn-policy-validator tool will find that the trust policy attached to the IAM role allows the role to be assumed by external principals. This configuration could lead to unintended access to your resources and data, which is a security risk.
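For context, the checks that the cfn-policy-validator tool orchestrates are the same ones exposed by the IAM Access Analyzer API. The following is a minimal, hedged boto3 sketch that runs the ValidatePolicy check directly against the sample role's inline permissions policy; note that the external-principal finding discussed in this walkthrough comes from Access Analyzer access previews, which the tool invokes for trust and resource policies on your behalf.

import json

import boto3

# Assumes AWS credentials and a default Region are already configured.
analyzer = boto3.client("accessanalyzer")

permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "*"}
    ],
}

response = analyzer.validate_policy(
    policyDocument=json.dumps(permissions_policy),
    policyType="IDENTITY_POLICY",
)

for finding in response["findings"]:
    # Each finding has a type (ERROR, SECURITY_WARNING, WARNING, or SUGGESTION),
    # an issue code, and a human-readable description.
    print(finding["findingType"], finding["issueCode"], finding["findingDetails"])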

Prerequisites

To complete this example, you will need the following:

  1. A GitHub account
  2. An AWS account, and an identity within that account that has permissions to create the IAM roles and resources used in this example

Step 1: Create a repository that will host the CloudFormation template to be validated

To begin with, you need to create a GitHub repository to host the CloudFormation template that is going to be validated by the cfn-policy-validator tool.

To create a repository:

  1. Open a browser and go to https://github.com.
  2. In the upper-right corner of the page, in the drop-down menu, choose New repository. For Repository name, enter a short, memorable name for your repository.
  3. (Optional) Add a description of your repository.
  4. Choose either the option Public (the repository is accessible to everyone on the internet) or Private (the repository is accessible only to people with whom access is explicitly shared).
  5. Choose Initialize this repository with: Add a README file.
  6. Choose Create repository. Make a note of the repository’s name.

Step 2: Clone the repository locally

Now that the repository has been created, clone it locally and add a CloudFormation template.

To clone the repository locally and add a CloudFormation template:

  1. Open the command-line tool of your choice.
  2. Use the following command to clone the new repository locally. Make sure to replace <GitHubOrg> and <RepositoryName> with your own values.
    git clone git@github.com:<GitHubOrg>/<RepositoryName>.git

  3. Change into the directory that contains the locally cloned repository.
    cd <RepositoryName>

    Now that the repository is cloned locally, populate it with the following sample CloudFormation template. This template creates a single IAM role that allows a principal to assume the role to perform the s3:GetObject action.

  4. Use the following command to create the sample CloudFormation template file.

    WARNING: This sample role and policy should not be used in production. Using a wildcard in the principal element of a role’s trust policy would allow any IAM principal in any account to assume the role.

    cat << EOF > sample-role.yaml
    
    AWSTemplateFormatVersion: "2010-09-09"
    Description: Base stack to create a simple role
    Resources:
      SampleIamRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Statement:
              - Effect: Allow
                Principal:
                  AWS: "*"
                Action: ["sts:AssumeRole"]
          Path: /      
          Policies:
            - PolicyName: root
              PolicyDocument:
                Version: 2012-10-17
                Statement:
                  - Resource: "*"
                    Effect: Allow
                    Action:
                      - s3:GetObject
    EOF

Notice that AssumeRolePolicyDocument refers to a trust policy that includes a wildcard value in the principal element. This means that the role could potentially be assumed by an external identity, and that’s a risk you want to know about.

Step 3: Vend temporary AWS credentials for GitHub Actions workflows

In order for the cfn-policy-validator tool that’s running in the GitHub Actions workflow to use the IAM Access Analyzer API, the GitHub Actions workflow needs a set of temporary AWS credentials. The AWS Credentials for GitHub Actions action helps address this requirement. This action implements the AWS SDK credential resolution chain and exports environment variables for other actions to use in a workflow. Environment variable exports are detected by the cfn-policy-validator tool.

AWS Credentials for GitHub Actions supports four methods for fetching credentials from AWS, but the recommended approach is to use GitHub’s OpenID Connect (OIDC) provider in conjunction with a configured IAM identity provider endpoint.

To configure an IAM identity provider endpoint for use in conjunction with GitHub’s OIDC provider:

  1. Open the AWS Management Console and navigate to IAM.
  2. In the left-hand menu, choose Identity providers, and then choose Add provider.
  3. For Provider type, choose OpenID Connect.
  4. For Provider URL, enter
    https://token.actions.githubusercontent.com
  5. Choose Get thumbprint.
  6. For Audiences, enter sts.amazonaws.com
  7. Choose Add provider to complete the setup.

At this point, make a note of the OIDC provider name. You’ll need this information in the next step.
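If you prefer to script this configuration, the provider can also be created through the IAM API. The following is a hedged boto3 sketch; the thumbprint placeholder stands in for the value that the console's Get thumbprint step computes for you.

import boto3

iam = boto3.client("iam")

# Placeholder: supply the certificate thumbprint computed by the console's
# Get thumbprint step (or calculate it yourself from the provider's TLS chain).
response = iam.create_open_id_connect_provider(
    Url="https://token.actions.githubusercontent.com",
    ClientIDList=["sts.amazonaws.com"],
    ThumbprintList=["<thumbprint-from-get-thumbprint-step>"],
)
print(response["OpenIDConnectProviderArn"])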

After it’s configured, the IAM identity provider endpoint should look similar to the following:

Figure 1: IAM Identity provider details

Step 4: Create an IAM role with permissions to call the IAM Access Analyzer API

In this step, you will create an IAM role that can be assumed by the GitHub Actions workflow and that provides the necessary permissions to run the cfn-policy-validator tool.

To create the IAM role:

  1. In the IAM console, in the left-hand menu, choose Roles, and then choose Create role.
  2. For Trust entity type, choose Web identity.
  3. In the Provider list, choose the new GitHub OIDC provider that you created in the earlier step. For Audience, select sts.amazonaws.com from the list.
  4. Choose Next.
  5. On the Add permission page, choose Create policy.
  6. Choose JSON, and enter the following policy:
    
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                  "iam:GetPolicy",
                  "iam:GetPolicyVersion",
                  "access-analyzer:ListAnalyzers",
                  "access-analyzer:ValidatePolicy",
                  "access-analyzer:CreateAccessPreview",
                  "access-analyzer:GetAccessPreview",
                  "access-analyzer:ListAccessPreviewFindings",
                  "access-analyzer:CreateAnalyzer",
                  "s3:ListAllMyBuckets",
                  "cloudformation:ListExports",
                  "ssm:GetParameter"
                ],
                "Resource": "*"
            },
            {
              "Effect": "Allow",
              "Action": "iam:CreateServiceLinkedRole",
              "Resource": "*",
              "Condition": {
                "StringEquals": {
                  "iam:AWSServiceName": "access-analyzer.amazonaws.com"
                }
              }
            } 
        ]
    }

  7. After you’ve attached the new policy, choose Next.

    Note: For a full explanation of each of these actions and a CloudFormation template example that you can use to create this role, see the IAM Policy Validator for AWS CloudFormation GitHub project.

  8. Give the role a name, and scroll down to look at Step 1: Select trusted entities.

    The default trust policy allows GitHub Actions workflows from organizations or repositories outside of your control to assume the role. To align with the IAM best practice of granting least privilege, scope it down so that only a specific GitHub organization and the repository that you created earlier can assume it.

  9. Replace the trust policy with the following, making sure to replace {AWSAccountID}, {GitHubOrg}, and {RepositoryName} with your own values.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "arn:aws:iam::{AWSAccountID}:oidc-provider/token.actions.githubusercontent.com"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                    },
                    "StringLike": {
                        "token.actions.githubusercontent.com:sub": "repo:${GitHubOrg}/${RepositoryName}:*"
                    }
                }
            }
        ]
    }

For information on best practices for configuring a role for the GitHub OIDC provider, see Creating a role for web identity or OpenID Connect Federation (console).

Checkpoint

At this point, you’ve created and configured the following resources:

  • A GitHub repository that has been locally cloned and filled with a sample CloudFormation template.
  • An IAM identity provider endpoint for use in conjunction with GitHub’s OIDC provider.
  • A role that can be assumed by GitHub Actions, and a set of associated permissions that allow the role to make requests to IAM Access Analyzer to validate policies.

Step 5: Create a definition for the GitHub Actions workflow

The workflow runs steps on hosted runners. For this example, we are going to use Ubuntu as the operating system for the hosted runners. The workflow runs the following steps on the runner:

  1. The workflow checks out the CloudFormation template by using the community actions/checkout action.
  2. The workflow then uses the aws-actions/configure-aws-credentials GitHub action to request a set of credentials through the IAM identity provider endpoint and the IAM role that you created earlier.
  3. The workflow installs the cfn-policy-validator tool by using the Python package manager, pip.
  4. The workflow runs a validation against the CloudFormation template by using the cfn-policy-validator tool.

The workflow is defined in a YAML document. In order for GitHub Actions to pick up the workflow, you need to place the definition file in a specific location within the repository: .github/workflows/main.yml. Note the “.” prefix in the directory name, indicating that this is a hidden directory.

To create the workflow:

  1. Use the following command to create the folder structure within the locally cloned repository:
    mkdir -p .github/workflows

  2. Create the sample workflow definition file in the .github/workflows directory. Make sure to replace <AWSAccountID> and <AWSRegion> with your own information.
    cat << EOF > .github/workflows/main.yml
    name: cfn-policy-validator-workflow
    
    on: push
    
    permissions:
      id-token: write
      contents: read
    
    jobs: 
      cfn-iam-policy-validation: 
        name: iam-policy-validation
        runs-on: ubuntu-latest
        steps:
          - name: Checkout code
            uses: actions/checkout@v3
    
          - name: Configure AWS Credentials
            uses: aws-actions/configure-aws-credentials@v2
            with:
              role-to-assume: arn:aws:iam::<AWSAccountID>:role/github-actions-access-analyzer-role
              aws-region: <AWSRegion>
              role-session-name: GitHubSessionName
            
          - name: Install cfn-policy-validator
            run: pip install cfn-policy-validator
    
          - name: Validate templates
            run: cfn-policy-validator validate --template-path ./sample-role.yaml --region <AWSRegion>
    EOF
    

Step 6: Test the setup

Now that everything has been set up and configured, it’s time to test.

To test the workflow and validate the IAM policy:

  1. Add and commit the changes to the local repository.
    git add .
    git commit -m 'added sample cloudformation template and workflow definition'

  2. Push the local changes to the remote GitHub repository.
    git push

    After the changes are pushed to the remote repository, go back to https://github.com and open the repository that you created earlier. In the top-right corner of the repository window, there is a small orange indicator, as shown in Figure 2. This shows that your GitHub Actions workflow is running.

    Figure 2: GitHub repository window with the orange workflow indicator

    Because the sample CloudFormation template used a wildcard value “*” in the principal element of the policy as described in the section Step 2: Clone the repository locally, the orange indicator turns to a red x (shown in Figure 3), which signals that something failed in the workflow.

    Figure 3: GitHub repository window with the red cross workflow indicator

  3. Choose the red x to see more information about the workflow’s status, as shown in Figure 4.
    Figure 4: Pop-up displayed after choosing the workflow indicator

  4. Choose Details to review the workflow logs.

    In this example, the Validate templates step in the workflow has failed. A closer inspection shows that there is a blocking finding with the CloudFormation template. As shown in Figure 5, the finding is labelled as EXTERNAL_PRINCIPAL and has a description of Trust policy allows access from external principals.

    Figure 5: Details logs from the workflow showing the blocking finding

    To remediate this blocking finding, you need to update the principal element of the trust policy to include a principal from your AWS account (considered a zone of trust). The resources and principals within your account comprise the zone of trust for the cfn-policy-validator tool. In the initial version of sample-role.yaml, the IAM role's trust policy used a wildcard in the Principal element. This allowed principals outside of your control to assume the associated role, which caused the cfn-policy-validator tool to generate a blocking finding.

    In this case, the intent is that principals within the current AWS account (zone of trust) should be able to assume this role. To achieve this result, replace the wildcard value with the account principal by following the remaining steps.

  5. Open sample-role.yaml by using your preferred text editor, such as nano.
    nano sample-role.yaml

    Replace the wildcard value in the principal element with the account principal arn:aws:iam::<AccountID>:root. Make sure to replace <AccountID> with your own AWS account ID.

    AWSTemplateFormatVersion: "2010-09-09"
    Description: Base stack to create a simple role
    Resources:
      SampleIamRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Statement:
              - Effect: Allow
                Principal:
                  AWS: "arn:aws:iam::<AccountID>:root"
                Action: ["sts:AssumeRole"]
          Path: /      
          Policies:
            - PolicyName: root
              PolicyDocument:
                Version: 2012-10-17
                Statement:
                  - Resource: "*"
                    Effect: Allow
                    Action:
                      - s3:GetObject

  6. Add the updated file, commit the changes, and push the updates to the remote GitHub repository.
    git add sample-role.yaml
    git commit -m 'replacing wildcard principal with account principal'
    git push

After the changes have been pushed to the remote repository, go back to https://github.com and open the repository. The orange indicator in the top right of the window should change to a green tick (check mark), as shown in Figure 6.

Figure 6: GitHub repository window with the green tick workflow indicator

This indicates that no blocking findings were identified, as shown in Figure 7.

Figure 7: Detailed logs from the workflow showing no more blocking findings

Conclusion

In this post, I showed you how to automate IAM policy validation by using GitHub Actions and the IAM Policy Validator for CloudFormation. Although the example was a simple one, it demonstrates the benefits of automating security testing at the start of the development lifecycle. This is often referred to as shifting security left. Identifying misconfigurations early and automatically supports an iterative, fail-fast model of continuous development and testing. Ultimately, this enables teams to make security an inherent part of a system’s design and architecture and can speed up product development workflows.

In addition to the example I covered today, IAM Policy Validator for CloudFormation can validate IAM policies by using a range of IAM Access Analyzer policy checks. For more information about these policy checks, see Access Analyzer reference policy checks.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mitch Beaumont

Mitch is a Principal Solutions Architect for Amazon Web Services, based in Sydney, Australia. Mitch works with some of Australia’s largest financial services customers, helping them to continually raise the security bar for the products and features that they build and ship. Outside of work, Mitch enjoys spending time with his family, photography, and surfing.

Generate machine learning insights for Amazon Security Lake data using Amazon SageMaker

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/generate-machine-learning-insights-for-amazon-security-lake-data-using-amazon-sagemaker/

Amazon Security Lake automatically centralizes the collection of security-related logs and events from integrated AWS and third-party services. With the increasing amount of security data available, it can be challenging to know which data to focus on and which tools to use. You can use native AWS services such as Amazon QuickSight, Amazon OpenSearch Service, and Amazon SageMaker Studio to visualize, analyze, and interactively identify different areas of interest to focus on, and to prioritize efforts to increase your AWS security posture.

In this post, we go over how to generate machine learning insights for Security Lake using SageMaker Studio. SageMaker Studio is a web integrated development environment (IDE) for machine learning that provides tools for data scientists to prepare, build, train, and deploy machine learning models. With this solution, you can quickly deploy a base set of Python notebooks focusing on AWS Security Hub findings in Security Lake, which can also be expanded to incorporate other AWS sources or custom data sources in Security Lake. After you’ve run the notebooks, you can use the results to help you identify and focus on areas of interest related to security within your AWS environment. As a result, you might implement additional guardrails or create custom detectors to alert on suspicious activity.

Prerequisites

  1. Specify a delegated administrator account to manage the Security Lake configuration for all member accounts within your organization.
  2. Security Lake has been enabled in the delegated administrator AWS account.
  3. As part of the solution in this post, we focus on Security Hub as a data source. AWS Security Hub must be enabled for your AWS organization. When enabling Security Lake, select All log and event sources to include AWS Security Hub findings.
  4. Configure subscriber query access to Security Lake. Security Lake uses AWS Lake Formation cross-account table sharing to support subscriber query access. Accept the resource share request in the subscriber AWS account in AWS Resource Access Manager (AWS RAM). Subscribers with query access can query the data that Security Lake collects. These subscribers query Lake Formation tables in an Amazon Simple Storage Service (Amazon S3) bucket with Security Lake data using services such as Amazon Athena.
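If you prefer to script prerequisite 4, the pending resource share invitation can also be accepted with the AWS RAM API. The following is a hedged boto3 sketch run in the subscriber account; it assumes the only pending invitation is the one from Security Lake.

import boto3

# Assumed Region; use the Region where Security Lake shares its resources.
ram = boto3.client("ram", region_name="us-east-1")

invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
for invitation in invitations:
    if invitation["status"] == "PENDING":
        # Accept the pending Security Lake resource share invitation.
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation["resourceShareInvitationArn"]
        )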

Solution overview

Figure 1 that follows depicts the architecture of the solution.

Figure 1: SageMaker machine learning insights architecture for Security Lake

The deployment builds the architecture by completing the following steps:

  1. Security Lake is set up in an AWS account with supported log sources configured, such as Amazon VPC Flow Logs, AWS Security Hub, AWS CloudTrail, and Amazon Route 53.
  2. Subscriber query access is created from the Security Lake AWS account to a subscriber AWS account.

    Note: See Prerequisite #4 for more information.

  3. The AWS RAM resource share request must be accepted in the subscriber AWS account where this solution is deployed.

    Note: See Prerequisite #4 for more information.

  4. A resource link database in Lake Formation is created in the subscriber AWS account and grants access for the Athena tables in the Security Lake AWS account.
  5. A VPC is provisioned for SageMaker with an internet gateway (IGW), a NAT gateway, and VPC endpoints for the AWS services used in the solution. The IGW and NAT gateway are required to install external open-source packages.
  6. A SageMaker Domain for SageMaker Studio is created in VPCOnly mode with a single SageMaker user profile that is tied to a dedicated AWS Identity and Access Management (IAM) role.
  7. A dedicated IAM role is created to restrict access to create and access the presigned URL for the SageMaker Domain from a specific CIDR for accessing the SageMaker notebook.
  8. An AWS CodeCommit repository containing Python notebooks is used for the AI and ML workflow by the SageMaker user-profile.
  9. An Athena workgroup is created for the Security Lake queries with an S3 bucket for output location (access logging configured for the output bucket).

Deploy the solution

You can deploy the SageMaker solution by using either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

Option 1: Deploy the solution with AWS CloudFormation using the console

Use the console to sign in to your subscriber AWS account and then choose the Launch Stack button to open the AWS CloudFormation console pre-loaded with the template for this solution. It takes approximately 10 minutes for the CloudFormation stack to complete.


Option 2: Deploy the solution by using the AWS CDK

You can find the latest code for the SageMaker solution in the SageMaker machine learning insights GitHub repository, where you can also contribute to the sample code. For instructions and more information on using the AWS CDK, see Get Started with AWS CDK.

To deploy the solution by using the AWS CDK

  1. Navigate to the project’s root folder and use the following commands to build the app:
    npm install -g aws-cdk
    npm install

  2. Update IAM_role_assumption_for_sagemaker_presigned_url and security_lake_aws_account default values in source/lib/sagemaker_domain.ts with their respective appropriate values.
  3. Run the following commands in your terminal while authenticated in your subscriber AWS account. Be sure to replace <INSERT_AWS_ACCOUNT> with your account number and replace <INSERT_REGION> with the AWS Region that you want the solution deployed to.
    cdk bootstrap aws://<INSERT_AWS_ACCOUNT>/<INSERT_REGION>
    cdk deploy

Post deployment steps

Now that you’ve deployed the SageMaker solution, you must grant the SageMaker user profile in the subscriber AWS account query access to your Security Lake. You do this by granting the SageMaker user profile permissions to the Security Lake database and table in Lake Formation in the subscriber AWS account, as described in the next two procedures.

Grant permission to the Security Lake database

  1. Copy the SageMaker user profile Amazon Resource Name (ARN): arn:aws:iam::<account-id>:role/sagemaker-user-profile-for-security-lake
  2. Go to Lake Formation in the console.
  3. Select the amazon_security_lake_glue_db_us_east_1 database.
  4. From the Actions Dropdown, select Grant.
  5. In Grant Data Permissions, select SAML Users and Groups.
  6. Paste the SageMaker user profile ARN from Step 1.
  7. In Database Permissions, select Describe and then Grant.

Grant permission to Security Lake – Security Hub table

  1. Copy the SageMaker user profile ARN: arn:aws:iam::<account-id>:role/sagemaker-user-profile-for-security-lake
  2. Go to Lake Formation in the console.
  3. Select the amazon_security_lake_glue_db_us_east_1 database.
  4. Choose View Tables.
  5. Select the amazon_security_lake_table_us_east_1_sh_findings_1_0 table.
  6. From Actions Dropdown, select Grant.
  7. In Grant Data Permissions, select SAML Users and Groups.
  8. Paste the SageMaker user-profile ARN from Step 1.
  9. In Table Permissions, select Describe and then Grant.
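These two console procedures can also be scripted. The following is a hedged boto3 sketch that mirrors the Describe grants above; the account ID is an example placeholder, and the database and table names assume a deployment in us-east-1.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")  # assumed Region

# Example account ID; replace with the subscriber account that owns the user profile role.
principal = {
    "DataLakePrincipalIdentifier":
        "arn:aws:iam::111122223333:role/sagemaker-user-profile-for-security-lake"
}

# Describe permission on the Security Lake database
lf.grant_permissions(
    Principal=principal,
    Resource={"Database": {"Name": "amazon_security_lake_glue_db_us_east_1"}},
    Permissions=["DESCRIBE"],
)

# Describe permission on the Security Hub findings table
lf.grant_permissions(
    Principal=principal,
    Resource={"Table": {
        "DatabaseName": "amazon_security_lake_glue_db_us_east_1",
        "Name": "amazon_security_lake_table_us_east_1_sh_findings_1_0",
    }},
    Permissions=["DESCRIBE"],
)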

Launch your SageMaker Studio application

Now that you have granted permissions for a SageMaker user-profile, we can move on to launching the SageMaker application associated to that user-profile.

  1. Navigate to the SageMaker Studio domain in the console.
  2. Select the SageMaker domain security-lake-ml-insights-<account-id>.
  3. Select the SageMaker user profile sagemaker-user-profile-for-security-lake.
  4. Select the Launch drop-down and select Studio.
    Figure 2: SageMaker domain user-profile AWS console screen

Clone Python notebooks

You’ll work primarily in a data science app created within the SageMaker user profile. As part of the solution deployment, we’ve created Python notebooks in CodeCommit that you will need to clone.

To clone the Python notebooks

  1. Navigate to CloudFormation in the console.
  2. In the Stacks section, select the SageMakerDomainStack.
  3. Select the Outputs tab.
  4. Copy the value for sagemakernotebookmlinsightsrepositoryURL. (For example: https://git-codecommit.us-east-1.amazonaws.com/v1/repos/sagemaker_ml_insights_repo)
  5. Go back to your SageMaker app.
  6. In Studio, in the left sidebar, choose the Git icon (identified by a diamond with two branches), then choose Clone a Repository.
    Figure 3: SageMaker clone CodeCommit repository

  7. Paste the CodeCommit repository link from Step 4 under the Git repository URL (git). After you paste the URL, select Clone “https://git-codecommit.us-east-1.amazonaws.com/v1/repos/sagemaker_ml_insights_repo”, then select Clone.

    NOTE: If you don’t select from the auto-populated drop-down, SageMaker won’t be able to clone the repository.

    Figure 4: SageMaker clone CodeCommit URL

Generating machine learning insights using SageMaker Studio

You’ve successfully pulled the base set of Python notebooks into your SageMaker app, and they can be accessed at sagemaker_ml_insights_repo/notebooks/tsat/. The notebooks provide you with a starting point for running machine learning analysis using Security Lake data. They can be expanded to cover other native or custom data sources being sent to Security Lake.

Figure 5: SageMaker cloned Python notebooks

Notebook #1 – Environment setup

The 0.0-tsat-environ-setup notebook handles the installation of the required libraries and dependencies needed for the subsequent notebooks within this blog. For our notebooks, we use an open-source Python library called Kats, which is a lightweight, generalizable framework to perform time series analysis.

  1. Select the 0.0-tsat-environ-setup.ipynb notebook for the environment setup.

    Note: If you have already provisioned a kernel, you can skip steps 2 and 3.

  2. In the right-hand corner, select No Kernel
  3. In the Set up notebook environment pop-up, leave the defaults and choose Select.
    Figure 6: SageMaker application environment settings

  4. After the kernel has successfully started, choose the Terminal icon to open the image terminal.
    Figure 7: SageMaker application terminal

  5. To install open-source packages from https instead of http, you must update the sources.list file. After the terminal opens, run the following commands:
    cd /etc/apt
    sed -i 's/http:/https:/g' sources.list

  6. Go back to the 0.0-tsat-environ-setup.ipynb notebook, select the Run drop-down, and select Run All Cells. Alternatively, you can run each cell independently, but it’s not required. Grab a coffee! This step will take about 10 minutes.

    IMPORTANT: If you complete the installation out of order or update the requirements.txt file, you might not be able to successfully install Kats and you will need to rebuild your environment by using a net-new SageMaker user profile.

  7. After installing all the prerequisites, check the Kats version to determine if it was successfully installed.
    Figure 8: Kats installation verification

  8. Install PyAthena (a Python DB API client for Amazon Athena), which is used to query your data in Security Lake.

You’ve successfully set up the SageMaker app environment! You can now load the appropriate dataset and create a time series.

Notebook #2 – Load data

The 0.1-load-data notebook establishes the Athena connection to query data in Security Lake and creates the resulting time series dataset. The time series dataset will be used for subsequent notebooks to identify trends, outliers, and change points.
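Before stepping through the cells, it may help to see the shape of what this notebook sets up. The following is a hedged PyAthena sketch, not the notebook's exact code: the staging bucket and Region are placeholders, and the OCSF time column (epoch milliseconds) is an assumption about the findings schema, while the database and table names match the grants from the post-deployment steps.

from pyathena import connect

# Placeholders: use your Athena query results bucket and the Region where
# Security Lake and this solution are deployed.
con = connect(
    s3_staging_dir="s3://<athena-query-results-bucket>/",
    region_name="us-east-1",
)
cursor = con.cursor()

# Count Security Hub findings per day to build a simple time series.
cursor.execute("""
    SELECT date_trunc('day', from_unixtime("time" / 1000)) AS day,
           count(*) AS findings
    FROM "amazon_security_lake_glue_db_us_east_1"."amazon_security_lake_table_us_east_1_sh_findings_1_0"
    GROUP BY 1
    ORDER BY 1
""")
daily_counts = cursor.fetchall()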

  1. Select the 0.1-load-data.ipynb notebook.
  2. If you deployed the solution outside of us-east-1, update the connection (con) details to the appropriate Region. In this example, we’re focusing on Security Hub data within Security Lake. If you want to change the underlying data source, you can update the TABLE value.
    Figure 9: SageMaker notebook load Security Lake data settings

  3. In the Query section, there’s an Athena query that pulls specific data from Security Hub. You can narrow it to a subset of products or expand it to include all products within Security Hub. The query below pulls Security Hub information after 01:00:00 on January 1, 2022, from the products listed in productname.
    Figure 10: SageMaker notebook Athena query

  4. After the values have been updated, you can create your time series dataset. For this notebook, we recommend running each cell individually instead of running all cells at once so you can get a bit more familiar with the process. Select the first cell and choose the Run icon.
    Figure 11: SageMaker run Python notebook code

  5. Follow the same process as Step 4 for the subsequent cells.

    Note: If you encounter any issues with querying the table, make sure you completed the post-deployment step for Grant permission to Security Lake – Security Hub table.

You’ve successfully loaded your data and created a time series! You can now move on to generating machine learning insights from your time series.

Notebook #3 – Trend detector

The 1.1-trend-detector.ipynb notebook handles trend detection in your data. Trend represents a directional change in the level of a time series. This directional change can be either upward (increase in level) or downward (decrease in level). Trend detection helps detect a change while ignoring the noise from natural variability. Each environment is different, and trends help us identify where to look more closely to determine why a trend is positive or negative.
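The slope computation described in the following steps amounts to fitting a line to counts over time. Here is a minimal NumPy sketch of that idea, not the notebook's actual implementation (which uses Kats); the input counts are made up for illustration.

import numpy as np

# Assumed input: per-day finding counts for one finding type, ordered by day.
counts = [12, 15, 14, 18, 21, 19, 24, 26]

x = np.arange(len(counts), dtype=float)  # time index
y = np.array(counts, dtype=float)        # daily counts

slope, intercept = np.polyfit(x, y, 1)   # least-squares line y = slope * x + intercept
print("positive trend" if slope > 0 else "negative trend", round(slope, 3))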

  1. Select 1.1-trend-detector.ipynb notebook for trend detection.
  2. Slopes are created to identify the relationship between x (time) and y (counts).
    Figure 12: SageMaker notebook slope view

  3. If the counts are increasing with time, then it’s considered a positive slope and the reverse is considered a negative slope. A positive slope isn’t necessarily a good thing because in an ideal state we would expect counts of a finding type to come down with time.
    Figure 13: SageMaker notebook trend view

  4. Now you can plot the top five positive and negative trends to identify the top movers.
    Figure 14: SageMaker notebook trend results view

Notebook #4 – Outlier detection

The 1.2-outlier-detection.ipynb notebook handles outlier detection. This notebook does a seasonal decomposition of the input time series, with additive or multiplicative decomposition as specified (the default is additive). It then works on the residual time series, removing only the trend, or both trend and seasonality if the seasonality is strong. The intent is to discover useful, abnormal, and irregular patterns within data sets, allowing you to pinpoint areas of interest.
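For orientation, here is a hedged sketch of how the open-source Kats OutlierDetector is typically invoked. The notebook's cells and its iqr_mult_thresh variable may differ from this, and the input DataFrame here is a stand-in for the time series built by the 0.1-load-data notebook.

import pandas as pd
from kats.consts import TimeSeriesData
from kats.detectors.outlier import OutlierDetector

# Assumed input shape: a DataFrame with 'time' and 'value' columns, such as
# per-day Security Hub finding counts.
df = pd.DataFrame({
    "time": pd.date_range("2022-01-01", periods=90, freq="D"),
    "value": range(90),
})
ts = TimeSeriesData(df)

# Additive decomposition of the series; iqr_mult controls how many IQRs a
# residual must exceed before it is flagged (a larger value flags fewer points).
detector = OutlierDetector(ts, "additive", iqr_mult=5)
detector.detector()
print(detector.outliers)  # timestamps flagged as outliers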

  1. To start, it detects points in the residual that are over 5 times the inter-quartile range.
  2. Inter-quartile range (IQR) is the difference between the seventy-fifth and twenty-fifth percentiles of residuals or the spread of data within the middle two quartiles of the entire dataset. IQR is useful in detecting the presence of outliers by looking at values that might lie outside of the middle two quartiles.
  3. The IQR multiplier controls the sensitivity of the range and the decision to identify outliers. A larger value for the iqr_mult_thresh parameter in OutlierDetector flags fewer data points as outliers, while a smaller value flags more data points as outliers.

    Note: If you don’t have enough data, decrease iqr_mult_thresh to a lower value (for example iqr_mult_thresh=3).

    Figure 15: SageMaker notebook outlier setting

  4. Along with the outlier detection plots, investigation SQL is displayed, which can help you investigate the outliers further.

    In the diagram that follows, you can see that there are several outliers in the number of findings, related to failed AWS Firewall Manager policies, which have been identified by the vertical red lines within the line graph. These are outliers because they deviate from the normal behavior and number of findings on a day-to-day basis. When you see outliers, you can look at the resources that might have caused an unusual increase in Firewall Manager policy findings. Depending on the findings, it could be related to an overly permissive or noncompliant security group or a misconfigured AWS WAF rule group.

    Figure 16: SageMaker notebook outlier results view

Notebook #5 – Change point detection

The 1.3-changepoint-detector.ipynb notebook handles change point detection. Change point detection is a method to detect changes in a time series that persist over time, such as a change in the mean value. The notebook establishes a baseline and then identifies when one or more changes might have occurred from that point. Change points occur when there’s an increase or decrease in the average number of findings within a data set.

  1. Along with identifying change points within the data set, the investigation SQL is generated to further investigate the specific change point if applicable.

    In the following diagram, you can see there’s a change point decrease after July 27, 2022, with confidence of 99.9 percent. It’s important to note that change points differ from outliers, which are sudden changes in the data set observed. This diagram means there was some change in the environment that resulted in an overall decrease in the number of findings for S3 buckets with block public access being disabled. The change could be the result of an update to the CI/CD pipelines provisioning S3 buckets or automation to enable all S3 buckets to block public access. Conversely, if you saw a change point that resulted in an increase, it could mean that there was a change that resulted in a larger number of S3 buckets with a block public access configuration consistently being disabled.

    Figure 17: SageMaker changepoint detector view

By now, you should be familiar with the setup and deployment of SageMaker Studio and how you can use Python notebooks to generate machine learning insights for your Security Lake data. You can take what you’ve learned and start to curate specific datasets and data sources within Security Lake, create a time series, detect trends, and identify outliers and change points. By doing so, you can answer a variety of security-related questions such as:

  • CloudTrail

    Is there a large volume of Amazon S3 download or copy commands to an external resource? Are you seeing a large volume of S3 delete object commands? Is it possible there’s a ransomware event going on?

  • VPC Flow Logs

    Is there an increase in the number of requests from your VPC to external IPs? Is there an increase in the number of requests from your VPC to your on-premises CIDR? Is there a possibility of internal or external data exfiltration occurring?

  • Route53

    Which resources are making DNS requests that they haven’t typically made within the last 30–45 days? When did it start? Is there a potential command and control session occurring on an Amazon Elastic Compute Cloud (Amazon EC2) instance?

It’s important to note that this isn’t a solution to replace Amazon GuardDuty, which uses foundational data sources to detect communication with known malicious domains and IP addresses and identify anomalous behavior, or Amazon Detective, which provides customers with prebuilt data aggregations, summaries, and visualizations to help security teams conduct faster and more effective investigations. One of the main benefits of using Security Lake and SageMaker Studio is the ability to interactively create and tailor machine learning insights specific to your AWS environment and workloads.

Clean up

If you deployed the SageMaker machine learning insights solution by using the Launch Stack button in the AWS Management Console or the CloudFormation template sagemaker_ml_insights_cfn, do the following to clean up:

  1. In the CloudFormation console for the account and Region where you deployed the solution, choose the SageMakerML stack.
  2. Choose the option to Delete the stack.

If you deployed the solution by using the AWS CDK, run the command cdk destroy.

Conclusion

Amazon Security Lake gives you the ability to normalize and centrally store your security data from various log sources to help you analyze, visualize, and correlate appropriate security logs. You can then use this data to increase your overall security posture by implementing additional security guardrails or take appropriate remediation actions within your AWS environment.

In this blog post, you learned how you can use SageMaker to generate machine learning insights for your Security Hub findings in Security Lake. Although the example solution focuses on a single data source within Security Lake, you can expand the notebooks to incorporate other native or custom data sources in Security Lake.

There are many different use-cases for Security Lake that can be tailored to fit your AWS environment. Take a look at this blog post to learn how you can ingest, transform and deliver Security Lake data to Amazon OpenSearch to help your security operations team quickly analyze security data within your AWS environment. In supported Regions, new Security Lake account holders can try the service free for 15 days and gain access to its features.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Jonathan Nguyen

Jonathan is a Principal Security Architect at AWS. His background is in AWS security, with a focus on threat detection and incident response. He helps enterprise customers develop a comprehensive AWS security strategy, deploy security solutions at scale, and train customers on AWS security best practices.

Madhunika Reddy Mikkili

Madhunika is a Data and Machine Learning Engineer with the AWS Professional Services Shared Delivery Team. She is passionate about helping customers achieve their goals through the use of data and machine learning insights. Outside of work, she loves traveling and spending time with family and friends.

How we designed Cedar to be intuitive to use, fast, and safe

Post Syndicated from Emina Torlak original https://aws.amazon.com/blogs/security/how-we-designed-cedar-to-be-intuitive-to-use-fast-and-safe/

This post is a deep dive into the design of Cedar, an open source language for writing and evaluating authorization policies. Using Cedar, you can control access to your application’s resources in a modular and reusable way. You write Cedar policies that express your application’s permissions, and the application uses Cedar’s authorization engine to decide which access requests to allow. This decouples access control from the application logic, letting you write, update, audit, and reuse authorization policies independently of application code.

Cedar’s authorization engine is built to a high standard of performance and correctness. Application developers report typical authorization latencies of less than 1 ms, even with hundreds of policies. The resulting authorization decision — Allow or Deny — is provably correct, thanks to the use of verification-guided development. This high standard means your application can use Cedar with confidence, just like Amazon Web Services (AWS) does as part of the Amazon Verified Permissions and AWS Verified Access services.

Cedar’s design is based on three core tenets: usability, speed, and safety. Cedar policies are intuitive to read because they’re defined using your application’s vocabulary—for example, photos organized into albums for a photo-sharing application. Cedar’s policy structure reflects common authorization use cases and enables fast evaluation. Cedar’s semantics are intuitive and safer by default: policies combine to allow or deny access according to rules you already know from AWS Identity and Access Management (IAM).

This post shows how Cedar’s authorization semantics, data model, and policy syntax work together to make the Cedar language intuitive to use, fast, and safe. We cover each of these in turn and highlight how their design reflects our tenets.

The Cedar authorization semantics: Default deny, forbid wins, no ordering

We show how Cedar works on an example application for sharing photos, called PhotoFlash, illustrated in Figure 1.

Figure 1: An example PhotoFlash account. User Jane has two photos, four albums, and three user groups

PhotoFlash lets users like Jane upload photos to the cloud, tag them, and organize them into albums. Jane can also share photos with others, for example, letting her friends view photos in her trips album. PhotoFlash provides a point-and-click interface for users to share access, and then stores the resulting permissions as Cedar policies.

When a user attempts to perform an action on a resource (for example, view a photo), PhotoFlash calls the Cedar authorization engine to determine whether access is allowed. The authorizer evaluates the stored policies against the request and application-specific data (such as a photo’s tags) and returns Allow or Deny. If it returns Allow, PhotoFlash proceeds with the action. If it returns Deny, PhotoFlash reports that the action is not permitted.

Let’s look at some policies and see how Cedar evaluates them to authorize requests safely and simply.

Default deny

To let Jane’s friends view photos in her trips album, PhotoFlash generates and stores the following Cedar permit policy:

// Policy A: Jane's friends can view photos in Jane's trips album.
permit(
  principal in Group::"jane/friends", 
  action == Action::"viewPhoto",
  resource in Album::"jane/trips");

Cedar policies define who (the principal) can do what (the action) on what asset (the resource). This policy allows the principal (a PhotoFlash User) in Jane’s friends group to view the resources (a Photo) in Jane’s trips album.

Cedar’s authorizer grants access only if a request satisfies a specific permit policy. This semantics is default deny: Requests that don’t satisfy any permit policy are denied.

Given only our example Policy A, the authorizer will allow Alice to view Jane’s flower.jpg photo. Alice’s request satisfies Policy A because Alice is one of Jane’s friends (see Figure 1). But the authorizer will deny John’s request to view this photo. That’s because John isn’t one of Jane’s friends, and there is no other permit that grants John access to Jane’s photos.

Forbid wins

While PhotoFlash allows individual users to choose their own permissions, it also enforces system-wide security rules.

For example, PhotoFlash wants to prevent users from performing actions on resources that are owned by someone else and tagged as private. If a user (Jane) accidentally permits someone else (Alice) to view a private photo (receipt.jpg), PhotoFlash wants to override the user-defined permission and deny the request.

In Cedar, such guardrails are expressed as forbid policies:

// Policy B: Users can't perform any actions on private resources they don't own.
forbid(principal, action, resource)
when {
  resource.tags.contains("private") &&
  !(resource in principal.account)
};

This PhotoFlash policy says that a principal is forbidden from taking an action on a resource when the resource is tagged as private and isn’t contained in the principal’s account.

Cedar’s authorizer makes sure that forbids override permits. If a request satisfies a forbid policy, it’s denied regardless of what permissions are satisfied.

For example, the authorizer will deny Alice’s request to view Jane’s receipt.jpg photo. This request satisfies Policy A because Alice is one of Jane’s friends. But it also satisfies the guardrail in Policy B because the photo is tagged as private. The guardrail wins, and the request is denied.

No ordering

Cedar’s authorization decisions are independent of the order the policies are evaluated in. Whether the authorizer evaluates Policy A first and then Policy B, or the other way around, doesn’t matter. As you’ll see later, the Cedar language design ensures that policies can be evaluated in any order to reach the same authorization decision. To understand the combined meaning of multiple Cedar policies, you need only remember that access is allowed if the request satisfies a permit policy and there are no applicable forbid policies.

Safe by default and intuitive

We’ve proved (using automated reasoning) that Cedar’s authorizer satisfies the default deny, forbids override permits, and order independence properties. These properties help make Cedar’s behavior safe by default and intuitive. Amazon IAM has the same properties. Cedar builds on more than a decade of IAM experience by formalizing and enforcing these properties as parts of its design.

Now that we’ve seen how Cedar authorizes requests, let’s look at how its data model and syntax support writing policies that are quick to read and evaluate.

The Cedar data model: entities with attributes, arranged in a hierarchy

Cedar policies are defined in terms of a vocabulary specific to your application. For example, PhotoFlash organizes photos into albums and users into groups while a task management application organizes tasks into lists. You reflect this vocabulary into Cedar’s data model, which organizes entities into a hierarchy. Entities correspond to objects within your application, such as photos and users. The hierarchy reflects grouping of entities, such as nesting of photos into albums. Think of it as a directed-acyclic graph. Figure 2 shows the entity hierarchy for PhotoFlash that matches Figure 1.

Figure 2: An example hierarchy for PhotoFlash, matching the illustration in Figure 1

Entities are stored objects that serve as principals, resources, and actions in Cedar policies. Policies refer to these objects using entity references, such as Album::"jane/art".

Policies use the in operator to check if the hierarchy relates two entities. For example, Photo::"flower.jpg" in Account::"jane" is true for the hierarchy in Figure 2, but Photo::"flower.jpg" in Album::"jane/conference" is not. PhotoFlash can persist the entity hierarchy in a dedicated entity store, or compute the relevant parts as needed for an authorization request.

Each entity also has a record that maps named attributes to values. An attribute stores a Cedar value: an entity reference, record, string, 64-bit integer, boolean, or a set of values. For example, Photo::"flower.jpg" has attributes describing the photo’s metadata, such as tags, which is a set of strings, and raw, which is an entity reference to another Photo. Cedar supports a small collection of operators that can be applied to values; these operators are carefully chosen to enable efficient evaluation.

Built-in support for role and attribute-based access control

If the concepts you’ve seen so far seem familiar, that’s not surprising. Cedar’s data model is designed to allow you to implement time-tested access control models, including role-based and attribute-based access control (RBAC and ABAC). The entity hierarchy and the in operator support RBAC-style roles as groups, while entity records and the . operator let you express ABAC-style permissions using per-object attributes.

The Cedar syntax: Structured, loop-free, and stateless

Cedar uses a simple, structured syntax for writing policies. This structure makes Cedar policies simple to understand and fast to authorize at scale. Let’s see how by taking a closer look at Cedar’s syntax.

Structure for readability and scalable authorization

Figure 3 illustrates the structure of Cedar policies: an effect and scope, optionally followed by one or more conditions.

The effect of a policy is to either permit or forbid access. The scope can use equality (==) or membership (in) constraints to restrict the principals, actions, and resources to which the policy applies. Policy conditions are expressions that further restrict when the policy applies.

This structure makes policies straightforward to read and understand: The scope expresses an RBAC rule, and the conditions express ABAC rules. For example, PhotoFlash Policy A has no conditions and expresses a single RBAC rule. Policy B has an open (unconstrained) scope and expresses a single ABAC rule. A quick glance is enough to see if a policy is just an RBAC rule, just an ABAC rule, or a mix of both.

Figure 3: Cedar policy structure, illustrated on PhotoFlash Policy A and B

Scopes also enable scalable authorization for large policy stores through policy slicing. This is a property of Cedar that lets applications authorize a request against a subset of stored policies, supporting real-time decisions even for stores with thousands of policies. With slicing, an application needs to pass a policy to the authorizer only when the request’s principal and resource are descendants of the principal and resource entities specified in the policy’s scope. For example, PhotoFlash needs to include Policy A only for requests that involve the descendants of Group::"jane/friends" and Album::"jane/trips". But Policy B must be included for all requests because of its open scope.

No loops or state for fast evaluation and intuitive decisions

Policy conditions are Boolean-valued expressions. The Cedar expression language has a familiar syntax that includes if-then-else expressions, short-circuiting Boolean operators (!, &&, ||), and basic operations on Cedar values. Notably, there is no way to express looping or to change the application state (for example, mutate an attribute).

Cedar excludes loops to bound authorization latency. With no loops or costly built-in operators, Cedar policies terminate in O(n²) steps in the worst case (when conditions contain certain set operations), or O(n) in the common case.

Cedar also excludes stateful operations for performance and understandability. Since policies can’t change the application state, their evaluation can be parallelized for better performance, and you can reason about them in any order to see what accesses are allowed.

Learn more

In this post, we explored how Cedar’s design supports intuitive, fast, and safe authorization. With Cedar, your application’s access control rules become standalone policies that are clear, auditable, and reusable. You enforce these policies by calling Cedar’s authorizer to decide quickly and safely which requests are allowed. To learn more, see how to use Cedar to secure your app, and how we built Cedar to a high standard of assurance. You can also visit the Cedar website and blog, try it out in the Cedar playground, and join us on Cedar’s Slack channel.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Emina Torlak

Emina is a Senior Principal Applied Scientist at Amazon Web Services and an Associate Professor at the University of Washington. Her research aims to help developers build better software more easily. She develops languages and tools for program verification and synthesis. Emina co-leads the development of Cedar.

Amazon CloudWatch metrics for Amazon OpenSearch Service storage and shard skew health

Post Syndicated from Nikhil Agarwal original https://aws.amazon.com/blogs/big-data/amazon-cloudwatch-metrics-for-amazon-opensearch-service-storage-and-shard-skew-health/

Amazon OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in AWS to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite.

When working with OpenSearch Service, shard strategy is key. Shards distribute your workload across the data nodes of your cluster. When creating an index, you tell OpenSearch Service how many primary shards to create and how many replicas to create of each shard. The primary shards are independent partitions of the full dataset. OpenSearch Service automatically distributes your data across the primary shards in an index. Our recommendation is to use two replicas for your index. For example, if you set your index’s shard count to three primary shards and two replicas, you will have a total of nine shards. Properly configured indexes can help boost overall domain performance, whereas a misconfigured index will lead to storage and performance skew.
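As a quick check on that arithmetic, here is a tiny sketch with hypothetical index settings; the variable names are illustrative and are not OpenSearch API parameters.

# Hypothetical index settings matching the example above
number_of_primary_shards = 3
number_of_replicas = 2  # replica copies of each primary shard

total_shards = number_of_primary_shards * (1 + number_of_replicas)
print(total_shards)  # 9: three primaries plus two replica copies of each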

OpenSearch Service distributes the shards in your indexes to the data nodes in your domain, ensuring that no primary shard and its replicas are placed on the same node. The data for the shards is stored in the node’s storage. If your indexes (and therefore their shards) are of very different sizes, the storage used on the data nodes in the domain will be unequal, or skewed. Storage skew leads to uneven memory and CPU utilization, intermittent and uneven latency, and uneven queueing and rejection of requests. Therefore, it’s important to configure and maintain indexes so that shards can be distributed evenly across the data nodes of your cluster.

In this post, we explore how to deploy Amazon CloudWatch metrics using an AWS CloudFormation template to monitor an OpenSearch Service domain’s storage and shard skew. This solution uses an AWS Lambda function to extract storage and shard distribution metadata from your OpenSearch Service domain, calculates the level of skew, and then pushes this information to CloudWatch metrics so that you can easily monitor, alert, and respond.

Solution overview

The solution and associated resources are available for you to deploy into your own AWS account as a CloudFormation template. The template deploys the following resources:

  • An AWS Identity and Access Management (IAM) role for the Lambda function called OpensearchSkewMetricsLambdaRole. This allows write access to CloudWatch metrics and access to the CloudWatch log group and OpenSearch APIs.
  • An AWS Lambda function called Opensearch-SkewMetricsPublisher-py.
  • An Amazon CloudWatch log group for the Lambda function called /aws/lambda/Opensearch-skewmetrics-publisher-py.
  • An Amazon EventBridge rule for the Lambda function called EventRuleForOSSkew.
  • The following CloudWatch metrics for the Lambda function:
    • aws_/<region-name>/<MetricIdentifier>/_storagemetric
    • aws_/<region-name>/<MetricIdentifier>/_shardmetric

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • An OpenSearch Service domain.
  • This post requires you to add a Lambda role to the OpenSearch Service domain’s security configuration access policy. If your domain is using fine-grained access control, then you need to follow the steps as described in the section Mapping roles to users to enable access for the newly deployed Lambda execution role to the domain after deploying the CloudFormation template.

Deploy the CloudFormation template

To deploy the CloudFormation template, complete the following steps:

  1. Log in to your AWS account.
  2. Select the Region where you’re running your OpenSearch Service domain.
  3. To launch your CloudFormation stack, choose Launch Stack.
  4. For Stack name, enter a name for the stack (maximum length 30 characters).
  5. For MetricIdentifier, enter a unique identifier that will help you identify the custom CloudWatch metrics for your domain.
  6. For OpensearchDomainURL, enter the domain endpoint that you are monitoring.
  7. Choose Next.
  8. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Create stack.
  9. Wait for the stack creation to complete.
  10. On the Lambda console, choose Functions in the navigation pane.
  11. Choose the Lambda function called Opensearch-SkewMetricsPublisher-py-<stackname>.
  12. In the Code section, choose Test.
  13. Keep the default values for the test event and run a quick test.

Make sure to grant the Lambda execution role permission in the OpenSearch Service domain’s resource-based policy, if you are using one. If fine-grained access control is enabled on the domain, then follow the steps in Mapping roles to users (as mentioned in the prerequisites) to give the Lambda function read-only access to the domain.

The Lambda function that sends OpenSearch domain metrics to CloudWatch is set to a default frequency of 1 day. You can change this configuration to monitor the domain at the required granularity by updating the event schedule for the rule deployed by the CloudFormation stack on the EventBridge console. Note that if the frequency is set to 1 minute, this will trigger the Lambda function every minute and will increase the Lambda cost.
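If you prefer to change the schedule programmatically instead of through the EventBridge console, a minimal boto3 sketch might look like the following. It assumes the rule keeps the EventRuleForOSSkew name from the template; verify the actual rule name in your account, because the deployed name may include a stack-specific suffix.

import boto3

events = boto3.client("events")

# Hypothetical example: run the skew-publishing Lambda function every 12 hours.
# Confirm the deployed rule name before running this.
events.put_rule(
    Name="EventRuleForOSSkew",
    ScheduleExpression="rate(12 hours)",
    State="ENABLED",
)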

This solution uses the _cat/allocation API, which provides the number of data nodes in the domain along with each data node’s shard count and storage usage attributes. For further details on domain storage and shard skew, refer to Node shard and storage skew. The Lambda function computes and sorts each data node’s storage and shard skew relative to the average value. A data node whose skew is more than 10% above the average is generally considered significantly skewed. This starts to impact CPU, network, and disk bandwidth usage, because the nodes with the highest storage utilization tend to be the resource-strained nodes, whereas nodes well below the average represent underutilized capacity.
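To make the skew calculation concrete, the following is a minimal sketch, not the deployed Lambda code, of how per-node deviation from the domain average could be computed and published as custom CloudWatch metrics. The node data, metric name, and namespace are assumptions for illustration; the real function parses this information from the _cat/allocation response and uses names derived from MetricIdentifier.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumed to be parsed from the domain's _cat/allocation?format=json response.
nodes = [
    {"node_id": "node-1", "disk_used_gb": 120.0, "shards": 42},
    {"node_id": "node-2", "disk_used_gb": 95.0, "shards": 40},
    {"node_id": "node-3", "disk_used_gb": 60.0, "shards": 20},
]

avg_storage = sum(n["disk_used_gb"] for n in nodes) / len(nodes)

metric_data = []
for node in nodes:
    # Percent deviation of this node's storage from the domain average;
    # values more than 10% above the average indicate significant storage skew.
    skew_pct = (node["disk_used_gb"] - avg_storage) / avg_storage * 100
    metric_data.append({
        "MetricName": "StorageSkewPercent",
        "Dimensions": [{"Name": "NodeID", "Value": node["node_id"]}],
        "Value": skew_pct,
        "Unit": "Percent",
    })

# Hypothetical namespace for illustration only.
cloudwatch.put_metric_data(Namespace="OpenSearchSkewDemo", MetricData=metric_data)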

Refer to Demystifying Elasticsearch shard allocation for details related to shard size and shard count strategy. In general, we recommend keeping shard sizes between 10–30 GB for workloads where search latency is a key performance objective and 30–50 GB for write-heavy workloads. For shard count, we recommend maintaining index shard counts that are divisible by the data node count. For additional details, refer to Sizing Amazon OpenSearch Service domains and Shard strategy.

View skew metrics in CloudWatch

After you run this solution in your account, it will create two CloudWatch metrics for monitoring. To access these CloudWatch metrics, use the following steps:

  1. On the CloudWatch console, under Metrics in the navigation pane, choose All metrics.
  2. Choose Browse and select Custom namespaces. You should see two custom metrics ending with _storagemetric and _shardmetric, respectively.
  3. Choose either of the custom metrics and then select NodeID.
  4. On the list of node IDs, select all the nodes displayed in the list, and the graph will be plotted automatically.

You can hover the mouse over the plotted lines to see the node skew information.

The following screenshots show examples of how the CloudWatch metrics will appear on the console.

The storage skew metrics will be similar to the following screenshot. Storage skew metrics show the domain storage skew. If you hover over the graph, it shows the list of available nodes in the domain, sorted by storage size (largest to smallest). The Lambda function will periodically post the latest storage skew results.

The shard skew metrics will be similar to the following screenshot. Shard skew metrics show the domain shard skew. If you hover over the graph, it shows the list of available nodes in the domain, sorted by shard count (largest to smallest). The Lambda function will periodically post the latest shard skew results.

Storage skew occurs when one or more nodes within the domain have significantly more storage than the other nodes. The CloudWatch metric will show a higher deviation of storage usage for these nodes compared to the others. Similarly, shard skew occurs when one or more nodes have significantly more shards than the other nodes, and the CloudWatch metric will show a higher deviation for these nodes. When storage or shard skew is detected in the domain, you can raise a support case to work with the AWS team on remediation actions. See How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster for information on how to configure your domain shard strategy for optimal performance.

Costs

The cost associated with using this solution is minimal, around a few cents per month, because it generates CloudWatch metrics. The solution also runs Lambda code, and in this case the Lambda functions make API calls. For pricing details, refer to Amazon CloudWatch Pricing and AWS Lambda Pricing.

Clean up

If you decide that you no longer want to keep the Lambda function and associated resources, you can navigate to the AWS CloudFormation console, choose the stack, and choose Delete.

If you want to add the CloudWatch skew monitor metrics mechanism back in at any point, you can create the stack again from the CloudFormation template.

Conclusion

You can use this solution to get a better understanding of your OpenSearch Service domain’s storage and shard skew to improve its performance and possibly lower the cost of operating your domain. See Use Elasticsearch’s _rollover API For efficient storage distribution for more details related to shard allocation and efficient storage distribution strategy.


About the authors

Nikhil Agarwal is a Sr. Technical Manager with Amazon Web Services. He is passionate about helping customers achieve operational excellence in their cloud journey and working actively on technical solutions. He is also an AI/ML enthusiast and dives deep into customers’ ML-specific use cases. Outside of work, he enjoys traveling with family and exploring different gadgets.

Karthik Chemudupati is a Principal Technical Account Manager (TAM) with AWS, focused on helping customers achieve cost optimization and operational excellence. He has more than 19 years of IT experience in software engineering, cloud operations, and automation. Karthik joined AWS in 2016 as a TAM and has worked with more than a dozen enterprise customers across US-West. Outside of work, he enjoys spending time with his family.

Gene Alpert is a Senior Analytics Specialist with AWS Enterprise Support. He has been focused on our Amazon OpenSearch Service customers and ecosystem for the past three years. Gene joined AWS in 2017. Outside of work he enjoys mountain biking, traveling, and playing Population:One in VR.

How to Connect Your On-Premises Active Directory to AWS Using AD Connector

Post Syndicated from Jeremy Cowan original https://aws.amazon.com/blogs/security/how-to-connect-your-on-premises-active-directory-to-aws-using-ad-connector/

August 17, 2023: We updated the instructions and screenshots in this post to align with changes to the AWS Management Console.

April 25, 2023: We’ve updated this blog post to include more security learning resources.


AD Connector is designed to give you an easy way to establish a trusted relationship between your Active Directory and AWS. When AD Connector is configured, the trust allows you to:

  • Sign in to AWS applications such as Amazon WorkSpaces, Amazon WorkDocs, and Amazon WorkMail by using your Active Directory credentials.
  • Seamlessly join Windows instances to your Active Directory domain either through the Amazon EC2 launch wizard or programmatically through the EC2 Simple System Manager (SSM) API.
  • Provide federated sign-in to the AWS Management Console by mapping Active Directory identities to AWS Identity and Access Management (IAM) roles.

AD Connector cannot be used with your custom applications, because it only provides secure AWS integration for the three use cases mentioned above. Custom applications that rely on your on-premises Active Directory should communicate with your domain controllers directly or use AWS Managed Microsoft AD rather than integrating with AD Connector. To learn more about which AWS Directory Service solution works best for your organization, see the service documentation.

With AD Connector, you can streamline identity management by extending your user identities from Active Directory. It also enables you to reuse your existing Active Directory security policies such as password expiration, password history, and account lockout policies. Also, your users will no longer need to remember yet another user name and password combination. Since AD Connector doesn’t rely on complex directory synchronization technologies or Active Directory Federation Services (AD FS), you can forego the added cost and complexity of hosting a SAML-based federation infrastructure. In sum, AD Connector helps foster a hybrid environment by allowing you to leverage your existing on-premises investments to control different facets of AWS.

This blog post will show you how AD Connector works as well as walk through how to enable federated console access, assign users to roles, and seamlessly join an EC2 instance to an Active Directory domain.

AD Connector – Under the Hood

AD Connector is a dual Availability Zone proxy service that connects AWS apps to your on-premises directory. AD Connector forwards sign-in requests to your Active Directory domain controllers for authentication and provides the ability for applications to query the directory for data. When you configure AD Connector, you provide it with service account credentials that are securely stored by AWS. This account is used by AWS to enable seamless domain join, single sign-on (SSO), and AWS Applications (WorkSpaces, WorkDocs, and WorkMail) functionality. Given AD Connector’s role as a proxy, it does not store or cache user credentials. Rather, authentication, lookup, and management requests are handled by your Active Directory.

In order to create an AD Connector, you must also provide a pair of DNS IP addresses during setup. These are used by AD Connector to retrieve Service (SRV) DNS records and locate the nearest domain controllers to route requests to. The AD Connector proxy instances use an algorithm similar to the Active Directory domain controller locator process to decide which domain controllers to connect to for LDAP and Kerberos requests.
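For reference, the SRV lookup that a domain controller locator performs looks roughly like the following sketch, which uses the third-party dnspython package; the domain name is a placeholder, and AD Connector does this work for you.

import dns.resolver  # pip install dnspython

# Placeholder domain; Active Directory publishes domain controller SRV records under _ldap._tcp.
answers = dns.resolver.resolve("_ldap._tcp.dc._msdcs.corp.example.com", "SRV")

# Lower priority is preferred; higher weight wins within the same priority.
for record in sorted(answers, key=lambda r: (r.priority, -r.weight)):
    print(f"{record.target.to_text()}:{record.port} (priority={record.priority})")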

For authentication to AWS applications and the AWS Management Console, you can configure an access URL from the AWS Directory Service console. This access URL is in the format of https://<alias>.awsapps.com and provides a publicly accessible sign-in page. You can visit https://<alias>.awsapps.com/workdocs to sign in to WorkDocs, and https://<alias>.awsapps.com/console to sign in to the AWS Management Console. The following image shows the sign-in page for the AWS Management Console.

Figure 1: Login

Figure 1: Login

For added security you can enable multi-factor authentication (MFA) for AD Connector, but you’ll need to have an existing RADIUS infrastructure in your on-premises network set up to leverage this feature. See AD Connector – Multi-factor Authentication Prerequisites for more information about requirements and configuration. With MFA enabled with AD Connector, the sign-in page hosted at your access URL will prompt users for an MFA code in addition to their standard sign-in credentials.

AD Connector comes in two sizes: small and large. A large AD Connector runs on more powerful compute resources and is more expensive than a small AD Connector. Depending on the volume of traffic to be proxied by AD Connector, you’ll want to select the appropriate size for your needs.

Figure 2: Directory size

Figure 2: Directory size

AD Connector is highly available, meaning the underlying hosts are deployed across multiple Availability Zones in the Region where you deploy it. In the event of a host-level failure, Directory Service will promptly replace failed hosts. Directory Service also applies performance and security updates automatically to AD Connector.

The following diagram illustrates the authentication flow and network path when you enable AWS Management Console access:

  1. A user opens the secure custom sign-in page and supplies their Active Directory user name and password.
  2. The authentication request is sent over SSL to AD Connector.
  3. AD Connector performs LDAP authentication to Active Directory.

    Note: AD Connector locates the nearest domain controllers by querying the SRV DNS records for the domain.

  4. After the user has been authenticated, AD Connector calls the STS AssumeRole method to get temporary security credentials for that user. Using those temporary security credentials, AD Connector constructs a sign-in URL that users use to access the console.

    Note: If a user is mapped to multiple roles, the user will be presented with a choice at sign-in as to which role they want to assume. The user session is valid for 1 hour.

    Figure 3: Authentication flow and network path

    Figure 3: Authentication flow and network path
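The sign-in URL construction described in step 4 follows the standard AWS federation pattern. The following sketch shows what that flow looks like with boto3 and the AWS federation endpoint; the role ARN, session name, and issuer are placeholder values, and AD Connector performs the equivalent steps on your behalf.

import json
import urllib.parse

import boto3
import requests

sts = boto3.client("sts")

# Placeholder role; AD Connector assumes the IAM role mapped to the authenticated user.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/EC2ReadOnly",
    RoleSessionName="ad-user-example",
)["Credentials"]

session_json = json.dumps({
    "sessionId": creds["AccessKeyId"],
    "sessionKey": creds["SecretAccessKey"],
    "sessionToken": creds["SessionToken"],
})

# Exchange the temporary credentials for a short-lived sign-in token.
resp = requests.get(
    "https://signin.aws.amazon.com/federation",
    params={"Action": "getSigninToken", "Session": session_json},
)
signin_token = resp.json()["SigninToken"]

# Build the console sign-in URL that the user is redirected to.
login_url = "https://signin.aws.amazon.com/federation?" + urllib.parse.urlencode({
    "Action": "login",
    "Issuer": "https://example.awsapps.com",
    "Destination": "https://console.aws.amazon.com/",
    "SigninToken": signin_token,
})
print(login_url)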

Before getting started with configuring AD Connector for federated AWS Management Console access, be sure you’ve read and understand the prerequisites for AD Connector. For example, as shown in Figure 3 there must be a VPN or Direct Connect circuit in place between your VPC and your on-premises environment. Your domain also has to be running at Windows 2003 functional level or later. Also, various ports have to be opened between your VPC and your on-premises environment to allow AD Connector to communicate with your on-premises directory.

Configuring AD Connector for federated AWS Management Console access

Enable console access

To allow users to sign in with their Active Directory credentials, you need to explicitly enable console access. You can do this by opening the Directory Service console and clicking the Directory ID name (Figure 4).

This opens the Directory Details page, where you’ll find a dropdown menu on the Apps & Services tab to enable the directory for AWS Management Console access.

Figure 4: Directories

Figure 4: Directories

Choose the Application management tab as seen in Figure 5.

Figure 5: Application Management

Figure 5: Application Management

Scroll down to AWS Management Console as shown in Figure 6, and choose Enable from the Actions dropdown list.

Figure 6: Enable console access

Figure 6: Enable console access

After enabling console access, you’re ready to start configuring roles and associating Active Directory users and groups with those roles.

Follow these steps to create a new role. When you create a new role through the Directory Service console, AD Connector automatically adds a trust relationship to Directory Service. The following code example shows the IAM trust policy for the role, after a role is created.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ds.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:externalid": "482242153642"
        }
      }
    }
  ]
}

Assign users to roles

Now that AD Connector is configured and you’ve created a role, your next job is to assign users or groups to those IAM roles. Role mapping governs which resources a user can access within AWS. To do this, complete the following steps:

  1. Open the Directory Service console and navigate to the AWS Management Console section.
  2. In the search bar, type the name of the role you just created.
  3. Select the role that you just created by choosing the name under the IAM role field.
  4. Choose Add, and enter the name of the user or group you want to find and assign to this role.
  5. Choose Add, and the user or group is now assigned to the role.

When you’re finished, you should see the name of the user or group along with the corresponding ID for that object. It is also important to note that this list can be used to remove users or groups from the role. The next time the user signs in to the AWS Management Console from the custom sign-in page, they will be signed in under the EC2ReadOnly security role.

Seamlessly join an instance to an Active Directory domain

Another advantage to using AD Connector is the ability to seamlessly join Windows (EC2) instances to your Active Directory domain. This allows you to join a Windows Server to the domain while the instance is being provisioned instead of using a script or doing it manually. This section of this blog post will explain the steps necessary to enable this feature in your environment and how the service works.

Step 1: Create a role

Until recently, you had to manually create an IAM policy to allow an EC2 instance to access SSM, an AWS service that allows you to configure Windows instances while they’re running and on first launch. Now, there’s a managed policy called AmazonEC2RoleforSSM that you can use instead. The role you are about to create will be assigned to an EC2 instance when it’s provisioned, which will grant it permission to access the SSM service.

To create the role:

  1. Open the IAM console.
  2. Click Roles in the navigation pane.
  3. Click Create Role.
  4. Type a name for your role in the Role Name field.
  5. Under AWS Service Roles, select Amazon EC2 and then click Select.
  6. On the Attach Policy page, select AmazonEC2RoleforSSM and then click Next Step.
  7. On the Review page, click Create Role.
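If you’d rather script these steps, a rough boto3 equivalent might look like the following; the role name is an example, and you would still need an instance profile to attach the role to an EC2 instance.

import json

import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Example role name; choose one that matches your naming conventions.
iam.create_role(
    RoleName="EC2DomainJoinRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

iam.attach_role_policy(
    RoleName="EC2DomainJoinRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM",
)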

If you click the role you created, you’ll see a trust policy for EC2, which looks like the following code example.

{
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "",
         "Effect": "Allow",
         "Principal": {
           "Service": "ec2.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
       }
     ]
}

Step 2: Create a new Windows instance from the EC2 console

With this role in place, you can now join a Windows instance to your domain via the EC2 launch wizard. For a detailed explanation about how to do this, see Joining a Domain Using the Amazon EC2 Launch Wizard.

If you’re instantiating a new instance from the API, however, you will need to create an SSM configuration document and upload it to the SSM service beforehand. We’ll step through that process next.

Note: The instance will require internet access to communicate with the SSM service.

Figure 7: Configure instance details

Figure 7: Configure instance details

When you create a new Windows instance from the EC2 launch wizard as shown in Figure 7, the wizard automatically creates the SSM configuration document from the information stored in AD Connector. Presently, the EC2 launch wizard doesn’t allow you to specify which organizational unit (OU) you want to deploy the member server into.

Step 3: Create an SSM document (for seamlessly joining a server to the domain through the AWS API)

If you want to provision new Windows instances from the AWS CLI or API or you want to specify the target OU for your instances, you will need to create an SSM configuration document. The configuration document is a JSON file that contains various parameters used to configure your instances. The following code example is a configuration document for joining a domain.

{
	"schemaVersion": "1.0",
	"description": "Sample configuration to join an instance to a domain",
	"runtimeConfig": {
	   "aws:domainJoin": {
	       "properties": {
	          "directoryId": "d-1234567890",
	          "directoryName": "test.example.com",
	          "directoryOU": "OU=test,DC=example,DC=com",
	          "dnsIpAddresses": [
	             "198.51.100.1",
	             "198.51.100.2"
	          ]
	       }
	   }
	}
}

In this configuration document:

  • directoryId is the ID for the AD Connector you created earlier.
  • directoryName is the name of the domain (for example, examplecompany.com).
  • directoryOU is the OU for the domain.
  • dnsIpAddresses are the IP addresses for the DNS servers you specified when you created the AD Connector.

For additional information, see aws:domainJoin. When you’re finished creating the file, save it as a JSON file.

Note: The name of the file has to be at least 1 character and at most 64 characters in length.

Step 4: Upload the configuration document to SSM

This step requires that the user have permission to use SSM to configure an instance. If you don’t have a policy that includes these rights, create a new policy by using the following JSON, and assign it to an IAM user or group.

{
   "Version": "2012-10-17",
   "Statement": [
     {
       "Effect": "Allow",
       "Action": "ssm:*",
       "Resource": "*"
     }
   ]
}

After you’ve signed in with a user that is associated with the SSM IAM policy you created, run the following command from the AWS CLI.

aws ssm create-document --content file://path/to/myconfigfile.json --name "My_Custom_Config_File"

Note: On Linux/Mac systems, you need to add a “/” at the beginning of the path (for example, file:///Users/username/temp).

This command uploads the configuration document you created to the SSM service, allowing you to reference it when creating a new Windows instance from either the AWS CLI or the EC2 launch wizard.
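If you’re working from the SDK rather than the AWS CLI, an equivalent call might look like this sketch; the file path and document name are examples.

import boto3

ssm = boto3.client("ssm")

# Example path and name; point this at the JSON configuration document you created earlier.
with open("/path/to/myconfigfile.json") as f:
    content = f.read()

ssm.create_document(Content=content, Name="My_Custom_Config_File")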

Conclusion

This blog post has shown you how you can simplify account management by federating with your Active Directory for AWS Management Console access. The post also explored how you can enable hybrid IT by using AD Connector to seamlessly join Windows instances to your Active Directory domain. Armed with this information you can create a trust between your Active Directory and AWS. In addition, you now have a quick and simple way to enable single sign-on without needing to replicate identities or deploy additional infrastructure on premises.

We’d love to hear more about how you are using Directory Service, and welcome any feedback about how we can improve the experience. You can post comments below, or visit the Directory Service forum to post comments and questions.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post for AWS Directory Service or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Jeremy Cowan

Jeremy Cowan

Jeremy is a Specialist Solutions Architect for containers at AWS, although his family thinks he sells “cloud space”. Prior to joining AWS, Jeremy worked for several large software vendors, including VMware, Microsoft, and IBM. When he’s not working, you can usually find him on a trail in the wilderness, far away from technology.

Bright Dike

Bright Dike

Bright is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on assessing and improving their security posture, as well as executing automated remediation techniques. His focus areas are threat detection, incident response, and AWS Security Hub.

David Selberg

David Selberg

David is an Enterprise Solutions Architect at AWS who is passionate about helping customers build Well-Architected solutions on the AWS cloud. With a background in cybersecurity, David loves to dive deep on security topics when he’s not creating technical content like the “All Things AWS” Twitch series.

Abhra Sinha

Abhra Sinha

Abhra is a Toronto-based Enterprise Solutions Architect at AWS. Abhra enjoys being a trusted advisor to customers, working closely with them to solve their technical challenges and help build a secure scalable architecture on AWS. In his spare time, he enjoys photography and exploring new restaurants.

How to automate the review and validation of permissions for users and groups in AWS IAM Identity Center

Post Syndicated from Yee Fei Ooi original https://aws.amazon.com/blogs/security/how-to-automate-the-review-and-validation-of-permissions-for-users-and-groups-in-aws-iam-identity-center/

AWS IAM Identity Center (successor to AWS Single Sign-On) is widely used by organizations to centrally manage federated access to their Amazon Web Services (AWS) environment. As organizations grow, it’s crucial that they maintain control of access to their environment and conduct regular reviews of existing granted permissions to maintain a good security posture. With continuous movement of users among projects and teams within an organization, there are constant updates in groups and permission sets. Given the frequency of updates, it’s important for organizations to maintain the integrity of the identity entities and promote visibility into their associated permissions within IAM Identity Center.

Performing an audit of permissions assignment through the IAM Identity Center Management Console can be an arduous and time-consuming task, especially for customers managing a significant number of AWS accounts. This blog post addresses the following concerns faced by security administrators:

  • How to maintain control over permissions and efficiently conduct thorough audits.
  • How to regularly review granted permissions to uphold the principle of least privilege.

In this blog post, we show you how to automate your IAM Identity Center users and groups permission review process with the AWS SDK and AWS serverless services. The solution also shows how to schedule the review process at your preferred frequency and generate a business-specific access and permission review report.

By using AWS serverless services and AWS SDK, you can create an automated workflow to retrieve the latest permission sets of your identities in IAM Identity Center and extract them as a report. Amazon EventBridge scheduling allows you to set customized schedules to launch the automation process. AWS Lambda functions are used in data retrieval, data transformation, and report generation, and Amazon DynamoDB tables are used for storing raw unstructured data.

We show you how to build an automated solution using AWS SDK, AWS Step Functions, Lambda, DynamoDB, EventBridge, Amazon Simple Storage Service (Amazon S3), and Amazon Simple Notification Service (Amazon SNS) to review the IAM Identity Center instance that you specify. The review includes retrieving attached permission policies (inline, AWS managed, and customer managed) based on the assigned identity.
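To give a sense of the APIs involved, the following is a minimal boto3 sketch, not the deployed Lambda code, that lists each permission set in an IAM Identity Center instance along with its attached AWS managed policies and whether it has an inline policy; the instance ARN is a placeholder.

import boto3

sso_admin = boto3.client("sso-admin")

# Placeholder; use your own IAM Identity Center instance ARN.
instance_arn = "arn:aws:sso:::instance/ssoins-xxxxxxxxxx"

paginator = sso_admin.get_paginator("list_permission_sets")
for page in paginator.paginate(InstanceArn=instance_arn):
    for ps_arn in page["PermissionSets"]:
        details = sso_admin.describe_permission_set(
            InstanceArn=instance_arn, PermissionSetArn=ps_arn
        )["PermissionSet"]
        managed = sso_admin.list_managed_policies_in_permission_set(
            InstanceArn=instance_arn, PermissionSetArn=ps_arn
        )["AttachedManagedPolicies"]
        inline = sso_admin.get_inline_policy_for_permission_set(
            InstanceArn=instance_arn, PermissionSetArn=ps_arn
        ).get("InlinePolicy", "")
        print(details["Name"], [p["Name"] for p in managed], bool(inline))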

Note: This solution will incur costs based on the AWS services used.

Prerequisites

In your own AWS environment, make sure that you have the following:

  • An IAM Identity Center instance set up in the account
  • IAM Identity Center instance metadata that you want to perform the analysis on:
    • The IAM Identity Center instance identityStoreId – example: d-xxxxxxxxxx
    • The IAM Identity Center instance instanceArn – example: arn:aws:sso:::instance/ssoins-xxxxxxxxxx
  • Access and permission to deploy the related AWS services mentioned previously in AWS CloudFormation.

    Note: This solution is expected to deploy in the account where your IAM Identity Center instance is being set up. If you want to deploy in other accounts, you need to establish cross-account access for the IAM roles of the relevant services mentioned previously.

  • AWS SAM CLI installed. You will deploy the solution using AWS Serverless Application Model (AWS SAM). To learn more about how AWS SAM works, see the AWS Serverless Application Model (AWS SAM) specification.

Solution overview

In this section, we discuss the steps to set up solution. We provide a CloudFormation template that you can use to set up the required services and Lambda functions. Figure 1 illustrates the architecture of the solution.

Figure 1: Architecture of the solution

Figure 1: Architecture of the solution

The solution is deployed using AWS SAM, which is an open-sourced framework for building serverless applications. AWS SAM helps to organize related components and operate on a single stack. When used together with the SAM CLI, it’s a useful tool for developing, testing, and building serverless applications on AWS.

To generate the report, the solution uses the following steps:

  • The EventBridge Scheduler is configured to launch the Step Functions based on the frequency of the cron job stated. The user can also manually launch the review as needed.
  • After the Step Functions are launched, the dataExtractionFunction Lambda function retrieves data from IAM Identity Center and stores it in two separate DynamoDB tables, fullPermissionSetsWithGroupTable and userWithGroupTable.
  • Step Functions will then launch the dataTransformLoadFunction Lambda function, which retrieves the data from both DynamoDB tables to perform data transformation for report generation.
  • The permission review report is stored in an S3 bucket and notification of completion is sent to the stakeholders.

Deploy the solution

  1. Make sure that you have AWS SAM CLI installed.
  2. Clone the GitHub repository. Open a CLI window and run
    git clone https://github.com/aws-samples/aws-iam-identity-center-permission-policies-analyzer.git
  3. Navigate to the root directory of the GitHub repository by running cd aws-iam-identity-center-permission-policies-analyzer
  4. Run sam deploy --guided and follow the step-by-step instructions to indicate the deployment details such as the desired CloudFormation stack name, AWS Region and other details as shown in Figure 2.
     
    Figure 2: Configure SAM deploy

    Figure 2: Configure SAM deploy

  5. As shown in Figure 2, you receive confirmation that the required resources have been created. AWS SAM creates a default S3 bucket to store the necessary resources and then proceeds to the deployment prompt. Enter y to deploy and wait for deployment to complete.
  6. After deployment is complete, you should see the following output: Successfully created/updated stack – {StackName} in {AWSRegion}. You can review the resources and stack in your CloudFormation console as shown in Figure 3.
     
    Figure 3: CloudFormation console view of deployed stack

    Figure 3: CloudFormation console view of deployed stack

    The CloudFormation template specifies a cron schedule that runs on the first day of each month at 08:00 UTC+8 by default. You can update the schedule based on your preference by following steps 7 and 8.

  7. Open the EventBridge console. In the navigation pane, under Scheduler, choose Schedules. Check the box next to {StackName}-monthlySchedule-{RandomID} and choose Edit.
     
    Figure 4: EventBridge schedule console

    Figure 4: EventBridge schedule console

  8. At Step 1, under the Schedule pattern segment, enter your preferred scheduling. To learn about the different types of EventBridge scheduling, see Schedule types on EventBridge Scheduler. For this example, you use a recurring type of schedule using cron expression. Update to your preferred schedule and time zone and choose Next.
     
    Figure 5: EventBridge Schedule edit console Step 1 – Specify schedule detail

    Figure 5: EventBridge Schedule edit console Step 1 – Specify schedule detail

  9. Check the email address you entered during the deployment stage of this solution for an email sent by [email protected], similar to what you see in Figure 6. Follow the steps in the email to confirm the Amazon SNS topic subscription.
     
    Figure 6: Example email from Amazon SNS for subscription confirmation

    Figure 6: Example email from Amazon SNS for subscription confirmation

Manually launch the review

After you’ve updated the schedule, the review process runs on the specified timing and frequency. You can manually launch the review immediately after you’ve deployed the solution, or at a time outside of the schedule on an as-needed basis.

  1. To manually launch the review, open the Step Functions console.
  2. Select the state machine monthlyUserPermissionAssessment-{randomID} and choose Start execution.
     
    Figure 7: Start execution for monthlyUserPermissionAssessment state machine

    Figure 7: Start execution for monthlyUserPermissionAssessment state machine

  3. Enter the following event pattern and choose Start execution.
    {
      "identityStoreId": "d-xxxxxxxxxx",
      "instanceArn": "arn:aws:sso:::instance/ssoins-xxxxxxxxxx",
      "ssoDeployedRegion": "YOUR_SSO_DEPLOYED_REGION" 
    }

    Note: The format and keywords of this input are important for the Step Functions workflow to run successfully.

Figure 8: Example input to start state machine execution

Figure 8: Example input to start state machine execution

When the process starts, the execution page opens and you can follow the process. The flow turns green when each step has been completed successfully. You can also review Events and check the Lambda functions or logs if you need to troubleshoot or refer to the details.

Figure 9: State machine successful execution example

Figure 9: State machine successful execution example

Notification from each successful review

After each successful execution, you should receive an email notification at the email you specified in the Amazon SNS topic. You can then retrieve the report from the S3 bucket with the bucket name {StackName}-monthlyre-{AccountID}. Your report is stored according to the object key name specified in the email. An example of the email notification is shown in Figure 10.

Figure 10: Example email notification

Figure 10: Example email notification

You can download the report in CSV format from the S3 bucket. The headers of the report are:

  • User: Username
  • PrincipalId: An identifier for an object in IAM Identity Center, such as a user or group
  • PrincipalType: USER or GROUP
  • GroupName: Group’s display name value (if PrincipalType is GROUP)
  • AccountIdAssignment: Identifier of the AWS account assigned with the specified permission set
  • PermissionSetARN: ARN of the permission set
  • PermissionSetName: Name of the permission set
  • Inline Policy: Inline policy that is attached to the permission set
  • Customer Managed Policy: Specifies the names and paths of the customer managed policies that you have attached to your permission set
  • AWS Managed Policy: Details of the AWS managed policy
  • Permission Boundary: Permission boundary details (Customer Managed Policy Reference and/or AWS managed policy ARN)

From the report, you can determine whether a user is assigned to an account individually or as part of a group, along with the corresponding permission sets. The report also includes details on inline policy, AWS managed policy, customer managed policy, and the permission boundaries attached to the permission set. Inline policies and AWS managed policies are presented in JSON format. However, for customer managed policies and permission boundaries, to keep the solution simple, the generated report provides only basic information on the policies that you’ve attached to the permission set. You can log in to the respective accounts to view the policies in full JSON format through the AWS IAM console.

[Optional] Customize the user notification email

If you want to customize the email notification subject and content, you can do so by editing the Lambda function {StackName}-dataTransformLoadFunction-{RandomID}. Scroll down to the bottom of the source code and edit the sns_message and Subject accordingly.
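Inside the function, the notification is published through Amazon SNS, so the section you edit typically resembles the following pattern; this is a generic sketch rather than the exact source of the deployed function, and the topic ARN shown is a placeholder.

import boto3

sns = boto3.client("sns")

# Placeholder values; the deployed function derives these from the generated report.
topic_arn = "arn:aws:sns:us-east-1:111122223333:permission-review-topic"
sns_message = "Your IAM Identity Center permission review report is ready in Amazon S3."

sns.publish(
    TopicArn=topic_arn,
    Subject="IAM Identity Center permission review completed",
    Message=sns_message,
)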

Figure 11: Customizing the notification email in dataTransformLoadFunction source code

Figure 11: Customizing the notification email in dataTransformLoadFunction source code

Clean up the resources

To clean up the resources that you created for this example:

  1. Empty your S3 bucket. Open the Amazon S3 console, search for the bucket name and choose Empty. Follow the instructions on screen to empty it.
  2. Delete the CloudFormation stack by either:
    1. Using the CloudFormation console to delete the stack, or
    2. Using the AWS SAM CLI to run sam delete in your terminal. Follow the instructions and enter y when prompted to delete the stack.

Conclusion

In this post, you learned how to deploy a solution that simplifies the review and analysis of IAM permissions granted to IAM Identity Center with an automated flow. You also learned about customization that you can set up to fit your team’s needs and preferences.

If you have feedback about this post, submit comments in the Comments section. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Yee Fei Ooi

Yee Fei Ooi

Yee Fei is a Solutions Architect supporting independent software vendor (ISV) customers in Singapore and is part of the Containers TFC. She enjoys helping her customers to grow their businesses and build solutions that automate, innovate, and improve efficiency.

Edmund Yeo

Edmund Yeo

Edmund is a Security Solutions Architect who helps customers build confidently and securely on AWS. He shares with customers his expertise in AWS security and advocates to build with security best practices in mind. He brings a combination of perspective and experience to help small and medium businesses in ASEAN to improve their security posture.

ZhiWei Huang

ZhiWei Huang

ZhiWei is a Financial Services Solutions Architect at AWS. He works with FSI customers across the ASEAN region, providing guidance for establishing robust security controls and networking foundations as customers build on and scale with AWS. Outside of work, he finds joy in travelling the world and spending quality time with his family.

Derive operational insights from application logs using Automated Data Analytics on AWS

Post Syndicated from Aparajithan Vaidyanathan original https://aws.amazon.com/blogs/big-data/derive-operational-insights-from-application-logs-using-automated-data-analytics-on-aws/

Automated Data Analytics (ADA) on AWS is an AWS solution that enables you to derive meaningful insights from data in a matter of minutes through a simple and intuitive user interface. ADA offers an AWS-native data analytics platform that is ready to use out of the box by data analysts for a variety of use cases. With ADA, teams can ingest, transform, govern, and query diverse datasets from a range of data sources without requiring specialist technical skills. ADA provides a set of pre-built connectors to ingest data from a wide range of sources including Amazon Simple Storage Service (Amazon S3), Amazon Kinesis Data Streams, Amazon CloudWatch, Amazon CloudTrail, and Amazon DynamoDB as well as many others.

ADA provides a foundational platform that can be used by data analysts in a diverse set of use cases including IT, finance, marketing, sales, and security. ADA’s out-of-the-box CloudWatch data connector allows data ingestion from CloudWatch logs in the same AWS account in which ADA has been deployed, or from a different AWS account.

In this post, we demonstrate how an application developer or application tester is able to use ADA to derive operational insights of applications running in AWS. We also demonstrate how you can use the ADA solution to connect to different data sources in AWS. We first deploy the ADA solution into an AWS account and set up the ADA solution by creating data products using data connectors. We then use the ADA Query Workbench to join the separate datasets and query the correlated data, using familiar Structured Query Language (SQL), to gain insights. We also demonstrate how ADA can be integrated with business intelligence (BI) tools such as Tableau to visualize the data and to build reports.

Solution overview

In this section, we present the solution architecture for the demo and explain the workflow. For the purposes of demonstration, the bespoke application is simulated using an AWS Lambda function that emits logs in Apache Log Format at a preset interval using Amazon EventBridge. This standard format can be produced by many different web servers and be read by many log analysis programs. The application (Lambda function) logs are sent to a CloudWatch log group. The historical application logs are stored in an S3 bucket for reference and for querying purposes. A lookup table with a list of HTTP status codes along with the descriptions is stored in a DynamoDB table. These three serve as sources from which data is ingested into ADA for correlation, query, and analysis. We deploy the ADA solution into an AWS account and set up ADA. We then create the data products within ADA for the CloudWatch log group, S3 bucket, and DynamoDB. As the data products are configured, ADA provisions data pipelines to ingest the data from the sources. With the ADA Query Workbench, you can query the ingested data using plain SQL for application troubleshooting or issue diagnosis.
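As a rough illustration of what the simulated application emits, a Lambda handler along the following lines could produce one Apache-style log entry per invocation; this is a hypothetical sketch, not the code from the sample repository, and the endpoints and status codes are made up for the example.

import datetime
import json
import random

def handler(event, context):
    # Build a log line in Apache Log Format and emit it to the function's CloudWatch log group.
    endpoints = ["/v1/server", "/v1/server/admin", "/v1/users"]
    status = random.choice([200, 200, 200, 404, 500, 503])
    timestamp = datetime.datetime.utcnow().strftime("%d/%b/%Y:%H:%M:%S +0000")
    line = (
        f'192.0.2.10 - - [{timestamp}] '
        f'"GET {random.choice(endpoints)} HTTP/1.1" {status} {random.randint(200, 5000)}'
    )
    # The JSON wrapper mirrors how the logs appear in CloudWatch with a message field.
    print(json.dumps({"message": line}))
    return {"statusCode": 200}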

The following diagram provides an overview of the architecture and workflow of using ADA to gain insights into application logs.

The workflow includes the following steps:

  1. A Lambda function is scheduled to be triggered at 2-minute intervals using EventBridge.
  2. The Lambda function emits logs that are stored at a specified CloudWatch log group under /aws/lambda/CdkStack-AdaLogGenLambdaFunction. The application logs are generated using the Apache Log Format schema but stored in the CloudWatch log group in JSON format.
  3. The data products for CloudWatch, Amazon S3, and DynamoDB are created in ADA. The CloudWatch data product connects to the CloudWatch log group where the application (Lambda function) logs are stored. The Amazon S3 connector connects to an S3 bucket folder where the historical logs are stored. The DynamoDB connector connects to a DynamoDB table where the status codes that are referred by the application and historical logs are stored.
  4. For each of the data products, ADA deploys the data pipeline infrastructure to ingest data from the sources. When the data ingestion is complete, you can write queries using SQL via the ADA Query Workbench.
  5. You can log in to the ADA portal and compose SQL queries from the Query Workbench to gain insights in to the application logs. You can optionally save the query and share the query with other ADA users in the same domain. The ADA query feature is powered by Amazon Athena, which is a serverless, interactive analytics service that provides a simplified, flexible way to analyze petabytes of data.
  6. Tableau is configured to access the ADA data products via ADA egress endpoints. You then create a dashboard with two charts. The first chart is a heat map that shows the prevalence of HTTP error codes correlated with the application API endpoints. The second chart is a bar chart that shows the top 10 application APIs with a total count of HTTP error codes from the historical data.

Prerequisites

For this post, you need to complete the following prerequisites:

  1. Install the AWS Command Line Interface (AWS CLI), AWS Cloud Development Kit (AWS CDK) prerequisites, TypeScript-specific prerequisites, and git.
  2. Deploy the ADA solution in your AWS account in the us-east-1 Region.
    1. Provide an admin email while launching the ADA AWS CloudFormation stack. This is needed for ADA to send the root user password. An admin phone number is required to receive a one-time password message if multi-factor authentication (MFA) is enabled. For this demo, MFA is not enabled.
  3. Build and deploy the sample application (available on the GitHub repo) solution so that the following resources can be provisioned in your account in the us-east-1 Region:
    1. A Lambda function that simulates the logging application and an EventBridge rule that invokes the application function at 2-minute intervals.
    2. An S3 bucket with the relevant bucket policies and a CSV file that contains the historical application logs.
    3. A DynamoDB table with the lookup data.
    4. Relevant AWS Identity and Access Management (IAM) roles and permissions required for the services.
  4. Optionally, install Tableau Desktop, a third-party BI provider. For this post, we use Tableau Desktop version 2021.2. There is a cost involved in using a licensed version of the Tableau Desktop application. For additional details, refer to the Tableau licensing information.

Deploy and set up ADA

After ADA is deployed successfully, you can log in using the admin email provided during the installation. You then create a domain named CW_Domain. A domain is a user-defined collection of data products. For example, a domain might be a team or a project. Domains provide a structured way for users to organize their data products and manage access permissions.

  1. On the ADA console, choose Domains in the navigation pane.
  2. Choose Create domain.
  3. Enter a name (CW_Domain) and description, then choose Submit.

Set up the sample application infrastructure using AWS CDK

The AWS CDK solution that deploys the demo application is hosted on GitHub. The steps to clone the repo and to set up the AWS CDK project are detailed in this section. Before you run these commands, be sure to configure your AWS credentials. Create a folder, open the terminal, and navigate to the folder where the AWS CDK solution needs to be installed. Run the following code:

gh repo clone aws-samples/operational-insights-with-automated-data-analytics-on-aws
cd operational-insights-with-automated-data-analytics-on-aws
npm install
npm run build
cdk synth
cdk deploy

These steps perform the following actions:

  • Install the library dependencies
  • Build the project
  • Generate a valid CloudFormation template
  • Deploy the stack using AWS CloudFormation in your AWS account

The deployment takes about 1–2 minutes and creates the DynamoDB lookup table, Lambda function, and S3 bucket containing the historical log files as outputs. Copy these values to a text editing application, such as Notepad.

Create ADA data products

We create three different data products for this demo, one for each data source that you’ll be querying to gain operational insights. A data product is a dataset (a collection of data such as a table or a CSV file) that has been successfully imported into ADA and that can be queried.

Create a CloudWatch data product

First, we create a data product for the application logs by setting up ADA to ingest the CloudWatch log group for the sample application (Lambda function). Use the CdkStack.LambdaFunction output to get the Lambda function ARN and locate the corresponding CloudWatch log group ARN on the CloudWatch console.

Then complete the following steps:

  1. On the ADA console, navigate to the ADA domain and create a CloudWatch data product.
  2. For Name, enter a name.
  3. For Source type, choose Amazon CloudWatch.
  4. Disable Automatic PII.

ADA has a feature that automatically detects personally identifiable information (PII) data during import that is enabled by default. For this demo, we disable this option for the data product because the discovery of PII data is not in the scope of this demo.

  1. Choose Next.
  2. Search for and choose the CloudWatch log group ARN copied from the previous step.
  3. Copy the log group ARN.
  4. On the data product page, enter the log group ARN.
  5. For CloudWatch Query, enter a query that you want ADA to get from the log group.

In this demo, we query the @message field because we’re interested in getting the application logs from the log group.

  1. Select how the data updates are triggered after initial import.

ADA can be configured to ingest the data from the source at flexible intervals (every 15 minutes or longer) or on demand. For the demo, we set the data updates to run hourly.

  1. Choose Next.

Next, ADA will connect to the log group and query the schema. Because the logs are in Apache Log Format, we transform the logs into separate fields so that we can run queries on the specific log fields. ADA provides four default transformations and supports custom transformation through a Python script. In this demo, we run a custom Python script to transform the JSON message field into Apache Log Format fields.
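The repository’s apache-log-extractor-transform.py script handles this; conceptually, the transformation reduces to a regex like the one in the following simplified sketch, which is an illustration rather than the actual script.

import re

# Matches the Apache Log Format fields embedded in each @message value.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<http_request>\S+) (?P<endpoint>\S+) \S+" '
    r'(?P<status_code>\d{3}) (?P<request_size>\d+)'
)

def extract_fields(message):
    """Split one Apache-format log line into named columns."""
    match = LOG_PATTERN.match(message)
    return match.groupdict() if match else {}

# Example usage with a line like the ones the sample application emits.
print(extract_fields(
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /v1/server HTTP/1.1" 500 2326'
))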

  1. Choose Transform schema.
  2. Choose Create new transform.
  3. Upload the apache-log-extractor-transform.py script from the /asset/transform_logs/ folder.
  4. Choose Submit.

ADA will transform the CloudWatch logs using the script and present the processed schema.

  1. Choose Next.
  2. In the last step, review the steps and choose Submit.

ADA will start the data processing, create the data pipelines, and prepare the CloudWatch log groups to be queried from the Query Workbench. This process will take a few minutes to complete and will be shown on the ADA console under Data Products.

Create an Amazon S3 data product

We repeat the steps to add the historical logs from the Amazon S3 data source and look up reference data from the DynamoDB table. For these two data sources, we don’t create custom transforms because the data formats are in CSV (for historical logs) and key attributes (for reference lookup data).

  1. On the ADA console, create a new data product.
  2. Enter a name (hist_logs) and choose Amazon S3.
  3. Copy the Amazon S3 URI (the text after arn:aws:s3:::) from the CdkStack.S3 output variable and navigate to the Amazon S3 console.
  4. In the search box, enter the copied text, open the S3 bucket, select the /logs folder, and choose Copy S3 URI.

The historical logs are stored in this path.

  1. Navigate back to the ADA console and enter the copied S3 URI for S3 location.
  2. For Update Trigger, select On Demand because the historical logs are updated at an unspecified frequency.
  3. For Update Policy, select Append to append newly imported data to the existing data.
  4. Choose Next.

ADA processes the schema for the files in the selected folder path. Because the logs are in CSV format, ADA is able to read the column names without requiring additional transformations. However, the columns status_code and request_size are inferred as long type by ADA. We want to keep the column data types consistent among the data products so that we can join the data tables and query the data. The column status_code will be used to create joins across the data tables.

  1. Choose Transform schema to change the data types of the two columns to string data type.

Note the highlighted column names in the Schema preview pane prior to applying the data type transformations.

  1. In the Transform plan pane, under Built-in transforms, choose Apply Mapping.

This option allows you to change the data type from one type to another.

  1. In the Apply Mapping section, deselect Drop other fields.

If this option is not disabled, only the transformed columns will be preserved and all other columns will be dropped. Because we want to retain all the columns, we disable this option.

  1. Under Field Mappings, for Old name and New name, enter status_code and for New type, enter string.
  2. Choose Add Item.
  3. For Old name and New name, enter request_size and for New data type, enter string.
  4. Choose Submit.

ADA will apply the mapping transformation on the Amazon S3 data source. Note the column types in the Schema preview pane.

  1. Choose View sample to preview the data with the transformation applied.

ADA will display the PII data acknowledgement to ensure that either only authorized users can view the data or that the dataset doesn’t contain any PII data.

  1. Choose Agree to continue to view the sample data.

Note that the schema is identical to the CloudWatch log group schema because both the current application and historical application logs are in Apache Log Format.

  1. In the final step, review the configuration and choose Submit.

ADA starts processing the data from the Amazon S3 source, creates the backend infrastructure, and prepares the data product. This process takes a few minutes depending upon the size of the data.

Create a DynamoDB data product

Lastly, we create a DynamoDB data product. Complete the following steps:

  1. On the ADA console, create a new data product.
  2. Enter a name (lookup) and choose Amazon DynamoDB.
  3. Enter the Cdk.DynamoDBTable output variable for DynamoDB Table ARN.

This table contains key attributes that will be used as a lookup table in this demo. For the lookup data, we are using the HTTP codes and long and short descriptions of the codes. You can also use PostgreSQL, MySQL, or a CSV file source as an alternative.

  1. For Update Trigger, select On-Demand.

The updates will be on demand because the lookup data is mostly for reference purposes while querying, and any changes to it can be applied in ADA using on-demand triggers.

  1. Choose Next.

ADA reads the schema from the underlying DynamoDB schema and presents the column name and type for optional transformation. We will proceed with the default schema selection because the column types are consistent with the types from the CloudWatch log group and Amazon S3 CSV data source. Having data types that are consistent across the data sources allows us to write queries to fetch records by joining the tables using the column fields. For example, the column key in the DynamoDB schema corresponds to the status_code in the Amazon S3 and CloudWatch data products. We can write queries that can join the three tables using the column name key. An example is shown in the next section.

  1. Choose Continue with current schema.
  2. Review the configuration and choose Submit.

ADA will process the data from the DynamoDB table data source and prepare the data product. Depending upon the size of the data, this process takes a few minutes.

Now we have all the three data products processed by ADA and available for you to run queries.

Use the Query Workbench to query the data

ADA allows you to run queries against the data products while abstracting the data source and making it accessible using SQL (Structured Query Language). You can write queries and join the tables just as you would query against tables in a relational database. We demonstrate ADA’s querying capability via two user scenarios. In both scenarios, we join an application log dataset to the error codes lookup table. In the first use case, we query the current application logs to identify the top 10 most accessed application endpoints along with the corresponding HTTP status codes:

--Query the top 10 Application endpoints along with the corresponding HTTP request type and HTTP status code.

SELECT logs.endpoint AS Application_EndPoint, logs.http_request AS REQUEST, count(logs.endpoint) as Endpoint_Count, ref.key as HTTP_Status_Code, ref.short as Description
FROM cw_domain.cloud_watch_application_logs logs
INNER JOIN cw_domain.lookup ref ON logs.status_code = ref.key
WHERE logs.status_code LIKE '4%' OR logs.status_code LIKE '5%'
GROUP BY logs.endpoint, logs.http_request, ref.key, ref.short
ORDER BY Endpoint_Count DESC
LIMIT 10

In the second example, we query the historical logs table to get the top 10 application endpoints with the most errors to understand the endpoint call pattern:

-- Query the historical logs to get the top 10 application endpoints with the most errors, along with an explanation of each error code.

SELECT endpoint as Application_EndPoint, count(status_code) as Error_Count, ref.long as Description FROM cw_domain.hist_logs hist
INNER JOIN cw_domain.lookup ref ON hist.status_code = ref.key
WHERE hist.status_code LIKE '4%' OR hist.status_code LIKE '5%'
GROUP BY endpoint, status_code, ref.long
ORDER BY Error_Count desc
LIMIT 10

In addition to querying, you can optionally save the query and share the saved query with other users in the same domain. The shared queries are accessible directly from the Query Workbench. The query results can also be exported to CSV format.

Visualize ADA data products in Tableau

ADA offers the ability to connect to third-party BI tools to visualize data and create reports from the ADA data products. In this demo, we use ADA’s native integration with Tableau to visualize the data from the three data products we configured earlier. Using Tableau’s Athena connector and following the steps in Tableau configuration, you can configure ADA as a data source in Tableau. After a successful connection has been established between Tableau and ADA, Tableau will populate the three data products under the Tableau catalog cw_domain.

We then establish a relationship across the three databases using the HTTP status code as the joining column, as shown in the following screenshot. Tableau allows us to work in online and offline mode with the data sources. In online mode, Tableau will connect to ADA and query the data products live. In offline mode, we can use the Extract option to extract the data from ADA and import the data into Tableau. In this demo, we import the data into Tableau to make the querying more responsive. We then save the Tableau workbook. We can inspect the data from the data sources by choosing the database and Update Now.

With the data source configurations in place in Tableau, we can create custom reports, charts, and visualizations on the ADA data products. Let’s consider two use cases for visualizations.

As shown in the following figure, we visualized the frequency of the HTTP errors by application endpoints using Tableau’s built-in heat map chart. We filtered out the HTTP status codes to only include error codes in the 4xx and 5xx range.

We also created a bar chart to depict the application endpoints from the historical logs ordered by the count of HTTP error codes. In this chart, we can see that the /v1/server/admin endpoint has generated the most HTTP error status codes.

Clean up

Cleaning up the sample application infrastructure is a two-step process. First, to remove the infrastructure provisioned for the purposes of this demo, run the following command in the terminal:

cdk destroy

For the following question, enter y and AWS CDK will delete the resources deployed for the demo:

Are you sure you want to delete: CdkStack (y/n)? y

Alternatively, you can remove the resources via the AWS CloudFormation console by navigating to the CdkStack stack and choosing Delete.
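
If you prefer to script the stack cleanup instead of using the console, the following minimal sketch does the same thing with the AWS SDK for Python. It assumes the stack is named CdkStack, as in this demo, and that your credentials are allowed to delete it.

import boto3

# Delete the demo CloudFormation stack and wait until the deletion finishes.
cloudformation = boto3.client("cloudformation")
cloudformation.delete_stack(StackName="CdkStack")

waiter = cloudformation.get_waiter("stack_delete_complete")
waiter.wait(StackName="CdkStack")
print("CdkStack deleted")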

The second step is to uninstall ADA. For instructions, refer to Uninstall the solution.

Conclusion

In this post, we demonstrated how to use the ADA solution to derive insights from application logs stored across two different data sources. We demonstrated how to install ADA on an AWS account and deploy the demo components using AWS CDK. We created data products in ADA and configured the data products with the respective data sources using the ADA’s built-in data connectors. We demonstrated how to query the data products using standard SQL queries and generate insights on the log data. We also connected the Tableau Desktop client, a third-party BI product, to ADA and demonstrated how to build visualizations against the data products.

ADA automates the process of ingesting, transforming, governing, and querying diverse datasets and simplifies the lifecycle management of data. ADA’s pre-built connectors allow you to ingest data from diverse data sources. Software teams with basic knowledge of AWS products and services can set up an operational data analytics platform in a few hours and provide secure access to the data. The data can then be easily and quickly queried using an intuitive, standalone web user interface.

Try out ADA today to easily manage and gain insights from data.


About the authors

Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS Cloud. He is a cloud architect with 23+ years of experience designing and developing enterprise, large-scale, and distributed software systems. He specializes in machine learning and data analytics, with a focus on data and feature engineering. He is an aspiring marathon runner, and his hobbies include hiking, bike riding, and spending time with his wife and two boys.

Rashim Rahman is a Software Developer based out of Sydney, Australia with 10+ years of experience in software development and architecture. He works primarily on building large scale open-source AWS solutions for common customer use cases and business problems. In his spare time, he enjoys sports and spending time with friends and family.

Hafiz Saadullah is a Principal Technical Product Manager at Amazon Web Services. Hafiz focuses on AWS Solutions, designed to help customers by addressing common business problems and use cases.

Cost considerations and common options for AWS Network Firewall log management

Post Syndicated from Sharon Li original https://aws.amazon.com/blogs/security/cost-considerations-and-common-options-for-aws-network-firewall-log-management/

When you’re designing a security strategy for your organization, firewalls provide the first line of defense against threats. Amazon Web Services (AWS) offers AWS Network Firewall, a stateful, managed network firewall that includes intrusion detection and prevention (IDP) for your Amazon Virtual Private Cloud (VPC).

Logging plays a vital role in any firewall policy, as emphasized by the National Institute of Standards and Technology (NIST) Guidelines on Firewalls and Firewall Policy. Logging enables organizations to take proactive measures to help prevent and recover from failures, maintain proper firewall security configurations, and gather insights for effectively responding to security incidents.

The optimal logging approach for your organization should be determined on a case-by-case basis. It involves striking a balance between your security and compliance requirements and the costs associated with implementing solutions to meet those requirements.

This blog post walks you through logging configuration best practices, discusses three common architectural patterns for Network Firewall logging, and provides guidelines for optimizing the cost of your logging solution. This information will help you make a more informed choice for your organization’s use case.

Stateless and stateful rules engines logging

When discussing Network Firewall best practices, it’s essential to understand the distinction between stateful and stateless rules. Note that stateless rules don’t support firewall logging, which can make them difficult to work with in use cases that depend on logs.

To verify that traffic is forwarded to the stateful inspection engine that generates logs, you can add a custom-defined stateless rule group that covers the traffic you need to monitor, or you can set a default action for stateless traffic to be forwarded to stateful rule groups in the firewall policy, as shown in the following figure.

Figure 1: Set up stateless default actions to forward to stateful rule groups
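
If you manage your firewall policy programmatically, the same default action can be set with the AWS SDK. The following sketch is illustrative only; the policy name is a placeholder, and it assumes the rest of the policy should stay unchanged.

import boto3

network_firewall = boto3.client("network-firewall")

# Read the current policy so it can be resubmitted with updated default actions.
current = network_firewall.describe_firewall_policy(
    FirewallPolicyName="demo-firewall-policy"  # placeholder policy name
)
policy = current["FirewallPolicy"]

# Forward all stateless traffic (including fragments) to the stateful engine,
# which is the engine that generates alert and flow logs.
policy["StatelessDefaultActions"] = ["aws:forward_to_sfe"]
policy["StatelessFragmentDefaultActions"] = ["aws:forward_to_sfe"]

network_firewall.update_firewall_policy(
    UpdateToken=current["UpdateToken"],
    FirewallPolicyName="demo-firewall-policy",
    FirewallPolicy=policy,
)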

Alert logs and flow logs

Network Firewall provides two types of logs:

  • Alert — Sends logs for traffic that matches a stateful rule whose action is set to Alert or Drop.
  • Flow — Sends logs for network traffic that the stateless engine forwards to the stateful rules engine.

To grasp the use cases of alert and flow logs, let’s begin by understanding what a flow is from the view of the firewall. For the network firewall, network flow is a one-way series of packets that share essential IP header information. It’s important to note that the Network Firewall flow log differs from the VPC flow log, as it captures the network flow from the firewall’s perspective and it is summarized in JSON format.

For example, the following sequence shows how an HTTP request passes through the Network Firewall.

Figure 2: HTTP request passes through Network Firewall

When you’re using a stateful rule to block egress HTTP traffic, the TCP connection will be established initially. When an HTTP request comes in, it will be evaluated by the stateful rule. Depending on the rule’s action, the firewall may send a TCP reset to the sender when a Reject action is configured, or it may drop the packets to block them if a Drop action is configured. In the case of a Drop action, shown in Figure 3, the Network Firewall decides not to forward the packets at the HTTP layer, and the closure of the connection is determined by the TCP timers on both the client and server sides.

Figure 3: HTTP request blocked by Network Firewall

In the given example, the Network Firewall generates a flow log that provides information like IP addresses, port numbers, protocols, timestamps, number of packets, and bytes of the traffic. However, it doesn’t include details about the stateful inspection, such as whether the traffic was blocked or allowed.

Figure 4 shows the inbound flow log.

Figure 4: Inbound flow log

Figure 5 shows the outbound flow log.

Figure 5: Outbound flow log

The alert log entry complements the flow log by containing stateful inspection details. The entry includes information about whether the traffic was allowed or blocked and also provides the hostname associated with the traffic. This additional information enhances the understanding of network activities and security events, as shown in Figure 6.

Figure 6: Alert log

In summary, flow logs provide stateless information and are valuable for identifying trends, like monitoring IP addresses that transmit the most data over time in your network. On the other hand, alert logs contain stateful inspection details, making them helpful for troubleshooting and threat hunting purposes.

Keep in mind that flow logs can become excessive. When you’re forwarding traffic to a stateful inspection engine, flow logs capture the network flows crossing your Network Firewall endpoints. Because log volume affects overall costs, it’s essential to choose the log type that suits your use case and security needs. If you don’t need flow logs for traffic flow trends, consider only enabling alert logs to help reduce expenses.
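
As a hedged example, the following sketch enables only alert logs and sends them to an S3 bucket by calling the UpdateLoggingConfiguration API. The firewall name, bucket name, and prefix are placeholders for this demo.

import boto3

network_firewall = boto3.client("network-firewall")

# Send only ALERT logs (no FLOW logs) to S3 to keep log volume and cost down.
network_firewall.update_logging_configuration(
    FirewallName="demo-firewall",  # placeholder firewall name
    LoggingConfiguration={
        "LogDestinationConfigs": [
            {
                "LogType": "ALERT",
                "LogDestinationType": "S3",
                "LogDestination": {
                    "bucketName": "anfw-logs-demo-bucket",  # placeholder bucket
                    "prefix": "alert-logs",
                },
            }
        ]
    },
)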

Effective logging with alert rules

When you write stateful rules using the Suricata format, set the alert rule to be evaluated before the pass rule to log allowed traffic. Be aware that:

  • You must enable strict rule evaluation order to allow the alert rule to be evaluated before the pass rule. Otherwise, the default evaluation order is pass rules first, then drop, then alert. The engine stops processing rules when it finds a match.
  • When you use pass rules, it’s recommended to add a message that reminds anyone looking at the policy that these rules don’t generate alerts. This helps when you develop and troubleshoot your rules.

For example, the following rules allow traffic to a target with a specific Server Name Indication (SNI) and log the traffic that was allowed. The pass rule includes a message to remind the firewall policy maker that pass rules don’t alert, and the alert rule, evaluated before the pass rule, logs a message that tells the log viewer which rule allowed the traffic.


alert tls $HOME_NET any -> $EXTERNAL_NET any (tls.sni; content:"www.example.com"; nocase; startswith; endswith; msg:"Traffic allowed by rule 72912"; flow:to_server, established; sid:82912;)
pass tls $HOME_NET any -> $EXTERNAL_NET any (tls.sni; content:"www.example.com"; nocase; startswith; endswith; msg:"Pass rules don't alert"; flow:to_server, established; sid:72912;)

This way you can see allowed domains in the alert logs.

Figure 7: Allowed domain in the alert log

Log destination considerations

Network Firewall supports the following log destinations:

  • Amazon Simple Storage Service (Amazon S3)
  • Amazon CloudWatch Logs
  • Amazon Kinesis Data Firehose

You can select the destination that best fits your organization’s processes. In the next sections, we review the most common pattern for each log destination and walk you through the cost considerations, assuming a scenario in which you generate 15 TB of Network Firewall logs per month in the us-east-1 Region.

Amazon S3

Network Firewall is configured to inspect traffic and send logs to an S3 bucket in JSON format using Amazon CloudWatch vended logs, which are logs published by AWS services on behalf of the customer. Optionally, logs in the S3 bucket can then be queried using Amazon Athena for monitoring and analysis purposes. You can also create Amazon QuickSight dashboards with an Athena-based dataset to provide additional insight into traffic patterns and trends, as shown in Figure 8.

Figure 8: Architecture diagram showing AWS Network Firewall logs going to S3

Cost considerations

Note that Network Firewall logging charges for the pattern above are the combined charges for CloudWatch Logs vended log delivery to the S3 buckets and for using Amazon S3.

CloudWatch vended log pricing can influence overall costs significantly in this pattern, depending on the amount of logs generated by Network Firewall, so it’s recommended that your team be aware of the charges described in Amazon CloudWatch Pricing – Amazon Web Services (AWS). From the CloudWatch pricing page, navigate to Paid Tier, choose the Logs tab, select your Region and then under Vended Logs, see the information for Delivery to S3.

For Amazon S3, go to Amazon S3 Simple Storage Service Pricing – Amazon Web Services, choose the Storage & requests tab, and view the information for your Region in the Requests & data retrievals section. Costs will be dependent on storage tiers and usage patterns and the number of PUT requests to S3.

In our example, 15 TB is converted and compressed to approximately 380 GB in the S3 bucket. The total monthly cost in the us-east-1 Region is approximately $3800.

Long-term storage

There are additional Amazon S3 features that can help you save on storage costs, such as S3 Lifecycle configurations and lower-cost storage classes like S3 Intelligent-Tiering and S3 Glacier. One option is sketched in the example that follows.
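
The following minimal sketch attaches a lifecycle configuration to the logging bucket; the bucket name, prefix, and retention periods are assumptions for this demo and should be adjusted to your own compliance requirements.

import boto3

s3 = boto3.client("s3")

# Transition Network Firewall log objects to S3 Glacier Flexible Retrieval
# after 90 days and delete them after one year.
s3.put_bucket_lifecycle_configuration(
    Bucket="anfw-logs-demo-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-network-firewall-logs",
                "Filter": {"Prefix": "alert-logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)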

Analytics and reporting

Athena and QuickSight can be used for analytics and reporting:

  • Athena can perform SQL queries directly against data in the S3 bucket where Network Firewall logs are stored. In the Athena query editor, a single query can be run to set up the table that points to the Network Firewall logging bucket.
  • After data is available in Athena, you can use Athena as a data source for QuickSight dashboards. You can use QuickSight to visualize data from your Network Firewall logs, taking advantage of AWS serverless services.
  • Please note that using Athena to scan firewall data in S3 might increase costs, as can the number of authors, users, reports, alerts, and SPICE data used in QuickSight.
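
If you want to run the Athena analysis programmatically, the following sketch starts a query with the AWS SDK. The database, table, nested column layout, and result location are assumptions for this demo, and the table is expected to have already been created over the Network Firewall logging bucket.

import boto3

athena = boto3.client("athena")

# Count blocked events per source IP from a table created over the
# Network Firewall alert logs (database and table names are placeholders).
response = athena.start_query_execution(
    QueryString="""
        SELECT event.src_ip AS source_ip, count(*) AS blocked_events
        FROM anfw_logs_db.alert_logs
        WHERE event.alert.action = 'blocked'
        GROUP BY event.src_ip
        ORDER BY blocked_events DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "anfw_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://anfw-athena-results-demo/"},
)
print(response["QueryExecutionId"])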

Amazon CloudWatch Logs

In this pattern, shown in Figure 9, Network Firewall is configured to send logs to Amazon CloudWatch as a destination. Once the logs are available in CloudWatch, CloudWatch Log Insights can be used to search, analyze, and visualize your logs to generate alerts, notifications, and alarms based on specific log query patterns.

Figure 9: Architecture diagram using CloudWatch for Network Firewall Logs

Cost considerations

Configuring Network Firewall to send logs to CloudWatch incurs charges based on the number of metrics configured, metrics collection frequency, the number of API requests, and the log size. See Amazon CloudWatch Pricing for additional details.

In our example of 15 TB logs, this pattern in the us-east-1 Region results in approximately $6900.

CloudWatch dashboards offer a way to create customized views of the metrics and alarms for your Network Firewall logs. Each dashboard incurs an additional charge of $3 per month.

Contributor Insights and CloudWatch alarms are additional ways to monitor logs for a predefined query pattern and take corrective action if needed. Contributor Insights is charged per Contributor Insights rule; to learn more, go to the Amazon CloudWatch Pricing page and, under Paid Tier, choose the Contributor Insights tab. CloudWatch alarms are charged based on the number of metric alarms configured and the number of CloudWatch Logs Insights queries analyzed; to learn more, choose the Metrics Insights tab on the same pricing page.
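
For example, the following sketch creates a metric filter on the alert log group and an alarm on the resulting metric. The log group name, filter pattern, and threshold are assumptions based on the alert log structure discussed earlier; adjust them to match your own logs.

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Emit a metric data point each time a blocked event appears in the alert log group.
logs.put_metric_filter(
    logGroupName="/AnfwDemo/Anfw/Alert",
    filterName="BlockedEvents",
    filterPattern='{ $.event.alert.action = "blocked" }',
    metricTransformations=[
        {
            "metricName": "BlockedEvents",
            "metricNamespace": "AnfwDemo",
            "metricValue": "1",
        }
    ],
)

# Alarm when more than 100 blocked events occur within 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="AnfwBlockedEventsSpike",
    Namespace="AnfwDemo",
    MetricName="BlockedEvents",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)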

Long-term storage

CloudWatch offers the flexibility to retain logs from 1 day up to 10 years. The default retention setting is Never expire, but you should consider your use case and costs before deciding on the optimal log retention period. For cost optimization, the recommendation is to move logs that need to be preserved long term or for compliance from CloudWatch to Amazon S3. Additional cost optimization can be achieved through S3 tiering. To learn more, see Managing your storage lifecycle in the S3 User Guide.

AWS Lambda with Amazon EventBridge, as shown in the following sample code, can be used to create an export task to send logs from CloudWatch to Amazon S3 based on an event rule, pattern matching rule, or scheduled time intervals for long-term storage and other use cases.

import datetime
import os

import boto3

# Log group to export and the destination bucket/prefix for the export task.
GROUP_NAME = "/AnfwDemo/Anfw/Alert"
DESTINATION_BUCKET = "cwexportlogs-blog"
PREFIX = "network-logs"
NDAYS = 1  # Export the previous day's logs.


def lambda_handler(event, context):
    # Compute the export window inside the handler so that warm Lambda
    # invocations always export the most recent day instead of reusing
    # timestamps calculated at module load time.
    now = datetime.datetime.now()
    start_date = now - datetime.timedelta(days=NDAYS)
    end_date = now - datetime.timedelta(days=NDAYS - 1)

    from_date = int(start_date.timestamp() * 1000)  # milliseconds
    to_date = int(end_date.timestamp() * 1000)

    # Organize exported objects by date, for example network-logs/2023/07/01.
    bucket_prefix = os.path.join(PREFIX, start_date.strftime("%Y/%m/%d"))

    client = boto3.client("logs")
    response = client.create_export_task(
        logGroupName=GROUP_NAME,
        fromTime=from_date,
        to=to_date,
        destination=DESTINATION_BUCKET,
        destinationPrefix=bucket_prefix,
    )
    print(response)
    return response

Figure 10 shows how EventBridge is configured to trigger the Lambda function periodically.

Figure 10: EventBridge scheduler for daily export of CloudWatch logs

Analytics and reporting

CloudWatch Logs Insights offers a rich query language that you can use to perform complex searches and aggregations on your Network Firewall log data stored in log groups, as shown in Figure 11.

The query results can be exported to a CloudWatch dashboard for visualization and operational decision-making. This helps you quickly identify patterns, anomalies, and trends in the log data and create alarms for proactive monitoring and corrective action.

Figure 11: Network Firewall logs ingested into CloudWatch and analyzed through CloudWatch Logs Insights
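
If you want to run the same kind of analysis programmatically, the following sketch starts a Logs Insights query against the alert log group with the AWS SDK. The query string and log group name are assumptions for this demo.

import time
import boto3

logs = boto3.client("logs")

# Top talkers among blocked events over the last hour (assumed log group).
query_id = logs.start_query(
    logGroupName="/AnfwDemo/Anfw/Alert",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, event.src_ip, event.alert.signature "
        '| filter event.alert.action = "blocked" '
        "| stats count() as hits by event.src_ip "
        "| sort hits desc | limit 10"
    ),
)["queryId"]

# Poll until the query finishes, then print the results.
while True:
    results = logs.get_query_results(queryId=query_id)
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)
print(results["results"])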

Amazon Kinesis Data Firehose

For this destination option, Network Firewall sends logs to Amazon Kinesis Data Firehose. From there, you can choose the destination for your logs, including Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and an HTTP endpoint that’s owned by you or your third-party service providers. The most common approach for this option is to deliver logs to OpenSearch, where you can index log data, visualize, and analyze using dashboards as shown in Figure 12.

In the blog post How to analyze AWS Network Firewall logs using Amazon OpenSearch Service, you learn how to build network analytics and visualizations using OpenSearch in detail. Here, we discuss only some cost considerations of using this pattern.

Figure 12: Architecture diagram showing AWS Network Firewall logs going to OpenSearch

Cost considerations

The charge when using Kinesis Data Firehose as a log destination is for CloudWatch Logs vended log delivery. Ingestion pricing is tiered and billed per GB ingested in 5 KB increments. See Amazon Kinesis Data Firehose Pricing under Vended Logs as source. There are no additional Kinesis Data Firehose charges for delivery unless optional features are used.

For 15 TB of log data, the cost of CloudWatch delivery and Kinesis Data Firehose ingestion is approximately $5400 monthly in the us-east-1 Region.

The cost for Amazon OpenSearch Service is based on three dimensions:

  • Instance hours, which are the number of hours that an instance is available to you for use
  • The amount of storage you request
  • The amount of data transferred in and out of OpenSearch Service

Storage pricing depends on the storage tier and type of instance that you choose. See pricing examples of using OpenSearch Service. When creating your OpenSearch domain, see Sizing Amazon OpenSearch Service domains to help you right-size your OpenSearch domain. Other cost optimization best practices include choosing the right storage tier and using AWS Graviton2 instances to improve performance.

For instance, allocating approximately 15 TB of UltraWarm storage in the us-east-1 Region will result in a monthly cost of $4700. Keep in mind that in addition to storage costs, you should also account for compute instances and hot storage.

In short, the estimated total cost for log ingestion and storage in the us-east-1 Region for this pattern is at least $10,100.

Using OpenSearch enables you to promptly investigate, detect, analyze, and respond to security threats.

Summary

The following list summarizes the expenses and advantages of each solution. Because storing logs is a fundamental aspect of log management, we use the monthly cost of using Amazon S3 as the log delivery destination as the baseline for these comparisons.

  • Amazon S3, Athena, and QuickSight (1x the baseline cost): The most economical option for log analysis. This solution requires security engineers to have a good analytics skillset; familiarity with Athena queries and query running time will affect both incident response time and cost.
  • Amazon CloudWatch (1.8x the baseline cost): Log analysis, dashboards, and reporting can be implemented from the CloudWatch console, with no additional service needed. This solution requires security engineers to be comfortable with CloudWatch Logs Insights query syntax; query run time will affect both incident response time and cost.
  • Amazon Kinesis Data Firehose and OpenSearch (2.7x the baseline cost or more): Lets you investigate, detect, analyze, and respond to security threats quickly with OpenSearch. This solution requires you to invest in managing the OpenSearch cluster.

You have the flexibility to select distinct solutions for flow logs and alert logs based on your requirements. For flow logs, opting for Amazon S3 as the destination offers a cost-effective approach. On the other hand, for alert logs, using the Kinesis Data Firehose and OpenSearch solution allows for quick incident response. Minimizing the time required to address ongoing security challenges can translate to reduced business risk at different costs.

Conclusion

This blog post has explored various patterns for Network Firewall log management, highlighting the cost considerations associated with each approach. While cost is a crucial factor in designing an efficient log management solution, it’s important to consider other factors such as real-time requirements, solution complexity, and ownership. Ultimately, the key is to adopt a log management pattern that aligns with your operational needs and budgetary constraints. Network security is an iterative practice, and by optimizing your log management strategy, you can enhance your overall security posture while effectively managing costs.

For more information about working with Network Firewall, see What is AWS Network Firewall?

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Sharon Li

Sharon is an Enterprise Solutions Architect at Amazon Web Services based in Boston, with a passion for designing and building secure workloads on AWS. Prior to her current role at AWS, Sharon worked as a software development engineer at Amazon, where she played a key role in bringing security into the development process.

Larry Tewksbury

Larry is an AWS Technical Account Manager based in New Hampshire. He works with enterprise customers in the Northeast to understand, scale, and optimize their cloud operations. Outside of work, he enjoys spending time with his family, hiking, and tech-based hobbies.

Shashidhar Makkapati

Shashidhar is an Enterprise Solutions Architect at Amazon Web Services, based in Charlotte, NC. With over two decades of experience as an enterprise architect, he has a keen focus on cloud adoption and digital transformation in the financial services industry. Shashidhar supports enterprise customers in the US Northeast. In his free time, he enjoys reading, traveling, and spending time with his family.

Configure SAML federation for Amazon OpenSearch Serverless with Okta

Post Syndicated from Aish Gunasekar original https://aws.amazon.com/blogs/big-data/configure-saml-federation-for-amazon-opensearch-serverless-with-okta/

Modern applications apply security controls across many systems and their subsystems. Keeping all of these systems in sync would be a major undertaking if you tried to implement it separately. Centralized identity management is the way to maintain a single identity provider (IdP) that can authenticate actors and manage and distribute their rights.

OpenSearch is an open-source search and analytics suite that enables you to ingest, store, analyze, and visualize full text and log data. Amazon OpenSearch Serverless makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud, freeing you from the undifferentiated heavy lifting of sizing, scaling, and operating an OpenSearch cluster. When you use OpenSearch Serverless, you can integrate with your existing Security Assertion Markup Language 2.0 (SAML)-compliant IdP to provide granular access control for your OpenSearch Serverless collections. Our customers use a variety of IdPs, including AWS IAM Identity Center (successor to AWS SSO), Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0.

In this post, you will learn how to use Okta as your IdP and integrate it with OpenSearch Serverless to securely manage your users and groups for secure access to your data.

Solution overview

The flow of access requests is depicted in the following figure.

When you navigate to OpenSearch Dashboards, the workflow steps are as follows:

  1. OpenSearch Serverless generates a SAML authentication request.
  2. OpenSearch Serverless redirects your request back to the browser.
  3. The browser redirects to the Okta URL via the Okta application setup.
  4. Okta parses the SAML request, authenticates the user, and generates a SAML response.
  5. Okta returns the encoded SAML response to the browser.
  6. The browser sends the SAML response back to the OpenSearch Serverless Assertion Consumer Services (ACS) URL.
  7. ACS verifies the SAML response and logs in the user with the permissions defined in the data access policy.

Prerequisites

Complete the following prerequisite steps:

  1. Create an OpenSearch Serverless collection. For instructions, refer to Preview: Amazon OpenSearch Serverless – Run Search and Analytics Workloads without Managing Clusters.
  2. Make a note of your AWS account ID to use while configuring your application in Okta.
  3. Create an Okta account, which you will use as an IdP.
  4. Create users and a group in Okta:
    1. Log in to your Okta account, and in the navigation pane, choose Directory, then choose Groups.
    2. Choose Add Group and name it opensearch-serverless, then choose Save.
    3. Choose Assign People to add users.
    4. You can add users to the opensearch-serverless group by choosing the plus sign next to the user name, or you can choose Add All.
    5. Add your users, then choose Save.
    6. To create new users, choose People in the navigation pane under Directory, then choose Add Person.
    7. Provide your first name, last name, user name (email ID), and primary email address.
    8. For Password, choose Set by admin and First-time password.
    9. To create your user, choose Save.
    10. In the navigation pane, choose Groups, then choose the opensearch-serverless group you created earlier.

The following graphic gives a quick demonstration of setting up a user and group.

Configure an application in Okta

To configure an application in Okta, complete the following steps:

  1. Navigate to the Applications page on the Okta console.
  2. Choose App Integration, select SAML 2.0 web application, then choose Next.
  3. For Name, enter a name for the app (for example, myweblogs), then choose Next.
  4. Under Application ACS URL, enter the URL using the format https://collection.<REGION>.aoss.amazonaws.com/_saml/acs (replace <REGION> with the corresponding Region) to generate the IdP metadata.
  5. Select Use this for Recipient URL and Destination URL to use the same ACS URL as the recipient and destination.
  6. Specify aws:opensearch:<AWS-Account-ID> under Audience URI (SP Entity ID). This specifies who the assertion is intended for within the SAML assertion.
  7. Under Group Attribute Statements, enter a name that is relevant to your application, such as mygroup, and select unspecified as the name format. (Don’t forget this name, you’ll need it later.)
  8. Select equals as the filter and enter opensearch-serverless.
  9. Select I’m a software vendor. I’d like to integrate my app with Okta and choose Finish.
  10. After an app is created, choose the sign-on tab, scroll down to the metadata details, and copy the value for Metadata URL.

The following graphic gives a quick demonstration of setting up an application in Okta via the preceding steps.

Next, you associate the users and groups to the application that you created in the previous step.

  1. On the Applications page, choose the app you created earlier.
  2. On the Assignments tab, choose Assign.
  3. Select Assign To Groups and choose the group you wish to assign to (opensearch-serverless in this case).
  4. Choose Done.

The following graphic gives a quick demonstration of assigning groups to the application via the preceding steps.

Set up SAML on OpenSearch Serverless

In this section, you create a SAML provider that you’ll use for your OpenSearch Serverless collection. Complete the following steps:

  1. Open the OpenSearch Serverless console on a new tab.
  2. In the navigation pane, under Serverless, choose SAML authentication.
  3. Select Add SAML provider.
  4. Provide a recognizable name (for example, okta) and a description.
  5. Open a new tab and enter the copied metadata URL into your browser.

You should see the metadata for the Okta application.

  1. Take note of this metadata and copy it to your clipboard.
  2. On the OpenSearch Service console tab, enter this metadata in the Provide metadata from your IdP section.
  3. Under Additional settings, enter mygroup or the group attribute provided in the Okta configuration.
  4. Choose Create a SAML provider.

The SAML provider has now been created.

The following graphic gives a quick demonstration of setting up the SAML provider in OpenSearch Serverless via the preceding steps.
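
If you prefer to automate this step, OpenSearch Serverless also exposes SAML configuration through its API. The following is a minimal sketch, assuming the CreateSecurityConfig operation with the parameter names shown here and an Okta metadata XML document you have already downloaded; verify the exact API shape against the current SDK documentation before relying on it.

import boto3

aoss = boto3.client("opensearchserverless")

# Read the IdP metadata XML downloaded from the Okta metadata URL.
with open("okta-metadata.xml") as f:  # placeholder file name
    metadata_xml = f.read()

# Create a SAML security configuration that uses the Okta group attribute.
response = aoss.create_security_config(
    name="okta",
    type="saml",
    samlOptions={
        "metadata": metadata_xml,
        "groupAttribute": "mygroup",  # must match the group attribute set in Okta
        "sessionTimeout": 60,         # minutes
    },
)
print(response)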

Update the data access policy

You need to configure the right permissions in the data access policies associated with your OpenSearch collection so your Okta group members can access the OpenSearch Dashboards endpoint.

  1. On the OpenSearch Serverless console, open your collection.
  2. Choose the data access policy associated with the collection in the Data Access section.
  3. Choose Edit.
  4. Choose Principals and Add a SAML principal.
  5. Select the SAML provider you created earlier and enter group/opensearch-serverless next to it.
  6. The OpenSearch Dashboards endpoint can be accessed by all group members. You can grant access to collections, indexes, or both.
  7. Choose Save.
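
As an alternative to the console steps above, the data access policy can also be created through the OpenSearch Serverless API. The following sketch is illustrative only: the collection name, account ID, and policy name are placeholders, the permissions shown should be adjusted to what you actually want to grant, and the SAML principal format reflects the okta provider and opensearch-serverless group used in this post.

import json
import boto3

aoss = boto3.client("opensearchserverless")

# Grant the Okta group read access to the collection and its indexes.
policy_document = [
    {
        "Rules": [
            {
                "ResourceType": "collection",
                "Resource": ["collection/my-collection"],  # placeholder collection
                "Permission": ["aoss:DescribeCollectionItems"],
            },
            {
                "ResourceType": "index",
                "Resource": ["index/my-collection/*"],      # placeholder collection
                "Permission": ["aoss:ReadDocument", "aoss:DescribeIndex"],
            },
        ],
        "Principal": ["saml/123456789012/okta/group/opensearch-serverless"],
    }
]

aoss.create_access_policy(
    name="okta-group-access",  # placeholder policy name
    type="data",
    policy=json.dumps(policy_document),
)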

Log in to OpenSearch Dashboards

Now that you have set permissions to access the dashboards, choose the Dashboards URL under the general information for the OpenSearch Serverless collection. This should take you to the website
https://collection-endpoint/_dashboards/

You will see a list with all the access options. Choose the SAML provider that you created (okta in this case) and log in using your Okta credentials. You will now be logged into OpenSearch Dashboards with the permissions that are part of the data access policy. You can perform searches or create visualizations from the dashboard.

Clean up

To avoid unwanted charges, delete the OpenSearch Serverless collection, data access policy, and SAML provider created as part of this demonstration.

Summary

In this post, you learned how to set up Okta as an IdP to access OpenSearch Dashboards using SAML. You also learned how to set up users and groups within Okta and configure their access to OpenSearch Dashboards. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to know more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.


About the Authors

Aish Gunasekar is a Specialist Solutions architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Perform time series forecasting using Amazon Redshift ML and Amazon Forecast

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/perform-time-series-forecasting-using-amazon-redshift-ml-and-amazon-forecast/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Many businesses use different software tools to analyze historical data and past patterns to forecast future demand and trends to make more accurate financial, marketing, and operational decisions. Forecasting acts as a planning tool to help enterprises prepare for the uncertainty that can occur in the future.

Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply machine learning (ML) models using familiar SQL commands in Amazon Redshift.

With Redshift ML, you can take advantage of Amazon SageMaker, a fully managed ML service, without learning new tools or languages. Simply use SQL statements to create and train SageMaker ML models using your Redshift data and then use these models to make predictions. For more information on how to use Redshift ML, refer to Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

With Redshift ML, you can now use Amazon Forecast, an ML-based time series forecasting service, without learning any new tools or having to create pipelines to move your data. You can use SQL statements to create and train forecasting models from your time series data in Amazon Redshift and use these models to generate forecasts about revenue, inventory, resource usage, or demand forecasting in your queries and reports.

For example, businesses use forecasting to do the following:

  • Use resources more efficiently
  • Time the launch of new products or services
  • Estimate recurring costs
  • Predict future events like sales volumes and earnings

In this post, we demonstrate how you can create forecasting models using Redshift ML and generate future forecasts using simple SQL commands.

When you use forecasting in Amazon Redshift, Redshift ML uses Forecast to train the forecasting model and to generate forecasts. You pay only the associated Forecast costs. There are no additional costs associated with Amazon Redshift for creating or using Forecast models to generate predictions. View Amazon Forecast pricing for details.

Solution overview

Amazon Forecast is a fully managed time series forecasting service based on machine learning. Forecast uses different ML algorithms to perform complex ML tasks for your datasets. Using historical data, Forecast automatically trains multiple algorithms and produces a forecasting model, also known as a predictor. Amazon Redshift provides a simple SQL command to create forecasting models. It seamlessly integrates with Forecast to create a dataset, predictor, and forecast automatically without you worrying about any of these steps. Redshift ML supports target time series data and related time series data.

As the following diagram demonstrates, Amazon Redshift will call Forecast, and data needed for Forecast model creation and training will be pushed from Amazon Redshift to Forecast through Amazon Simple Storage Service (Amazon S3). When the model is ready, it can be accessed using SQL from within Amazon Redshift using any business intelligence (BI) tool. In our case, we use Amazon Redshift Query Editor v2.0 to create forecast tables and visualize the data.

To show this capability, we demonstrate two use cases:

  • Forecast electricity consumption by customer
  • Predict bike sharing rentals

What is time series data?

Time series data is any dataset that collects information at various time intervals. This data is distinct because it orders data points by time. Time series data is plottable on a line graph and such time series graphs are valuable tools for visualizing the data. Data scientists use them to identify forecasting data characteristics.

Time series forecasting is a data science technique that uses machine learning and other computer technologies to study past observations and predict future values of time series data.

Prerequisites

Complete the following prerequisites before starting:

  1. Make sure you have an Amazon Redshift Serverless endpoint or a Redshift cluster.
  2. Have access to Amazon Redshift Query Editor v2.
  3. On the Amazon S3 console, create an S3 bucket that Redshift ML uses for uploading the training data that Forecast uses to train the model.
  4. Create an AWS Identity and Access Management (IAM) role. For more information, refer to Creating an IAM role as the default.

Although it’s easy to get started with AmazonS3FullAccess, AmazonForecastFullAccess, AmazonRedshiftAllCommandsFullAccess, and AmazonSageMakerFullAccess, we recommend using the minimal policy that we have provided (if you already have an existing IAM role, just add it to that role). If you need to use AWS Key Management Service (AWS KMS) or VPC routing, refer to Cluster and configure setup for Amazon Redshift ML administration.

To use Forecast, you need to have the AmazonForecastFullAccess policy. For more restrictive IAM permissions, you can use the following IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "forecast:DescribeDataset",
                "forecast:DescribeDatasetGroup",
                "forecast:DescribeAutoPredictor",
                "forecast:CreateDatasetImportJob",
                "forecast:CreateForecast",
                "forecast:DescribeForecast",
                "forecast:DescribeForecastExportJob",
                "forecast:CreateMonitor",
                "forecast:CreateForecastExportJob",
                "forecast:CreateAutoPredictor",
                "forecast:DescribeDatasetImportJob",
                "forecast:CreateDatasetGroup",
                "forecast:CreateDataset",
                "forecast:TagResource",
                "forecast:UpdateDatasetGroup"
            ],
            "Resource": "*"
        } ,
		{
			"Effect": "Allow",
			"Action": [
				"iam:PassRole"
			],
			"Resource":"arn:aws:iam::<aws_account_id>:role/service-role/<Amazon_Redshift_cluster_iam_role_name>"
		}
    ]
}

To allow Amazon Redshift and Forecast to assume the role to interact with other services, add the following trust policy to the IAM role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "redshift.amazonaws.com",
           "redshift-serverless.amazonaws.com",
           "forecast.amazonaws.com",
           "sagemaker.amazonaws.com" 
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Use case 1: Forecast electricity consumption

In our first use case, we demonstrate forecasting electricity consumption for individual households. Predicting or forecasting usage could help utility companies better manage their resources and stay ahead in planning distribution and supply. Typically, utility companies use software tools to perform the forecasting and perform a lot of steps to create the forecasting data. We show you how to use the data in your Redshift data warehouse to perform predictive analysis or create forecasting models.

For this post, we use a modified version of the individual household electric power consumption dataset. For more information, see ElectricityLoadDiagrams20112014 Data Set (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).

Prepare the data

Refer to the following notebook for the steps needed to create this use case.

Using Query Editor V2, connect to your cluster and open a new notebook.

The data contains measurements of electric power consumption in different households for the year 2014. We aggregated the usage data hourly. Each row represents the total electricity usage for a given household at an hourly granularity.

For our use case, we use a subset of the source data’s attributes:

  • Usage_datetime – Electricity usage time
  • Consumptioninkw – Hourly electricity consumption data in kW
  • Customer_id – Household customer ID

Create the table electricity_consumption and load data using the COPY command:

CREATE TABLE electricity_consumption
(usage_datetime timestamp, 
consumptioninkw float, 
customer_id varchar(24)
);

COPY electricity_consumption
FROM 's3://redshift-blogs/amazon-forecast-blog/electricityusagedata/electricityusagedata.csv'
IAM_ROLE default
REGION 'us-east-1' delimiter ',' IGNOREHEADER 1;

You can verify the dataset by running a SQL query on your table.

As you can see, the dataset has electricity consumption in the target field (consumptioninkw) at hourly intervals for individual consumers (customer_id).

Create a forecasting model in Redshift ML

We use the Create Model command to create and train a forecast model. For our forecasting dataset, we use electricity consumption data within the FROM clause. See the following code:

CREATE MODEL forecast_electricity_consumption
FROM electricity_consumption 
TARGET consumptioninkw 
IAM_ROLE 'arn:aws:your-IAM-Role'
AUTO ON MODEL_TYPE FORECAST
SETTINGS (S3_BUCKET 'your-S3-bucket-name',
 HORIZON 24,
 FREQUENCY 'H',
 S3_GARBAGE_COLLECT OFF);

Here, the model name is forecast_electricity_consumption. We use the following settings to create the model:

  • Target – The name of the field for prediction.
  • HORIZON – The number of time steps in the future to forecast.
  • FREQUENCY – The forecast frequency, which should match the input frequency in our case (H meaning hourly). Other acceptable frequency values are Y | M | W | D | H | 30min | 15min | 10min | 5min | 1min. For more details, refer to CREATE MODEL with Forecast.

The Create Model command must include one VARCHAR (customer_id) and a timestamp dimension (usage_datetime). All other related time series feature data must be INT or FLOAT data types.

For the Redshift ML forecasting model, make sure that when you issue a CREATE MODEL statement, you specify MODEL_TYPE as FORECAST. When Redshift ML trains a model or predictor on Forecast, it has a fixed forecast, meaning there is not a physical model to compile and run. Therefore, an inference function is not needed for Forecast models. Instead, we show you how you can pull an exported forecast from the training output location in Amazon S3 into a table locally in your Redshift data warehouse.

When using Forecast, the create model command is run in synchronous mode. This means that after the command is run, it will take 10–15 minutes to set up the required Forecast artifacts. The model will then start training in asynchronous mode, meaning that the training is done behind the scenes by Forecast. You can check when the model training is complete by running the show model command:

SHOW MODEL forecast_electricity_consumption;

The following screenshot shows the results.

The model is trained and deployed when status is shown as READY. If you see TRAINING, that means the model is still training and you need to wait for it to complete.
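
If you want to check the training status outside the query editor, one option is the Amazon Redshift Data API. The following sketch is illustrative only; the workgroup and database names are placeholders, and it simply runs SHOW MODEL and prints the returned key/value rows so you can look for the Model State entry.

import time
import boto3

redshift_data = boto3.client("redshift-data")

# Run SHOW MODEL through the Redshift Data API (serverless workgroup assumed).
statement = redshift_data.execute_statement(
    WorkgroupName="default-workgroup",  # placeholder workgroup
    Database="dev",                     # placeholder database
    Sql="SHOW MODEL forecast_electricity_consumption;",
)

# Wait for the statement to finish, then print the key/value pairs.
while True:
    description = redshift_data.describe_statement(Id=statement["Id"])
    if description["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

result = redshift_data.get_statement_result(Id=statement["Id"])
for row in result["Records"]:
    print([col.get("stringValue") for col in row])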

Generate a forecast

After a model has finished training, you can run a simple create table as command to instantiate all the forecast results into a table. This command will get all the forecast results from the S3 bucket where Forecast exported them.

Create the table locally and load the data in the new table:

CREATE TABLE forecast_electricity_predictions AS SELECT FORECAST(forecast_electricity_consumption);

Here, FORECAST is a function that takes your model’s name as input.

Next, check the forecasted data for the next 24 hours:

Select * from forecast_electricity_predictions;

The following screenshot shows the results.

As shown in the preceding screenshot, our forecast is generated for 24 hours because the HORIZON and FREQUENCY parameters were defined as 24 and 'H' (hourly) at model creation and training time, and they can’t be changed after the model is trained.

Use case 2: Predict bike sharing rentals

Redshift ML supports historical related time series (RTS) datasets. Historical RTS datasets contain data points up to the forecast horizon, and don’t contain any data points within the forecast horizon.

For this use case, we use a modified version of the Bike Sharing Dataset (Fanaee-T,Hadi. (2013). Bike Sharing Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5W894).

Our time series dataset contains the event_timestamp and item_id dimensions. It also contains additional attributes, including season, holiday, temperature, and workingday. These features are RTS because they may impact the no_of_bikes_rented target attribute.

For this post, we only include the workingday feature as RTS to help forecast the no_of_bikes_rented target. Based on the following chart, we can see a correlation: the number of bikes rented has a direct relationship with whether it is a working day.

Prepare the data

Refer to the following notebook for the steps needed to create this use case.

Load the dataset into Amazon Redshift using the following SQL. You can use Query Editor v2 or your preferred SQL tool to run these commands.

To create the table, use the following commands:

create table bike_sampledata
(
event_timestamp timestamp,
season float , 
holiday float , 
workingday float , 
weather float , 
temperature float , 
atemperature float, 
humidity float , 
windspeed float , 
casual float , 
registered float , 
no_of_bikes_rented float,
item_id varchar(255)
);

To load data into Amazon Redshift, use the following COPY command:

copy bike_sampledata
from 's3://redshift-blogs/amazon-forecast-blog/bike-data/bike.csv'
IAM_ROLE default
format as csv
region 'us-east-1';

Create a model in Redshift ML using Forecast

For this example, we are not considering any other RTS features and the goal is to forecast the number of bike rentals for the next 24 hours by accounting for the working day only. You can perform analysis and include additional RTS features in the SELECT query as desired.

Run the following SQL command to create your model—note our target is no_of_bikes_rented, which contains the number of total rentals, and we use item_id, event_timestamp, and workingday as inputs from our training set:

CREATE MODEL forecast_bike_consumption 
FROM (
     select
     s.item_id , s.event_timestamp, s.no_of_bikes_rented, s.workingday
     from     
     bike_sampledata s
     )
TARGET no_of_bikes_rented
IAM_ROLE 'arn:aws:your-IAM-Role'
AUTO ON MODEL_TYPE FORECAST
OBJECTIVE 'AverageWeightedQuantileLoss'
SETTINGS (S3_BUCKET 'your-s3-bucket-name',
          HORIZON 24,
          FREQUENCY 'H',
          PERCENTILES '0.25,0.50,0.75,mean',
          S3_GARBAGE_COLLECT ON);

The Create Model command must include one VARCHAR (item_id) and a timestamp dimension (event_timestamp). All other RTS feature data must be INT or FLOAT data types.

The OBJECTIVE parameter specifies a metric to minimize or maximize the objective of a job. For more details, refer to AutoMLJobObjective.

As in the previous use case, the Create Model command will take 10–15 minutes to set up the required Forecast artifacts and then will start the training in asynchronous mode so model training is done behind the scenes by Forecast. You can check if the model is in the Ready state by running the show model command:

SHOW MODEL forecast_bike_consumption;

Generate predictions

After a model has finished training, you can run a Create table command to instantiate all the forecast results into a table. This command gets all the forecast results from the S3 bucket where Forecast exported them.

Create the table locally and load the data in the new table:

CREATE TABLE forecast_bike_consumption_results 
AS SELECT FORECAST(forecast_bike_consumption);

Run the following SQL to inspect the generated forecast results:

select * from forecast_bike_consumption_results;

To visualize the data to help us understand it more, select Chart. For the X axis, choose the time attribute and for the Y axis, choose mean.

You can also visualize all three forecasts together to understand the differences between them:

  1. Choose Trace and choose Time for the X axis and p50 for the Y axis.
  2. Choose Trace again and choose Time for the X axis and p75 for the Y axis.
  3. Edit the chart title and legend and provide suitable labels.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the Redshift Serverless workgroup or namespace you have for this post (this will also drop all the objects created).
  2. If you used an existing Redshift Serverless workgroup or namespace, use the following code to drop these objects:
    DROP TABLE forecast_electricity_predictions;
    DROP MODEL forecast_electricity_consumption;
    DROP TABLE electricity_consumption;
    DROP TABLE forecast_bike_consumption_results;
    DROP MODEL forecast_bike_consumption;
    DROP TABLE bike_sampledata;

Conclusion

Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Redshift ML to gain business insights for your data.

With Forecast, you can use time series data and related data to forecast different business outcomes using familiar Amazon Redshift SQL commands.

We encourage you to start using this amazing new feature and give us your feedback. For more details, refer to CREATE MODEL with Forecast.


About the authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS, based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytics solutions.

Nikos Koulouris is a Software Development Engineer at AWS. He received his PhD from University of California, San Diego and he has been working in the areas of databases and analytics.

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/configure-cross-region-table-access-with-the-aws-glue-catalog-and-aws-lake-formation/

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. Companies also have employees and do business across multiple geographic regions and even around the world. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions.

The AWS Glue Data Catalog and AWS Lake Formation recently announced support for cross-Region table access. This feature lets users query AWS Glue databases and tables in one Region from another Region using resource links, without copying the metadata in the Data Catalog or the data in Amazon Simple Storage Service (Amazon S3). A resource link is a Data Catalog object that is a link to a database or table.

The AWS Glue Data Catalog is a centralized repository of technical metadata that holds the information about your datasets in AWS, and can be queried using AWS analytics services such as Amazon Athena, Amazon EMR, and AWS Glue for Apache Spark. The Data Catalog is localized to every Region in an AWS account, requiring users to replicate the metadata and the source data in S3 buckets for cross-Region queries. With the newly launched feature for cross-Region table access, you can create a resource link in any Region pointing to a database or table of the source Region. With the resource link in the local Region, you can query the source Region’s tables from Athena, Amazon EMR, and AWS Glue ETL in the local Region.

You can use the cross-Region table access feature of the Data Catalog in combination with the permissions management and cross-account sharing capability of Lake Formation. Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. By using cross-Region access support for Data Catalog, together with governance provided by Lake Formation, organizations can discover and access data across Regions without spending time making copies. Some businesses might have restrictions to run their compute in certain Regions. Organizations that need to share their Data Catalog with businesses that have such restrictions can now create and share cross-Region resource links.

In this post, we walk you through configuring cross-Region database and table access in two scenarios. In the first scenario, we go through an example where a customer wants to access an AWS Glue database in Region A from Region B in the same account. In scenario two, we demonstrate cross-account and cross-Region access where a customer wants to share a database in Region A across accounts and access it from Region B of the recipient account.

Scenario 1: Same account use case

In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region within the same AWS account. For our illustrations, we have a sample dataset in an S3 bucket in the us-east-2 Region and have used an AWS Glue crawler to crawl and catalog the dataset into a database in the Data Catalog of the us-east-2 Region. We share this dataset to the us-west-2 Region. You can use any of your datasets to follow along. The following diagram illustrates the architecture for cross-Region sharing within the same AWS account.

Prerequisites

To set up cross-Region sharing of a Data Catalog database for scenario 1, we recommend the following prerequisites:

  • An AWS account that is not used for production use cases.
  • Lake Formation set up already in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. For example, we are using a data lake administrator role called LF-Admin. The LF-Admin role also has the AWS Identity and Access Management (IAM) permission iam:PassRole on the AWS Glue crawler role. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.
  • A sample database in the Data Catalog with a few tables. For example, our sample database is called salesdb_useast2 and has a set of eight tables, as shown in the following screenshot.

Set up permissions for us-east-2

Complete the following steps to configure permissions in the us-east-2 Region:

  1. Log in to the Lake Formation console and choose the Region where your database resides. In our example, it is us-east-2 Region.
  2. Grant SELECT and DESCRIBE permissions to the LF-Admin role on all tables of the database salesdb_useast2.
  3. You can confirm that the permissions are working by querying the database and tables from Athena as the data lake administrator role.

Set up permissions for us-west-2

Complete the following steps to configure permissions in the us-west-2 Region:

  1. Choose the us-west-2 Region on the Lake Formation console.
  2. Add LF-Admin as a data lake administrator and grant Create database permission to LF-Admin.
  3. In the navigation pane, under Data catalog, select Databases.
  4. Choose Create database and select Resource link.
  5. Enter rl_salesdb_from_useast2 as the name for the resource link.
  6. For Shared database’s region, choose US East (Ohio).
  7. For Shared database, choose salesdb_useast2.
  8. Choose Create.

This creates a database resource link in us-west-2 pointing to the database in us-east-2.
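If you prefer to automate this step, the following is a minimal boto3 sketch of the same resource link creation. The account ID is a placeholder, and the sketch assumes the Glue CreateDatabase API’s TargetDatabase structure accepts the Region field introduced with cross-Region table access.

import boto3

# Run this in the local (consumer) Region, us-west-2 in our example.
glue = boto3.client("glue", region_name="us-west-2")

# Create a resource link that points at the source database in us-east-2.
# Replace <account-id> with your own AWS account ID.
glue.create_database(
    DatabaseInput={
        "Name": "rl_salesdb_from_useast2",
        "TargetDatabase": {
            "CatalogId": "<account-id>",
            "DatabaseName": "salesdb_useast2",
            "Region": "us-east-2",  # source Region of the shared database
        },
    }
)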

You will notice the Shared resource owner region column populate as us-east-2 for the resource link details on the Databases page.

Because the LF-Admin role created the resource link rl_salesdb_from_useast2, the role has implicit permissions on the resource link. LF-Admin already has permissions to query the table in the us-east-2 Region. There is no need to add a Grant on target permission for LF-Admin. If you are granting permission to another user or role, you need to grant Describe permissions on the resource link rl_salesdb_from_useast2.

  9. Query the database using the resource link in Athena as LF-Admin.
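The following is a hedged boto3 sketch of running such a query programmatically through the resource link database; the table name and the results bucket are placeholders for your own values.

import time
import boto3

# Run in the local Region (us-west-2), where the resource link was created.
athena = boto3.client("athena", region_name="us-west-2")

response = athena.start_query_execution(
    QueryString='SELECT * FROM "<table-name>" LIMIT 10',  # any table in the shared database
    QueryExecutionContext={"Database": "rl_salesdb_from_useast2"},
    ResultConfiguration={"OutputLocation": "s3://<your-athena-results-bucket>/"},
)

query_id = response["QueryExecutionId"]
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(2)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
print(state)  # SUCCEEDED when the cross-Region query completes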

In the preceding steps, we saw how to create a resource link in us-west-2 for a Data Catalog database in us-east-2. You can also create a resource link to the source database in any additional Region where the Data Catalog is available. You can run extract, transform, and load (ETL) scripts in Amazon EMR and AWS Glue by providing the additional Region parameter when referring to the database and table. See the API documentation for GetTable() and GetDatabase() for additional details.

Also, Data Catalog permissions for the database, tables, and resource links and the underlying Amazon S3 data permissions can be managed by IAM policies and S3 bucket policies instead of Lake Formation permissions. For more information, see Identity and access management for AWS Glue.

Scenario 2: Cross-account use case

In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region between two accounts: a producer account and a consumer account. To show an advanced use case, we host the source dataset in us-east-2 of account A and crawl it using an AWS Glue crawler in the Data Catalog in us-east-1. The data lake administrator in account A then shares the database and tables to account B using Lake Formation permissions. The data lake administrator in account B accepts the share in us-east-1 and creates resource links to query the tables from eu-west-1. The following diagram illustrates the architecture for cross-Region sharing between producer account A and consumer account B.

Prerequisites

To set up cross-Region sharing of a Data Catalog database for scenario 2, we recommend the following prerequisites:

  • Two AWS accounts that are not used for production use cases
  • Lake Formation administrator roles in both accounts
  • Lake Formation set up in both accounts with cross-account sharing version 3. For more details, refer to the documentation.
  • A sample database in the Data Catalog with a few tables

For our example, we continue to use the same dataset and the data lake administrator role LF-Admin for scenario 2.

Set up account A for cross-Region sharing

To set up account A, complete the following steps:

  1. Sign in to the AWS Management Console as the data lake administrator role.
  2. Register the S3 bucket in Lake Formation in us-east-1 with an IAM role that has access to the S3 bucket. See registering your S3 location for instructions.
  3. Set up and run an AWS Glue crawler to catalog the data in the us-east-2 S3 bucket to the Data Catalog database useast2data_salesdb in us-east-1. Refer to AWS Glue crawlers support cross-account crawling to support data mesh architecture for instructions.

The database, as shown in the following screenshot, has a set of eight tables.

  4. Grant SELECT and DESCRIBE along with grantable permissions on all tables of the database to account B (see the boto3 sketch after this list).

  5. Grant DESCRIBE with grantable permissions on the database.
  6. Verify the granted permissions on the Data permissions page.
  7. Log out of account A.
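The following boto3 sketch shows roughly what the grants in steps 4 and 5 look like through the Lake Formation API; the consumer account ID is a placeholder.

import boto3

# Run in account A, in the Region that holds the catalog database (us-east-1 here).
lakeformation = boto3.client("lakeformation", region_name="us-east-1")
ACCOUNT_B = "<account-b-id>"  # placeholder for the consumer account ID

# Step 4: SELECT and DESCRIBE on all tables of the database, with grant option.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ACCOUNT_B},
    Resource={"Table": {"DatabaseName": "useast2data_salesdb", "TableWildcard": {}}},
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)

# Step 5: DESCRIBE on the database itself, with grant option.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ACCOUNT_B},
    Resource={"Database": {"Name": "useast2data_salesdb"}},
    Permissions=["DESCRIBE"],
    PermissionsWithGrantOption=["DESCRIBE"],
)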

Set up account B for cross-Region sharing

To set up account B, complete the following steps:

  1. Sign in as the data lake administrator on the Lake Formation console in us-east-1.

In our example, we have created the data lake administrator role LF-Admin, similar to previous administrator roles in account A and scenario 1.

  2. On the AWS Resource Access Manager (AWS RAM) console, review and accept the AWS RAM invites corresponding to the shared database and tables from account A.

The LF-Admin role can see the shared database useast2data_salesdb from the producer account. LF-Admin has access to the database and tables and so doesn’t need additional permissions on the shared database.

  3. You can grant DESCRIBE on the database and SELECT on All_Tables permissions to any additional IAM principals from the us-east-1 Region on this shared database.
  4. Open the Lake Formation console in eu-west-1 (or any Region where you have Lake Formation and Athena already set up).
  5. Choose Create database and create a resource link named rl_useast1db_crossaccount, pointing to the us-east-1 database useast2data_salesdb.

You can choose any Region on the Shared database’s region drop-down menu and choose the databases from those Regions.

Because we’re using the data lake administrator role LF-Admin, we can see all databases from all Regions in the consumer account’s Data Catalog. A data lake user with restricted permissions will be able to see only the databases that they have permissions for.

  6. Because LF-Admin created the resource link, this role has permissions to use the resource link rl_useast1db_crossaccount. For additional IAM principals, grant DESCRIBE permissions on the database resource link rl_useast1db_crossaccount.
  7. You can now query the database and tables from Athena.

Considerations

Cross-Region queries involve Amazon S3 data transfer by the analytics services, such as Athena, Amazon EMR, and AWS Glue ETL. As a result, cross-Region queries can be slower and will incur higher transfer costs compared to queries in the same Region. Some analytics services, such as AWS Glue jobs and Amazon EMR, may require internet access when accessing cross-Region data from Amazon S3, depending on your VPC setup. Refer to Considerations and limitations for more considerations.

Conclusion

In this post, you saw examples of how to set up cross-Region resource links for a database in the same account and across two accounts. You also saw how to use cross-Region resource links to query in Athena. You can share selected tables from a database instead of sharing an entire database. With cross-Region sharing, you can create a resource link for the table using the Create table option.

There are two key things to remember when using the cross-Region table access feature:

  • Grant permissions on the source database or table from its source Region.
  • Grant permissions on the resource link from the Region it was created in.

That is, the original shared database or table is always available in the source Region, and resource links are created and shared in their local Region.

To get started, see Accessing tables across Regions. Share your comments on the post or contact your AWS account team for more details.


About the author

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/

With the rapid growth of technology, more and more data is being generated in many different formats—structured, semi-structured, and unstructured. Analytics on operational data in near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to achieve better scalability and performance. In most real-world use cases, it’s important to replicate the data from the relational database source to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and reflect them to other data stores.

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. Although it’s straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this CDC process on your data lakes. Apache Hudi, an open-source data management framework used to simplify incremental data processing and data pipeline development, is a good option to solve this problem.

This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Solution overview

We use an AWS DMS task to capture near-real-time changes in the source RDS instance, and use Amazon Kinesis Data Streams as a destination of the AWS DMS task CDC replication. An AWS Glue streaming job reads and enriches changed records from Kinesis Data Streams and performs an upsert into the S3 data lake in Apache Hudi format. Then we can query the data with Amazon Athena and visualize it in Amazon QuickSight. AWS Glue natively supports continuous write operations for streaming data to Apache Hudi-based tables.

The following diagram illustrates the architecture used for this post, which is deployed through an AWS CloudFormation template.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Source data overview

To illustrate our use case, we assume a data analyst persona who is interested in analyzing near-real-time data for sport events using the table ticket_activity. An example of this table is shown in the following screenshot.

Apache Hudi connector for AWS Glue

For this post, we use AWS Glue 4.0, which already has native support for the Hudi framework. Hudi, an open-source data lake framework, simplifies incremental data processing in data lakes built on Amazon S3. It enables capabilities including time travel queries, ACID (Atomicity, Consistency, Isolation, Durability) transactions, streaming ingestion, CDC, upserts, and deletes.
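The AWS Glue streaming job provisioned later by the CloudFormation template handles the Hudi upsert for you. Purely as an illustration, a simplified sketch of upserting a micro-batch of CDC records into a Hudi table from a Spark Structured Streaming job might look like the following. It assumes a streaming DataFrame cdc_df has already been parsed from Kinesis Data Streams, and the bucket, database, and checkpoint paths are placeholders.

hudi_options = {
    "hoodie.table.name": "ticket_activity",
    "hoodie.datasource.write.recordkey.field": "ticketactivity_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "database_<account_number>_hudi_cdc_demo",
    "hoodie.datasource.hive_sync.table": "ticket_activity",
}

def upsert_to_hudi(batch_df, batch_id):
    # Each micro-batch of CDC records is upserted into the Hudi table on Amazon S3.
    (batch_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://<your-bucket>/hudi/ticket_activity/"))

# cdc_df is assumed to be a streaming DataFrame already read and parsed from Kinesis Data Streams.
(cdc_df.writeStream
    .foreachBatch(upsert_to_hudi)
    .option("checkpointLocation", "s3://<your-bucket>/checkpoints/ticket_activity/")
    .start())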

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An RDS database instance (source).
  • An AWS DMS replication instance, used to replicate the data from the source table to Kinesis Data Streams.
  • A Kinesis data stream.
  • Four AWS Glue Python shell jobs:
    • rds-ingest-rds-setup-<CloudFormation Stack name> – Creates one source table called ticket_activity on Amazon RDS.
    • rds-ingest-data-initial-<CloudFormation Stack name> – Generates sample data at random using the Faker library and loads it into the ticket_activity table.
    • rds-ingest-data-incremental-<CloudFormation Stack name> – Ingests new ticket activity data into the source table ticket_activity continuously. This job simulates customer activity.
    • rds-upsert-data-<CloudFormation Stack name> – Upserts specific records in the source table ticket_activity. This job simulates administrator activity.
  • AWS Identity and Access Management (IAM) users and policies.
  • An Amazon VPC, a public subnet, two private subnets, internet gateway, NAT gateway, and route tables.
    • We use private subnets for the RDS database instance and AWS DMS replication instance.
    • We use the NAT gateway to have reachability to pypi.org to use the MySQL connector for Python from the AWS Glue Python shell jobs. It also provides reachability to Kinesis Data Streams and an Amazon S3 API endpoint.

To set up these resources, you must have the following prerequisites:

The following diagram illustrates the architecture of our provisioned resources.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack
  3. Choose Next.
  4. For S3BucketName, enter the name of your new S3 bucket.
  5. For VPCCIDR, enter a CIDR IP address range that doesn’t conflict with your existing networks.
  6. For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  7. For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  8. For SubnetAzA and SubnetAzB, choose the subnets you want to use.
  9. For DatabaseUserName, enter your database user name.
  10. For DatabaseUserPassword, enter your database user password.
  11. Choose Next.
  12. On the next page, choose Next.
  13. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  14. Choose Create stack.

Stack creation can take about 20 minutes.

Set up an initial source table

The AWS Glue job rds-ingest-rds-setup-<CloudFormation stack name> creates a source table called ticket_activity on the RDS database instance. To set up the initial source table in Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-rds-setup-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

This job creates only one table, ticket_activity, in the MySQL instance. See the following DDL code:

CREATE TABLE ticket_activity (
ticketactivity_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
sport_type VARCHAR(256) NOT NULL,
start_date DATETIME NOT NULL,
location VARCHAR(256) NOT NULL,
seat_level VARCHAR(256) NOT NULL,
seat_location VARCHAR(256) NOT NULL,
ticket_price INT NOT NULL,
customer_name VARCHAR(256) NOT NULL,
email_address VARCHAR(256) NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL )

Ingest new records

In this section, we detail the steps to ingest new records. Complete the following steps to start the jobs.

Start data ingestion to Kinesis Data Streams using AWS DMS

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task rds-to-kinesis-<CloudFormation stack name>.
  3. On the Actions menu, choose Restart/Resume.
  4. Wait for the status to show as Load complete and Replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon S3

Next, to start data ingestion from Kinesis Data Streams to Amazon S3, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose streaming-cdc-kinesis2hudi-<CloudFormation stack name> to open the job.
  3. Choose Run.

Do not stop this job; you can check the run status on the Runs tab and wait for it to show as Running.

Start the data load to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-initial-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

Validate the ingested data

About 2 minutes after starting the job, the data should be ingested into Amazon S3. To validate the ingested data in Athena, complete the following steps:

  1. On the Athena console, complete the following steps if you’re running an Athena query for the first time:
    • On the Settings tab, choose Manage.
    • Specify the staging directory, which is the S3 path where Athena saves the query results.
    • Choose Save.

  2. On the Editor tab, run the following query against the table to check the data:
SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

Note that AWS CloudFormation creates the database name using your account number, as in database_<your-account-number>_hudi_cdc_demo.

Update existing records

Before you update the existing records, note down the ticketactivity_id value of a record from the ticket_activity table. Run the following SQL using Athena. For this post, we use ticketactivity_id = 46 as an example:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

To simulate a real-time use case, update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to Amazon S3. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-incremental-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Choose the Runs tab and wait for Run status to show as SUCCEEDED.

To upsert the records in the source table, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job rds-upsert-data-<CloudFormation stack name>.
  3. On the Job details tab, under Advanced properties, for Job parameters, update the following parameters:
    • For Key, enter --ticketactivity_id.
    • For Value, replace 1 with one of the ticket IDs you noted above (for this post, 46).

  4. Choose Save.
  5. Choose Run and wait for the Run status to show as SUCCEEDED.

This AWS Glue Python shell job simulates a customer buying a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticketactivity_id. It updates ticket_price=500 and updated_at with the current timestamp.
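Although the actual script is provided by the CloudFormation template, a minimal sketch of such an update using the MySQL connector for Python might look like the following; the connection details are placeholders.

import mysql.connector  # the MySQL connector for Python, installed from pypi.org via the NAT gateway

# Placeholder connection details; the actual job resolves these from its job parameters.
connection = mysql.connector.connect(
    host="<rds-endpoint>",
    user="<db-user>",
    password="<db-password>",
    database="<db-name>",
)

ticketactivity_id = 46  # the ID passed through the --ticketactivity_id job argument

cursor = connection.cursor()
cursor.execute(
    "UPDATE ticket_activity "
    "SET ticket_price = 500, updated_at = NOW() "
    "WHERE ticketactivity_id = %s",
    (ticketactivity_id,),
)
connection.commit()
cursor.close()
connection.close()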

To validate the ingested data in Amazon S3, run the same query from Athena for the ticketactivity_id you noted earlier and observe the updated ticket_price and updated_at fields:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" where ticketactivity_id = 46 ;

Visualize the data in QuickSight

After you have the output file generated by the AWS Glue streaming job in the S3 bucket, you can use QuickSight to visualize the Hudi data files. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

Build a QuickSight dashboard

To build a QuickSight dashboard, complete the following steps:

  1. Open the QuickSight console.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

  2. On the QuickSight console, choose your user name and choose Manage QuickSight.
  3. Choose Security & permissions, then choose Manage.
  4. Select Amazon S3 and select the buckets that you created earlier with AWS CloudFormation.
  5. Select Amazon Athena.
  6. Choose Save.
  7. If you changed your Region during the first step of this process, change it back to the Region that you used earlier for the AWS Glue jobs.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena.
  4. For Data source name, enter a name (for example, hudi-blog).
  5. Choose Validate.
  6. After the validation is successful, choose Create data source.
  7. For Database, choose database_<your-account-number>_hudi_cdc_demo.
  8. For Tables, select ticket_activity.
  9. Choose Select.
  10. Choose Visualize.
  11. Choose hour and then ticketactivity_id to get the count of ticketactivity_id by hour.

Clean up

To clean up your resources, complete the following steps:

  1. Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
  2. Navigate to the RDS database and choose Modify.
  3. Deselect Enable deletion protection, then choose Continue.
  4. Stop the AWS Glue streaming job streaming-cdc-kinesis2hudi-<CloudFormation stack name>.
  5. Delete the CloudFormation stack.
  6. On the QuickSight dashboard, choose your user name, then choose Manage QuickSight.
  7. Choose Account settings, then choose Delete account.
  8. Choose Delete account to confirm.
  9. Enter confirm and choose Delete account.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon S3 using an AWS Glue streaming job to create an Apache Hudi-based near-real-time transactional data lake. With this approach, you can easily achieve upsert use cases on Amazon S3. We also showcased how to visualize the Apache Hudi table using QuickSight and Athena. As a next step, refer to the Apache Hudi performance tuning guide for a high-volume dataset. To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.


About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Sundeep Kumar is a Sr. Data Architect, Data Lake at AWS, helping customers build data lake and analytics platforms and solutions. When not building and designing data lakes, Sundeep enjoys listening to music and playing guitar.

Estimating Scope 1 Carbon Footprint with Amazon Athena

Post Syndicated from Thomas Burns original https://aws.amazon.com/blogs/big-data/estimating-scope-1-carbon-footprint-with-amazon-athena/

Today, more than 400 organizations have signed The Climate Pledge, a commitment to reach net-zero carbon by 2040. Some of the drivers that lead to setting explicit climate goals include customer demand, current and anticipated government relations, employee demand, investor demand, and sustainability as a competitive advantage. AWS customers are increasingly interested in ways to drive sustainability actions. In this blog, we will walk through how we can apply existing enterprise data to better understand and estimate Scope 1 carbon footprint using Amazon Simple Storage Service (S3) and Amazon Athena, a serverless interactive analytics service that makes it easy to analyze data using standard SQL.

The Greenhouse Gas Protocol

The Greenhouse Gas Protocol (GHGP) provides standards for measuring and managing global warming impacts from an organization’s operations and value chain.

The greenhouse gases covered by the GHGP are the seven gases required by the UNFCCC/Kyoto Protocol (which is often called the “Kyoto Basket”). These gases are carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), the so-called F-gases (hydrofluorocarbons and perfluorocarbons), sulfur hexafluoride (SF6), and nitrogen trifluoride (NF3). Each greenhouse gas is characterized by its global warming potential (GWP), which is determined by the gas’s greenhouse effect and its lifetime in the atmosphere. Since carbon dioxide (CO2) accounts for about 76 percent of total man-made greenhouse gas emissions, the global warming potential of greenhouse gases is measured relative to CO2 and thus expressed as CO2-equivalent (CO2e).

The GHGP divides an organization’s emissions into three primary scopes:

  • Scope 1 – Direct greenhouse gas emissions (for example from burning fossil fuels)
  • Scope 2 – Indirect emissions from purchased energy (typically electricity)
  • Scope 3 – Indirect emissions from the value chain, including suppliers and customers

How do we estimate greenhouse gas emissions?

There are different methods for estimating GHG emissions, including the Continuous Emissions Monitoring System (CEMS) Method, the Spend-Based Method, and the Consumption-Based Method.

Direct Measurement – CEMS Method

An organization can estimate its carbon footprint from stationary combustion sources by performing a direct measurement of carbon emissions using the CEMS method. This method requires continuously measuring the pollutants emitted in exhaust gases from each emissions source using equipment such as gas analyzers, gas samplers, gas conditioning equipment (to remove particulate matter, water vapor and other contaminants), plumbing, actuated valves, Programmable Logic Controllers (PLCs) and other controlling software and hardware. Although this approach may yield useful results, CEMS requires specific sensing equipment for each greenhouse gas to be measured, requires supporting hardware and software, and is typically more suitable for Environment Health and Safety applications of centralized emission sources. More information on CEMS is available here.

Spend-Based Method

Because the financial accounting function is mature and often already audited, many organizations choose to use financial controls as a foundation for their carbon footprint accounting. The Economic Input-Output Life Cycle Assessment (EIO LCA) method is a spend-based method that combines expenditure data with monetary-based emission factors to estimate the emissions produced. The emission factors are published by the U.S. Environment Protection Agency (EPA) and other peer-reviewed academic and government sources. With this method, you can multiply the amount of money spent on a business activity by the emission factor to produce the estimated carbon footprint of the activity.

For example, you can convert the amount your company spends on truck transport to estimated kilograms (KG) of carbon dioxide equivalent (CO₂e) emitted as shown below.

Estimated Carbon Footprint = Amount of money spent on truck transport * Emission Factor [1]

Although these computations are very easy to make from general ledgers or other financial records, they are most valuable for initial estimates or for reporting minor sources of greenhouse gases. As the only user-provided input is the amount spent on an activity, EIO LCA methods aren’t useful for modeling improved efficiency. This is because the only way to reduce EIO-calculated emissions is to reduce spending. Therefore, as a company continues to improve its carbon footprint efficiency, other methods of estimating carbon footprint are often more desirable.

Consumption-Based Method

From either Enterprise Resource Planning (ERP) systems or electronic copies of fuel bills, it’s straightforward to determine the amount of fuel an organization procures during a reporting period. Fuel-based emission factors are available from a variety of sources such as the US Environmental Protection Agency and commercially-licensed databases. Multiplying the amount of fuel procured by the emission factor yields an estimate of the CO2e emitted through combustion. This method is often used for estimating the carbon footprint of stationary emissions (for instance backup generators for data centers or fossil fuel ovens for industrial processes).

If for a particular month an enterprise consumed a known amount of motor gasoline for stationary combustion, the Scope 1 CO2e footprint of the stationary gasoline combustion can be estimated in the following manner:

Estimated Carbon Footprint = Amount of Fuel Consumed * Stationary Combustion Emission Factor[2]
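As a quick illustration, the following Python snippet reproduces the arithmetic from footnote [2]; the values are the example figures from that footnote, not measured data.

# Consumption-based estimate: fuel consumed * emission factor.
gallons_consumed = 1000            # US gallons of motor gasoline (stationary combustion)
emission_factor_kg_per_gal = 8.81  # kg CO2e per gallon (US EPA stationary combustion factor)

estimated_co2e_kg = gallons_consumed * emission_factor_kg_per_gal
print(f"Estimated Scope 1 footprint: {estimated_co2e_kg:,.0f} kg CO2e")  # 8,810 kg CO2e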

Organizations may estimate their carbon emissions by using existing data found in fuel and electricity bills, ERP data, and relevant emissions factors, which are then consolidated into a data lake. Using existing analytics tools such as Amazon Athena and Amazon QuickSight, an organization can gain insight into its estimated carbon footprint.

The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint.

Analytics Architecture

Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case. For example, in the data ingestion phase, depending on the existing data requirements, there are many options to ingest data into the data lake such as using the AWS Command Line Interface (CLI), AWS DataSync, or AWS Database Migration Service.

Example of calculating a Scope 1 stationary emissions footprint with AWS services

Let’s assume you burned 100 standard cubic feet (scf) of natural gas in an oven. Using the US EPA emission factors for stationary emissions we can estimate the carbon footprint associated with the burning. In this case the emission factor is 0.05449555 Kg CO2e /scf.[3]

Amazon S3 is ideal for building a data lake on AWS to store disparate data sources in a single repository, due to its virtually unlimited scalability and high durability. Athena, a serverless interactive query service, allows the analysis of data directly from Amazon S3 using standard SQL without having to load the data into Athena or run complex extract, transform, and load (ETL) processes. Amazon QuickSight supports creating visualizations of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to extract a subset of the data. QuickSight dashboards can provide you with insights (such as your company’s estimated carbon footprint) quickly, and also provide the ability to generate standardized reports for your business and sustainability users.

In this example, the sample data is stored in a file system and uploaded to Amazon S3 using the AWS Command Line Interface (CLI) as shown in the following architecture diagram. AWS recommends creating AWS resources and managing CLI access in accordance with the Best Practices for Security, Identity, & Compliance guidance.

The AWS CLI command below demonstrates how to upload the sample data folders into the S3 target location.

aws s3 cp /path/to/local/folder s3://bucket-name/path/to/destination --recursive

The snapshot of the S3 console shows two newly added folders that contain the files.

S3 Bucket Overview of Files

To create new table schemas, we start by running the following script for the gas utilization table in the Athena query editor using Hive DDL. The script defines the data format, column details, table properties, and the location of the data in S3.

CREATE EXTERNAL TABLE `gasutilization`(
`fuel_id` int,
`month` string,
`year` int,
`usage_therms` float,
`usage_scf` float,
`g-nr1_schedule_charge` float,
`accountfee` float,
`gas_ppps` float,
`netcharge` float,
`taxpercentage` float,
`totalcharge` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gasutilization'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

Athena Hive DDL

The following script shows another example of using Hive DDL to generate the table schema for the gas emission factor data.

CREATE EXTERNAL TABLE `gas_emission_factor`(
`fuel_id` int,
`gas_name` string,
`emission_factor` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gas_emission_factor'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

After creating the table schema in Athena, we run the below query against the gas utilization table that includes details of gas bills to show the gas utilization and the associated charges, such as gas public purpose program surcharge (PPPS) and total charges after taxes for the year of 2020:

SELECT * FROM "gasutilization" where year = 2020;

Athena gas utilization overview by month

We are also able to analyze the emission factor data showing the different fuel types and their corresponding CO2e emission as shown in the screenshot.

athena co2e emission factor

With the emission factor and the gas utilization data, we can run the following query below to get an estimated Scope 1 carbon footprint alongside other details. In this query, we joined the gas utilization table and the gas emission factor table on fuel id and multiplied the gas usage in standard cubic foot (scf) by the emission factor to get the estimated CO2e impact. We also selected the month, year, total charge, and gas usage measured in therms and scf, as these are often attributes that are of interest for customers.

SELECT "gasutilization"."usage_scf" * "gas_emission_factor"."emission_factor" 
AS "estimated_CO2e_impact", 
"gasutilization"."month", 
"gasutilization"."year", 
"gasutilization"."totalcharge", 
"gasutilization"."usage_therms", 
"gasutilization"."usage_scf" 
FROM "gasutilization" 
JOIN "gas_emission_factor" 
on "gasutilization"."fuel_id"="gas_emission_factor"."fuel_id";

athena join

Lastly, Amazon QuickSight allows visualization of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to get a subset of the data. The following is an example of a QuickSight dashboard showing the gas utilization, gas charges, and estimated carbon footprint across different years.

QuickSight sample dashboard

We have just estimated the Scope 1 carbon footprint for one source of stationary combustion. If we were to do the same process for all sources of stationary and mobile emissions (with different emissions factors) and add the results together, we could roll up an accurate estimate of our Scope 1 carbon emissions for the entire business by only utilizing native AWS services and our own data. A similar process will yield an estimate of Scope 2 emissions, with grid carbon intensity in the place of Scope 1 emission factors.

Summary

This blog discusses how organizations can use existing data in disparate sources to build a data architecture to gain better visibility into Scope 1 greenhouse gas emissions. With Athena, S3, and QuickSight, organizations can now estimate their stationary emissions carbon footprint in a repeatable way by applying the consumption-based method to convert fuel utilization into an estimated carbon footprint.

Other approaches available on AWS include Carbon Accounting on AWS, Sustainability Insights Framework, Carbon Data Lake on AWS, and general guidance detailed at the AWS Carbon Accounting Page.

If you are interested in information on estimating your organization’s carbon footprint with AWS, please reach out to your AWS account team and check out AWS Sustainability Solutions.

References

  1. An example from page four of Amazon’s Carbon Methodology document illustrates this concept.
    Amount spent on truck transport: $100,000
    EPA Emission Factor: 1.556 KG CO2e /dollar of truck transport
    Estimated CO₂e emission: $100,000 * 1.556 KG CO₂e/dollar of truck transport = 155,600 KG of CO2e
  2. For example,
    Gasoline consumed: 1,000 US Gallons
    EPA Emission Factor: 8.81 Kg of CO2e /gallon of gasoline combusted
    Estimated CO2e emission = 1,000 US Gallons * 8.81 Kg of CO2e per gallon of gasoline consumed= 8,810 Kg of CO2e.
    EPA Emissions Factor for stationary emissions of motor gasoline is 8.78 kg CO2 plus .38 grams of CH4, plus .08 g of N2O.
    Combining these emission factors using 100-year global warming potential for each gas (CH4:25 and N2O:298) gives us Combined Emission Factor = 8.78 kg + 25*.00038 kg + 298 *.00008 kg = 8.81 kg of CO2e per gallon.
  3. The Emission factor per scf is 0.05444 kg of CO2 plus 0.00103 g of CH4 plus 0.0001 g of N2O. To get this in terms of CO2e we need to multiply the emission factor of the other two gases by their global warming potentials (GWP). The 100-year GWP for CH4  and N2O are 25 and 298 respectively. Emission factors and GWPs come from the US EPA website.


About the Authors


Thomas Burns, SCR, CISSP, is a Principal Sustainability Strategist and Principal Solutions Architect at Amazon Web Services. Thomas supports manufacturing and industrial customers world-wide. Thomas’s focus is using the cloud to help companies reduce their environmental impact both inside and outside of IT.

Aileen Zheng is a Solutions Architect supporting US Federal Civilian Sciences customers at Amazon Web Services (AWS). She partners with customers to provide technical guidance on enterprise cloud adoption and strategy and helps with building well-architected solutions. She is also very passionate about data analytics and machine learning. In her free time, you’ll find Aileen doing pilates, taking her dog Mumu out for a hike, or hunting down another good spot for food! You’ll also see her contributing to projects to support diversity and women in technology.

Amazon Kinesis Data Streams on-demand capacity mode now scales up to 1 GB/second ingest capacity

Post Syndicated from Nihar Sheth original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-streams-on-demand-capacity-mode-now-scales-up-to-1-gb-second-ingest-capacity/

Amazon Kinesis Data Streams is a serverless data streaming service that makes it easy to capture, process, and store streaming data at any scale. As customers collect and stream more types of data, they have asked for simpler, elastic data streams that can handle variable and unpredictable data traffic. In November 2021, Amazon Web Services launched the on-demand capacity mode for Kinesis Data Streams, which is capable of serving gigabytes of write and read throughput per minute and helps reduce the operational pain point of manually updating data stream capacity. You can create a new on-demand data stream or convert an existing data stream to on-demand mode with a single click and never have to provision and manage servers, storage, or throughput. By default, on-demand capacity mode can automatically scale up to 200 MB/s of write throughput.

We were encouraged by customers’ adoption of on-demand capacity mode, but as customers scaled their workloads, some ran into the 200 MB/s data ingestion limit and asked for a solution. The team worked backward from customer feedback to raise that limit. As of March 2023, Kinesis Data Streams supports an increased on-demand write throughput limit of 1 GB/s, a five-times increase from the previous limit of 200 MB/s. It’s like having a truly serverless and elastic data streaming service that works for all your use cases. If you require an increase in capacity, you can contact AWS Support to enable on-demand streams to scale up to 1 GB/s write throughput for each requested account. You pay for throughput consumed rather than for provisioned resources, making it easier to balance costs and performance. Overall, if your data volume can spike unpredictably or you don’t want to manage the number of shards, use on-demand streams.

In this post, we explore how to use Kinesis Data Streams on-demand scaling and best practices to build an efficient data-streaming solution. We discuss different scenarios to avoid write throughput exceptions and scale ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode.

Kinesis Data Streams on-demand scaling

A shard serves as a base throughput unit of Kinesis Data Streams. A shard supports 1 MB/s and 1,000 records/s for writes and 2 MB/s for reads. The shard limits ensure predictable performance, making it easy to design and operate a highly reliable data streaming workflow. In on-demand capacity mode, scaling happens at the individual shard level. When the average ingest shard utilization reaches 50% (0.5 MB/s or 500 records/s) in 1 minute, then a shard is split into two shards. If you use random values as a partition key, all shards of the stream will have even traffic, and they will be scaled at the same time. If you use a business-specific key as a partition key, the shards will have uneven traffic. In that scenario, only the shards exceeding an average of 50% utilization will be scaled. Depending upon the number of shards being scaled, it will take up to 15 minutes to split the shards.

When we create a new Kinesis data stream in on-demand capacity mode, by default, Kinesis Data Streams provisions four shards, which provides 4 MB/s write and 8 MB/s read throughput. As the workload ramps up, Kinesis Data Streams increases the number of shards in the stream by monitoring ingest throughput at the shard level. The 4 MB/s default ingest throughput and scaling at shard level in on-demand capacity mode works for most use cases. However, in some specific scenarios, producers may face WriteThroughputExceeded and Rate Exceeded errors, even in on-demand capacity mode. We discuss a few of these scenarios in the following sections and strategies to avoid these errors.

You can create and save record templates and easily send data to Kinesis Data Streams using the Amazon Kinesis Data Generator (KDG) to test the streaming data solution. Alternatively, you can also use the modern load testing framework Locust to run large-scale Kinesis Data Streams load testing. For this post, we use the Locust tool to produce and ingest messages in Kinesis Data Streams for our different use cases.
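If you want to reproduce the load without Locust, the following is a minimal boto3 producer sketch that writes batches of records with random partition keys; the stream name is a placeholder and the payload is arbitrary test data.

import json
import random
import string
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def build_records(count):
    # Small JSON payloads with random partition keys so that traffic
    # spreads evenly across all shards of the stream.
    return [
        {
            "Data": json.dumps({"id": i, "payload": "x" * 1024}).encode("utf-8"),
            "PartitionKey": "".join(random.choices(string.ascii_letters, k=16)),
        }
        for i in range(count)
    ]

for _ in range(60):  # send 60 batches; PutRecords accepts up to 500 records per call
    response = kinesis.put_records(
        StreamName="<your-stream-name>",
        Records=build_records(500),
    )
    if response["FailedRecordCount"]:
        # Throttled records are reported here and should be retried by a real producer.
        print("Throttled records:", response["FailedRecordCount"])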

Scenario 1: A baseline ingest throughput greater than 4 MB/s is needed

To simulate this scenario, run the following AWS Command Line Interface (AWS CLI) command to create the kds-od-default-shards data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-default-shards data stream is active, run the following AWS CLI command to check the number of shards in the data stream:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

You can observe that the OpenShardCount value is 4, which means the kds-od-default-shards data stream has an ingest capacity of 4 MB/s.

Next, we use the Locust tool to set the baseline ingest throughput to approximately 25 MB/s. As displayed in the following Amazon CloudWatch metrics graph, records are throttled for the first couple of minutes. Then the kds-od-default-shards data stream scales the number of shards to support 25 MB/s of ingest throughput, and records stop getting throttled. You can also rerun the describe-stream-summary AWS CLI command to check the increased number of shards in the data stream.

BDB-3047-scenario-1-incoming-data

BDB-3047-scenario-1-record-throttle

In a scenario where we know our ingest throughput baseline (25 MB/s) ahead of the time and we don’t want to observe any write throttles, we can create a stream in provisioned mode by specifying the number of shards (30), as shown in the following AWS CLI command (make sure to delete kds-od-default-shards manually from the Kinesis Data Streams console before running the following command):

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=PROVISIONED --shard-count 30 --region us-east-1

When the kds-od-default-shards data stream is active, run the following AWS CLI command to convert the data stream’s capacity mode to on-demand:

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

Next, we send 25 MB/s records to the kds-od-default-shards data stream. As displayed in the following CloudWatch metrics graph, we can observe no write throttles, and the kds-od-default-shards data stream scales the number of shards to handle the increase in ingest volume.

BDB-3047-scenario-1-incoming-data1

BDB-3047-scenario-1-record-throttle1

After we send 25 MB/s traffic to the data stream for some time, we can run the following AWS CLI command to see that the OpenShardCount value has increased to more than 30:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

Scenario 2: A significant ingestion spike is expected, which needs ingest throughput greater than the number of shards in the stream

To simulate the scenario, run the following AWS CLI command to create the kds-od-significant-spike data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-significant-spike --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the kds-od-significant-spike data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 4 MB/s ingest throughput initially and grow the ingest throughput by 30–50% every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-significant-spike data stream scales the number of shards to handle the increase in ingest volume.

After approximately 15 minutes, run the following AWS CLI command to find the OpenShardCount value (x) of the kds-od-significant-spike data stream. Then send (x * 2) MB/s ingest throughput to the data stream for 2–3 minutes and reduce the ingest throughput back to the prior level:

aws kinesis describe-stream-summary --stream-name kds-od-significant-spike --region us-east-1

As displayed in the following CloudWatch metrics graph, the records are getting throttled for a few minutes, and then the throttling goes away.

BDB-3047-scenario-2-incoming-data

BDB-3047-scenario-2-record-throttle

Typically, we face a significant spike scenario when running planned events, such as shopping holidays and product launches. To handle such scenarios, we can proactively change capacity mode from on-demand to provisioned. We can configure the number of shards and pick the ingest capacity we anticipate. After we successfully scale the number of shards to our desired peak capacity in provisioned capacity mode, we can change the capacity mode back to on-demand mode.

Scenario 3: A single partition key starts pushing more than 1 MB/s

Partition keys are used to segregate and route records to different shards of a stream. A partition key is specified by the data producer while adding data to the data stream. For example, let’s assume we have a stream with two shards (shard 1 and shard 2). We can configure the data producer to use two partition keys (key A and key B) so that all records with key A are added to shard 1 and all records with key B are added to shard 2. Choosing a partition key is a very important decision, and we should pick it carefully to ensure an even distribution of records across all the shards of the stream. Records tied to a single partition key A are always routed to a single shard (shard 1) and cannot be distributed across different shards. As mentioned earlier, by default, one shard supports 1 MB/s and 1,000 records/s for writes, so we may end up with an edge case where we are trying to push more than 1 MB/s for a specific partition key. In this scenario, producers will continue to experience throttles and keep retrying indefinitely.

To simulate the scenario, run the following AWS CLI command to create the kds-od-partition-key-throttle data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-partition-key-throttle --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 1.5 MB/s ingest throughput continuously for the specific partition key A. As displayed in the following CloudWatch metrics graph, we can observe that throttling continues on a single shard even though we are sending only 1.5 MB/s of ingest throughput while the kds-od-partition-key-throttle data stream has an overall ingest capacity of 4 MB/s.

BDB-3047-scenario-3-incoming-data

BDB-3047-scenario-3-record-throttle

To avoid this scenario, we should carefully pick our partition key and ensure that this specific partition key won’t be continuously sending more than 1 MB/s ingest throughput in the data stream.
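If a single business key genuinely needs more than 1 MB/s, a common mitigation (at the cost of strict per-key ordering) is to salt the key with a small random suffix so that its records spread across multiple shards. The following sketch illustrates the idea; the stream name, key, and number of salt buckets are placeholders.

import random
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

HOT_KEY = "A"      # the business key that exceeds 1 MB/s on its own
SALT_BUCKETS = 8   # spread the hot key over up to 8 shards

def salted_key(key):
    # e.g. "A-3"; consumers must strip or group on the prefix to reassemble the key.
    return f"{key}-{random.randrange(SALT_BUCKETS)}"

kinesis.put_record(
    StreamName="<your-stream-name>",
    Data=b'{"example": "payload"}',
    PartitionKey=salted_key(HOT_KEY),
)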

Scale the ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode

To test, we start with approximately 100 MB/s of baseline ingest throughput to Kinesis Data Streams in on-demand capacity mode, and then we increase the ingest throughput rate by 30–50% every 5–10 minutes using the Locust load testing tool.

To set up the scenario, first create the kds-od-1gb-stream data stream in provisioned capacity mode and provide a value of 120 for the provisioned shards field:

aws kinesis create-stream --stream-name kds-od-1gb-stream --stream-mode-details StreamMode=PROVISIONED --shard-count 120 --region us-east-1

When the kds-od-1gb-stream data stream is active, switch its capacity mode to on-demand, as shown in the following code. When we change capacity mode from provisioned to on-demand, the shard count (120) remains the same for the data stream even in on-demand capacity mode.

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-1gb-stream --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-1gb-stream data stream is in on-demand mode, start the experiment. We send approximately 100 MB/s baseline ingest throughput using the Locust tool and increase 30–50% ingest throughput every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-1gb-stream data stream seamlessly scaled to 1 GB/s in on-demand capacity mode. We can also observe that the producers didn’t encounter any write throttles while the data stream was scaling in on-demand capacity mode.

BDB-3047-scale-to-1-GB

Clean up

To avoid ongoing costs, delete all the data streams that you created as part of this post using the Kinesis Data Streams console.

Conclusion

This post demonstrated the on-demand scaling policy of Kinesis Data Streams with a few scenarios using best practices and showed how to scale ingest capacity to 1 GB/s in on-demand capacity mode. You can have an on-demand write throughput limit that is five times larger than the previous limit of 200 MB/s. Choose on-demand mode if you create new data streams with unknown workloads, have unpredictable application traffic, or prefer not to manage capacity. You can switch between on-demand and provisioned capacity modes two times per 24-hour rolling period. Please leave any feedback in the comments section.


About the Authors

Nihar Sheth is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Pratik Patel is a Sr. Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively keep customers’ AWS environments operationally healthy.

Nisha Dekhtawala is a Partner Solutions Architect and data analytics specialist. She works with global consulting partners as their trusted advisor, providing technical guidance and support in building Well-Architected innovative industry solutions.

Migrate your existing SQL-based ETL workload to an AWS serverless ETL infrastructure using AWS Glue

Post Syndicated from Mitesh Patel original https://aws.amazon.com/blogs/big-data/migrate-your-existing-sql-based-etl-workload-to-an-aws-serverless-etl-infrastructure-using-aws-glue/

Data has become an integral part of most companies, and the complexity of data processing is increasing rapidly with the exponential growth in the amount and variety of data. Data engineering teams are faced with the following challenges:

  • Manipulating data to make it consumable by business users
  • Building and improving extract, transform, and load (ETL) pipelines
  • Scaling their ETL infrastructure

Many customers migrating data to the cloud are looking for ways to modernize by using native AWS services to further scale and efficiently handle ETL tasks. In the early stages of their cloud journey, customers may need guidance on modernizing their ETL workload with minimal effort and time. Customers often use many SQL scripts to select and transform the data in relational databases hosted either in an on-premises environment or on AWS and use custom workflows to manage their ETL.

AWS Glue is a serverless data integration and ETL service with the ability to scale on demand. In this post, we show how you can migrate your existing SQL-based ETL workload to AWS Glue using Spark SQL, which minimizes the refactoring effort.

Solution overview

The following diagram describes the high-level architecture for our solution. This solution decouples the ETL and analytics workloads from our transactional data source Amazon Aurora, and uses Amazon Redshift as the data warehouse solution to build a data mart. In this solution, we employ AWS Database Migration Service (AWS DMS) for both full load and continuous replication of changes from Aurora. AWS DMS enables us to capture deltas, including deletes from the source database, through its change data capture (CDC) configuration, without writing code and without missing any changes, which is critical for the integrity of the data. Refer to CDC support in AWS DMS to extend the solution for ongoing CDC.

The workflow includes the following steps:

  1. AWS Database Migration Service (AWS DMS) connects to the Aurora data source.
  2. AWS DMS replicates data from Aurora and migrates to the target destination Amazon Simple Storage Service (Amazon S3) bucket.
  3. AWS Glue crawlers automatically infer schema information of the S3 data and integrate into the AWS Glue Data Catalog.
  4. AWS Glue jobs run ETL code to transform and load the data to Amazon Redshift.

For this post, we use the TPCH dataset for sample transactional data. The dataset consists of eight tables. The relationships between columns in these tables are illustrated in the following diagram.

We use Amazon Redshift as the data warehouse to implement the data mart solution. The data mart fact and dimension tables are created in the Amazon Redshift database. The following diagram illustrates the relationships between the fact (ORDER) and dimension tables (DATE, PARTS, and REGION).

Set up the environment

To get started, we set up the environment using AWS CloudFormation. Complete the following steps:

  1. Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.
  2. Choose Launch Stack to open the page in a new tab:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. In the Parameters section, enter the required parameters.
  6. Choose Next.

  1. On the Configure stack options page, leave all values as default and choose Next.
  2. On the Review stack page, select the check boxes to acknowledge the creation of IAM resources.
  3. Choose Submit.

Wait for the stack creation to complete. You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you will see the status CREATE_COMPLETE. The stack takes approximately 25–30 minutes to complete.

This template configures the following resources:

  • The Aurora MySQL instance sales-db.
  • The AWS DMS task dmsreplicationtask-* for full load of data and replicating changes from Aurora (source) to Amazon S3 (destination).
  • AWS Glue crawlers s3-crawler and redshift_crawler.
  • The AWS Glue database salesdb.
  • AWS Glue jobs insert_region_dim_tbl, insert_parts_dim_tbl, and insert_date_dim_tbl. We use these jobs for the use cases covered in this post. We create the insert_orders_fact_tbl AWS Glue job manually using AWS Glue Visual Studio.
  • The Redshift cluster blog_cluster with database sales and fact and dimension tables.
  • An S3 bucket to store the output of the AWS Glue job runs.
  • IAM roles and policies with appropriate permissions.

Replicate data from Aurora to Amazon S3

Now let’s look at the steps to replicate data from Aurora to Amazon S3 using AWS DMS:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task dmsreplicationtask-* and on the Action menu, choose Restart/Resume.

This will start the replication task to replicate the data from Aurora to the S3 bucket. Wait for the task status to change to Full Load Complete. The data from the Aurora tables is now copied to the S3 bucket under a new folder, sales.
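
You can also poll the task status from the AWS CLI instead of the console; the task identifier below is a placeholder, because the CloudFormation stack appends a suffix to the dmsreplicationtask-* name:

aws dms describe-replication-tasks --filters Name=replication-task-id,Values=<your-dms-task-id> --query 'ReplicationTasks[].{Status:Status,FullLoadPercent:ReplicationTaskStats.FullLoadProgressPercent}'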

Create AWS Glue Data Catalog tables

Now let’s create AWS Glue Data Catalog tables for the S3 data and Amazon Redshift tables:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.
  2. Select RedshiftConnection and on the Actions menu, choose Edit.
  3. Choose Save changes.
  4. Select the connection again and on the Actions menu, choose Test connection.
  5. For IAM role, choose GlueBlogRole.
  6. Choose Confirm.

Testing the connection can take approximately 1 minute. You will see the message “Successfully connected to the data store with connection blog-redshift-connection.” If you have trouble connecting successfully, refer to Troubleshooting connection issues in AWS Glue.

  1. Under Data Catalog in the navigation pane, choose Crawlers.
  2. Select s3_crawler and choose Run.

This will generate eight tables in the AWS Glue Data Catalog. To view the tables created, in the navigation pane, choose Databases under Data Catalog, then choose salesdb.
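
If you prefer the AWS CLI, a quick way to confirm the crawler output is to list the tables in the database (this assumes your default Region matches the one used by the stack):

aws glue get-tables --database-name salesdb --query 'TableList[].Name' --output text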

  1. Repeat the steps to run redshift_crawler and generate four additional tables.

If the crawler fails, refer to Error: Running crawler failed.

Create SQL-based AWS Glue jobs

Now let’s look at how the SQL statements are used to create ETL jobs using AWS Glue. AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on virtual resources that it provisions and manages in its own service account. AWS Glue Studio is a graphical interface that makes it simple to create, run, and monitor ETL jobs in AWS Glue. You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, perform a transformation of that data, and save the result set in a data target.

Let’s go through the steps of creating an AWS Glue job for loading the orders fact table using AWS Glue Studio.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Visual with a blank canvas, then choose Create.

  1. Navigate to the Job details tab.
  2. For Name, enter insert_orders_fact_tbl.
  3. For IAM Role, choose GlueBlogRole.
  4. For Job bookmark, choose Enable.
  5. Leave all other parameters as default and choose Save.

  1. Navigate to the Visual tab.
  2. Choose the plus sign.
  3. Under Add nodes, enter Glue in the search bar and choose AWS Glue Data Catalog (Source) to add the Data Catalog as the source.

  1. In the right pane, on the Data source properties – Data Catalog tab, choose salesdb for Database and customer for Table.

  1. On the Node properties tab, for Name, enter Customers.

  1. Repeat these steps for the Orders and LineItem tables.

This concludes creating data sources on the AWS Glue job canvas. Next, we add transformations by combining data from these different tables.

Transform the data

Complete the following steps to add data transformations:

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose SQL Query.
  3. On the Transform tab, for Node parents, select all the three data sources.
  4. On the Transform tab, under SQL query, enter the following query:
SELECT orders.o_orderkey        AS ORDERKEY,
orders.o_orderdate       AS ORDERDATE,
lineitem.l_linenumber    AS LINENUMBER,
lineitem.l_partkey       AS PARTKEY,
lineitem.l_receiptdate   AS RECEIPTDATE,
lineitem.l_quantity      AS QUANTITY,
lineitem.l_extendedprice AS EXTENDEDPRICE,
orders.o_custkey         AS CUSTKEY,
customer.c_nationkey     AS NATIONKEY,
CURRENT_TIMESTAMP        AS UPDATEDATE
FROM   orders orders,
lineitem lineitem,
customer customer
WHERE  orders.o_orderkey = lineitem.l_orderkey
AND orders.o_custkey = customer.c_custkey
  1. Update the SQL alias values as shown in the following screenshot.

  1. On the Data preview tab, choose Start data preview session.
  2. When prompted, choose GlueBlogRole for IAM role and choose Confirm.

The data preview process will take a minute to complete.

  1. On the Output schema tab, choose Use data preview schema.

You will see the output schema similar to the following screenshot.

Now that we have previewed the data, we change a few data types.

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose Change Schema.
  3. Select the node.
  4. On the Transform tab, update the Data type values as shown in the following screenshot.

Now let’s add the target node.

  1. Choose the Change Schema node and choose the plus sign.
  2. In the search bar, enter target.
  3. Choose Amazon Redshift as the target.

  1. Choose the Amazon Redshift node, and on the Data target properties – Amazon Redshift tab, for Redshift access type, select Direct data connection.
  2. Choose RedshiftConnection for Redshift Connection, public for Schema, and order_table for Table.
  3. Select Merge data into target table under Handling of data and target table.
  4. Choose orderkey for Matching keys.

  1. Choose Save.

AWS Glue Studio automatically generates the Spark code for you. You can view it on the Script tab. If you would like to apply transformations beyond what is available out of the box, you can modify the Spark code. The AWS Glue job uses Apache Spark SQL for the SQL query transformation. To find the available Spark SQL transformations, refer to the Spark SQL documentation.

  1. Choose Run to run the job.

As part of the CloudFormation stack, three other jobs are created to load the dimension tables.

  1. Navigate back to the Jobs page on the AWS Glue console, select the job insert_parts_dim_tbl, and choose Run.

This job uses the following SQL to populate the parts dimension table:

SELECT part.p_partkey,
part.p_type,
part.p_brand
FROM   part part
  1. Select the job insert_region_dim_tbl and choose Run.

This job uses the following SQL to populate the region dimension table:

SELECT nation.n_nationkey,
nation.n_name,
region.r_name
FROM   nation,
region
WHERE  nation.n_regionkey = region.r_regionkey
  1. Select the job insert_date_dim_tbl and choose Run.

This job uses the following SQL to populate the date dimension table:

SELECT DISTINCT( l_receiptdate )        AS DATEKEY,
Dayofweek(l_receiptdate) AS DAYOFWEEK,
Month(l_receiptdate)     AS MONTH,
Year(l_receiptdate)      AS YEAR,
Day(l_receiptdate)       AS DATE
FROM   lineitem lineitem

You can view the status of the running jobs by navigating to the Job run monitoring section on the Jobs page. Wait for all the jobs to complete. These jobs will load the data into the facts and dimension tables in Amazon Redshift.
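
You can also check the run states of a job from the AWS CLI, for example:

aws glue get-job-runs --job-name insert_orders_fact_tbl --query 'JobRuns[].JobRunState'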

To help optimize the resources and cost, you can use the AWS Glue Auto Scaling feature.

Verify the Amazon Redshift data load

To verify the data load, complete the following steps:

  1. On the Amazon Redshift console, select the cluster blog-cluster and on the Query Data menu, choose Query in query editor 2.
  2. For Authentication, select Temporary credentials.
  3. For Database, enter sales.
  4. For User name, enter admin.
  5. Choose Save.

  1. Run the following commands in the query editor to verify that the data is loaded into the Amazon Redshift tables:
SELECT *
FROM   sales.PUBLIC.order_table;

SELECT *
FROM   sales.PUBLIC.date_table;

SELECT *
FROM   sales.PUBLIC.parts_table;

SELECT *
FROM   sales.PUBLIC.region_table;

The following screenshot shows the results from one of the SELECT queries.

Now, to test CDC, update the quantity of a line item for order number 1 in the Aurora database using the following query. (To connect to your Aurora cluster, use AWS Cloud9 or any SQL client tool, such as the MySQL command-line client.)

UPDATE lineitem SET l_quantity = 100 WHERE l_orderkey = 1 AND l_linenumber = 4;

AWS DMS replicates the change to the S3 bucket, as shown in the following screenshot.

Re-running the AWS Glue job insert_orders_fact_tbl applies the change to the ORDER fact table, as shown in the following screenshot.
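
You can rerun the job from the AWS Glue console, or start it from the AWS CLI as a quick alternative:

aws glue start-job-run --job-name insert_orders_fact_tbl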

Clean up

To avoid incurring future charges, delete the resources created for the solution:

  1. On the Amazon S3 console, select the S3 bucket created as part of the CloudFormation stack, then choose Empty.
  2. On the AWS CloudFormation console, select the stack that you created initially and choose Delete to delete all the resources created by the stack.

Conclusion

In this post, we showed how you can migrate existing SQL-based ETL to an AWS serverless ETL infrastructure using AWS Glue jobs. We used AWS DMS to migrate data from Aurora to an S3 bucket, then SQL-based AWS Glue jobs to move the data to fact and dimension tables in Amazon Redshift.

This solution demonstrates a one-time data load from Aurora to Amazon Redshift using AWS Glue jobs. You can extend this solution for moving the data on a scheduled basis by orchestrating and scheduling jobs using AWS Glue workflows. To learn more about the capabilities of AWS Glue, refer to AWS Glue.


About the Authors

Mitesh Patel is a Principal Solutions Architect at AWS with a specialization in data analytics and machine learning. He is passionate about helping customers build scalable, secure, and cost-effective cloud-native solutions on AWS to drive business growth. He lives in the DC Metro area with his wife and two kids.

Sumitha AP is a Sr. Solutions Architect at AWS. She works with customers and helps them attain their business objectives by designing secure, scalable, reliable, and cost-effective solutions in the AWS Cloud. She has a focus on data and analytics and provides guidance on building analytics solutions on AWS.

Deepti Venuturumilli is a Sr. Solutions Architect at AWS. She works with commercial segment customers and AWS partners to accelerate customers’ business outcomes by providing expertise in AWS services and modernizing their workloads. She focuses on data analytics workloads and setting up a modern data strategy on AWS.

Deepthi Paruchuri is an AWS Solutions Architect based in NYC. She works closely with customers to build cloud adoption strategy and solve their business needs by designing secure, scalable, and cost-effective solutions in the AWS cloud.

Deploy serverless applications in a multicloud environment using Amazon CodeCatalyst

Post Syndicated from Deepak Kovvuri original https://aws.amazon.com/blogs/devops/deploy-serverless-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

Amazon CodeCatalyst is an integrated service for software development teams adopting continuous integration and deployment practices into their software development process. CodeCatalyst puts the tools you need all in one place. You can plan work, collaborate on code, and build, test, and deploy applications by leveraging CodeCatalyst Workflows.

Introduction

In the first post of the blog series, we showed you how organizations can deploy workloads to instances and virtual machines (VMs) across hybrid and multicloud environments. The second post of the series covered deploying containerized applications in a multicloud environment. Finally, in this post, we explore how organizations can deploy modern, cloud-native, serverless applications across multiple cloud platforms. Figure 1 shows the solution that we walk through in the post.

Figure 1 – Architecture diagram

The post walks through how to develop, deploy, and test an HTTP RESTful API on Azure Functions using Amazon CodeCatalyst. The solution covers the following steps:

  • Set up a CodeCatalyst development environment and develop your application using the Serverless Framework.
  • Build a CodeCatalyst workflow to test and then deploy to Azure Functions using GitHub Actions in Amazon CodeCatalyst.

An Amazon CodeCatalyst workflow is an automated procedure that describes how to build, test, and deploy your code as part of a continuous integration and continuous delivery (CI/CD) system. You can use GitHub Actions alongside native CodeCatalyst actions in a CodeCatalyst workflow.

Prerequisites

Walkthrough

In this post, we will create a hello world RESTful API using the Serverless Framework. As we progress through the solution, we will focus on building a CodeCatalyst workflow that deploys and tests the functionality of the application. At the end of the post, the workflow will look similar to the one shown in Figure 2.

 CodeCatalyst CI/CD workflow

Figure 2 – CodeCatalyst CI/CD workflow

Environment Setup

Before we start developing the application, we need to set up a CodeCatalyst project and then link a code repository to the project. The code repository can be a CodeCatalyst repository or GitHub. In this scenario, we use a GitHub repository. By the time we finish developing the solution, the repository should look as shown below.

Files in solution's GitHub repository

Figure 3 – Files in GitHub repository

In Amazon CodeCatalyst, there’s an option to create Dev Environments, which can be used to work on the code stored in the source repositories of a project. In this post, we create a Dev Environment, associate it with the source repository created above, and work from it. Alternatively, you can choose not to use a Dev Environment and instead run the following commands locally and commit to the repository. The /projects directory of a Dev Environment stores the files that are pulled from the source repository. In the Dev Environment, install the Serverless Framework using this command:

npm install -g serverless

and then initialize a serverless project in the source repository folder. The resulting project structure looks like the following:

├── README.md
├── host.json
├── package.json
├── serverless.yml
└── src
    └── handlers
        ├── goodbye.js
        └── hello.js
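
For reference, a sketch of one way to produce this structure is the Serverless Framework’s azure-nodejs starter template, which generates the same files; the template name is an assumption, because the post doesn’t specify how the project was initialized:

serverless create --template azure-nodejs   # run inside the cloned repository folder; template name is an assumption
npm install                                 # install the dependencies declared in package.json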

We can push the code to the CodeCatalyst project using git. Now that we have the code in CodeCatalyst, we can turn our focus to building the workflow using the CodeCatalyst console.

CI/CD Setup in CodeCatalyst

Configure access to the Azure Environment

We’ll use the Serverless GitHub action to create and manage the Azure Function. For the action to access the Azure environment, it requires credentials associated with a Service Principal, passed to the action as environment variables.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Storing these values in plaintext anywhere in your repository should be avoided, because anyone with access to the repository can see them. Similarly, these values shouldn’t be used directly in any workflow definitions, because they would be visible as files in your repository. With CodeCatalyst, we can protect these values by storing them as secrets within the project, and then reference the secrets in the CI/CD workflow.

We can create a secret by choosing Secrets (1) under CI/CD and then choosing Create Secret (2), as shown in Figure 4. Now we can enter the secret name and value for each of the identifiers described above.

Figure 4 – CodeCatalyst Secrets

Building the workflow

To create a new workflow, select CI/CD from the navigation on the left and then select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.

Create CodeCatalyst CI/CD workflow

Figure 5 – Create CI/CD workflow

If the workflow editor opens in YAML mode, select Visual to open the visual designer. Now, we can start adding actions to the workflow.

Configure the Deploy action

We’ll begin by adding a GitHub action for deploying to Azure. Select “+ Actions” to open the actions list and choose GitHub from the drop-down menu. Find the Build action and choose “+” to add a new GitHub action to the workflow.

Next, configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Deploy to Azure Functions
  uses: serverless/github-action@v3.2 # action version tag is an assumption; pin to the release you use
  with:
    args: -c "serverless plugin install --name serverless-azure-functions && serverless deploy"
    entrypoint: /bin/sh
  env:
    AZURE_SUBSCRIPTION_ID: ${Secrets.SUBSCRIPTION_ID}
    AZURE_TENANT_ID: ${Secrets.TENANT_ID}
    AZURE_CLIENT_ID: ${Secrets.CLIENT_ID}
    AZURE_CLIENT_SECRET: ${Secrets.CLIENT_SECRET}

The above workflow configuration uses the Serverless GitHub Action, which wraps the Serverless Framework to run serverless commands. The action is configured to package and deploy the source code to Azure Functions using the serverless deploy command.

Note how we were able to pass the secrets to the GitHub action by referencing the secret identifiers in the above configuration.

Configure the Test action

Similar to the previous step, we add another GitHub action, which uses the Serverless Framework’s serverless invoke command to test the API deployed to Azure Functions.

- name: Test Function
  uses: serverless/github-action@v3.2 # action version tag is an assumption; pin to the release you use
  with:
    args: |
      -c "serverless plugin install --name serverless-azure-functions && \
          serverless invoke -f hello -d '{\"name\": \"CodeCatalyst\"}' && \
          serverless invoke -f goodbye -d '{\"name\": \"CodeCatalyst\"}'"
    entrypoint: /bin/sh
  env:
    AZURE_SUBSCRIPTION_ID: ${Secrets.SUBSCRIPTION_ID}
    AZURE_TENANT_ID: ${Secrets.TENANT_ID}
    AZURE_CLIENT_ID: ${Secrets.CLIENT_ID}
    AZURE_CLIENT_SECRET: ${Secrets.CLIENT_SECRET}

The workflow is now ready. Validate it by choosing Validate, then save it to the repository by choosing Commit. The workflow automatically kicks off after the commit, and the application is deployed to Azure Functions.

The functionality of the API can now be verified from the logs of the test action of the workflow as shown in Figure 6.

Test action in CodeCatalyst CI/CD workflow

Figure 6 – CI/CD workflow Test action
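
You can also exercise the deployed endpoints directly. A minimal check might look like the following, where the function app name is a placeholder (the serverless-azure-functions plugin generates a name prefixed with sls by default) and the query-string handling assumes the standard template handlers:

curl "https://<FUNCTION_APP_NAME>.azurewebsites.net/api/hello?name=CodeCatalyst"
curl "https://<FUNCTION_APP_NAME>.azurewebsites.net/api/goodbye?name=CodeCatalyst"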

Cleanup

If you have been following along with this workflow, delete the resources you deployed so you do not continue to incur charges. First, delete the Azure Function App (usually prefixed ‘sls’) using the Azure console. Second, if you no longer need the project, delete it from CodeCatalyst by navigating to Project settings and choosing Delete project; there’s no cost associated with keeping the CodeCatalyst project, so you can also continue using it.
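
If you prefer the command line for the first step, the Azure CLI equivalent is shown below; the app and resource group names are placeholders for your own values:

az functionapp delete --name <FUNCTION_APP_NAME> --resource-group <RESOURCE_GROUP>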

Conclusion

In summary, this post highlighted how Amazon CodeCatalyst can help organizations deploy cloud-native, serverless workloads into a multicloud environment. The post also walked through the solution in detail, covering the process of setting up Amazon CodeCatalyst to deploy a serverless application to Azure Functions by leveraging GitHub Actions. Though we showed an application deployment to Azure Functions, you can follow a similar process and leverage CodeCatalyst to deploy any type of application to almost any cloud platform. Learn more and get started with your Amazon CodeCatalyst journey!

We would love to hear your thoughts and experiences on deploying serverless applications to multiple cloud platforms. Reach out to us if you have any questions, or provide your feedback in the comments section.

About Authors

Picture of Deepak

Deepak Kovvuri

Deepak Kovvuri is a Senior Solutions Architect at AWS supporting enterprise customers in the US East area. He has over 6 years of experience in helping customers architect a DevOps strategy for their cloud workloads. Deepak specializes in CI/CD, systems administration, infrastructure as code, and container services. He holds a Master’s in Computer Engineering from the University of Illinois at Chicago.

Picture of Amandeep

Amandeep Bajwa

Amandeep Bajwa is a Senior Solutions Architect at AWS supporting Financial Services enterprises. He helps organizations achieve their business outcomes by identifying the appropriate cloud transformation strategy based on industry trends, and organizational priorities. Some of the areas Amandeep consults on are cloud migration, cloud strategy (including hybrid & multicloud), digital transformation, data & analytics, and technology in general.

Picture of Brian

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Picture of Pawan

Pawan Shrivastava

Pawan Shrivastava is a Partner Solution Architect at AWS in the WWPS team. He focusses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and designing solutions to meet their specific needs. Pawan is passionate about DevOps, automation and CI CD pipelines. He enjoys watching MMA, playing cricket and working out in the Gym.

Deploy container applications in a multicloud environment using Amazon CodeCatalyst

Post Syndicated from Pawan Shrivastava original https://aws.amazon.com/blogs/devops/deploy-container-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

In the previous post of this blog series, we saw how organizations can deploy workloads to virtual machines (VMs) in a hybrid and multicloud environment. This post shows how organizations can address the requirement of deploying containers, and containerized applications to hybrid and multicloud platforms using Amazon CodeCatalyst. CodeCatalyst is an integrated DevOps service which enables development teams to collaborate on code, and build, test, and deploy applications with continuous integration and continuous delivery (CI/CD) tools.

One prominent scenario where multicloud container deployment is useful is when organizations want to leverage AWS’ broadest and deepest set of Artificial Intelligence (AI) and Machine Learning (ML) capabilities by developing and training AI/ML models in AWS using Amazon SageMaker, and deploying the model package to a Kubernetes platform on other cloud platforms, such as Azure Kubernetes Service (AKS) for inference. As shown in this workshop for operationalizing the machine learning pipeline, we can train an AI/ML model, push it to Amazon Elastic Container Registry (ECR) as an image, and later deploy the model as a container application.

Scenario description

The solution described in the post covers the following steps:

  • Set up the Amazon CodeCatalyst environment.
  • Create a Dockerfile along with a manifest for the application, and a repository in Amazon ECR.
  • Create an Azure service principal that has permissions to deploy resources to Azure Kubernetes Service (AKS), and store the credentials securely in an Amazon CodeCatalyst secret.
  • Create a CodeCatalyst workflow to build, test, and deploy the containerized application to the AKS cluster using GitHub Actions.

The architecture diagram for the scenario is shown in Figure 1.

Solution architecture diagram

Figure 1 – Solution Architecture

Solution Walkthrough

This section shows how to set up the environment and deploy an HTML application to an AKS cluster.

Setup Amazon ECR and GitHub code repository

Create a new Amazon ECR repository and a code repository. In this case, we’re using GitHub as the repository, but you can create a source repository in CodeCatalyst, or you can choose to link an existing source repository hosted by another service if that service is supported by an installed extension. Then follow the application and Docker image creation steps outlined in Step 1 of the environment creation process in Exposing Multiple Applications on Amazon EKS. Create a file named manifest.yaml as shown, and map the image parameter to the URL of the Amazon ECR repository created above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multicloud-container-deployment-app
  labels:
    app: multicloud-container-deployment-app
spec:
  selector:
    matchLabels:
      app: multicloud-container-deployment-app
  replicas: 2
  template:
    metadata:
      labels:
        app: multicloud-container-deployment-app
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": linux
      containers:
      - name: ecs-web-page-container
        image: <aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository>
        imagePullPolicy: Always
        ports:
            - containerPort: 80
        resources:
          limits:
            memory: "100Mi"
            cpu: "200m"
      imagePullSecrets:
          - name: ecrsecret
---
apiVersion: v1
kind: Service
metadata:
  name: multicloud-container-deployment-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: multicloud-container-deployment-app

Push the files to the GitHub code repository. The multicloud-container-app GitHub repository should look similar to Figure 2.

Files in multicloud container app github repository 

Figure 2 – Files in Github repository

Configure Azure Kubernetes Service (AKS) cluster to pull private images from ECR repository

Enable your AKS cluster to pull Docker images from the private ECR repository by running the following command. This setup is required by the azure/k8s-deploy GitHub Action in the CI/CD workflow. The command authenticates Docker to the Amazon ECR registry by using aws ecr get-login-password. Run it in a shell where the AWS CLI is configured and that is connected to the AKS cluster. This creates a secret called ecrsecret, which is used to pull an image from the private ECR repository.

kubectl create secret docker-registry ecrsecret \
 --docker-server=<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository> \
 --docker-username=AWS \
 --docker-password=$(aws ecr get-login-password --region us-west-2)

Provide the ECR URI in the “--docker-server=” parameter.
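
To confirm that the pull secret exists in the cluster before the workflow runs, you can describe it; this assumes the default namespace, which is where the workflow deploys the application:

kubectl describe secret ecrsecret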

CodeCatalyst setup

Follow these steps to set up CodeCatalyst environment:

Configure access to the AKS cluster

In this solution, we use three GitHub Actions (azure/login, azure/aks-set-context, and azure/k8s-deploy) to log in to Azure, set the AKS cluster context, and deploy the manifest file to the AKS cluster, respectively. For the GitHub Actions to access the Azure environment, they require credentials associated with an Azure Service Principal.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Create the Service Principal by running the following command in the Azure Cloud Shell:

az ad sp create-for-rbac \
    --name "ghActionHTMLapplication" \
    --scope /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP> \
    --role Contributor \
    --sdk-auth

The command generates a JSON output (shown in Figure 3), which is stored in a CodeCatalyst secret called AZURE_CREDENTIALS. This credential is used by the azure/login GitHub Action.

JSON output stored in AZURE-CREDENTIALS secret

Figure 3 – JSON output

Configure secrets inside CodeCatalyst Project

Create three secrets, CLUSTER_NAME (the name of the AKS cluster), RESOURCE_GROUP (the name of the Azure resource group), and AZURE_CREDENTIALS (described in the previous step), as described in the working with secrets documentation. The secrets are shown in Figure 4.

Secrets in CodeCatalyst

Figure 4 – CodeCatalyst Secrets

CodeCatalyst CI/CD Workflow

To create a new CodeCatalyst workflow, select CI/CD from the navigation on the left and select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.

Create CodeCatalyst CI/CD workflow

Figure 5 – Create CodeCatalyst CI/CD workflow

Add “Push to Amazon ECR” Action

Add the Push to Amazon ECR action, and configure the environment where you created the ECR repository as shown in Figure 6. Refer to adding an action to learn how to add CodeCatalyst action.

Create ‘Push to ECR’ CodeCatalyst Action

Figure 6 – Create ‘Push to ECR’ Action

Select the Configuration tab and specify the configurations as shown in Figure 7.

Configure ‘Push to ECR’ CodeCatalyst Action

Figure 7 – Configure ‘Push to ECR’ Action

Configure the Deploy action

1. Add a GitHub action for deploying to AKS as shown in Figure 8.

Github action to deploy to AKS

Figure 8 – Github action to deploy to AKS

2. Configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Install Azure CLI
  run: pip install azure-cli
- name: Azure login
  id: login
  uses: azure/login@v1 # version tag is an assumption; pin to the release you use
  with:
    creds: ${Secrets.AZURE_CREDENTIALS}
- name: Set AKS context
  id: set-context
  uses: azure/aks-set-context@v3
  with:
    resource-group: ${Secrets.RESOURCE_GROUP}
    cluster-name: ${Secrets.CLUSTER_NAME}
- name: Setup kubectl
  id: install-kubectl
  uses: azure/setup-kubectl@v3
- name: Deploy to AKS
  id: deploy-aks
  uses: Azure/k8s-deploy@v4
  with:
    namespace: default
    manifests: manifest.yaml
    pull-images: true

Github action configuration for deploying application to AKS

Figure 9 – Github action configuration

3. The workflow is now ready and can be validated by choosing ‘Validate’ and then saved to the repository by choosing ‘Commit’.
We have implemented an automated CI/CD workflow that builds the container image of the application (refer to Figure 10), pushes the image to Amazon ECR, and deploys the application to the AKS cluster. This CI/CD workflow is triggered when application code is pushed to the repository.

Automated CI/CD workflow

Figure 10 – Automated CI/CD workflow

Test the deployment

When the HTML application runs, Kubernetes exposes the application using a public-facing load balancer. To find the external IP of the load balancer, connect to the AKS cluster and run the following command:

kubectl get service multicloud-container-deployment-service

The output of the above command should look like the image in Figure 11.

Output of kubectl get service command

Figure 11 – Output of kubectl get service

Paste the External IP into a browser to see the running HTML application as shown in Figure 12.

HTML application running successfully in AKS

Figure 12 – Application running in AKS
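
As a quick alternative to the browser check, you can also request the page from the command line, substituting the EXTERNAL-IP value returned by the previous kubectl command:

curl http://<EXTERNAL-IP>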

Cleanup

If you have been following along with the workflow described in the post, delete the resources you deployed so you do not continue to incur charges. First, delete the Amazon ECR repository using the AWS console. Second, if you no longer need the project, delete it from CodeCatalyst by navigating to Project settings and choosing Delete project; there’s no cost associated with keeping the CodeCatalyst project, so you can also continue using it. Finally, if you deployed the application on a new AKS cluster, delete the cluster from the Azure console. If you deployed the application to an existing AKS cluster, run the following commands to delete the application resources.

kubectl delete deployment multicloud-container-deployment-app
kubectl delete services multicloud-container-deployment-service

Conclusion

In summary, this post showed how Amazon CodeCatalyst can help organizations deploy containerized workloads in a hybrid and multicloud environment. It demonstrated in detail how to set up and configure Amazon CodeCatalyst to deploy a containerized application to Azure Kubernetes Service by leveraging a CodeCatalyst workflow and GitHub Actions. Learn more and get started with your Amazon CodeCatalyst journey!

If you have any questions or feedback, leave them in the comments section.

About Authors

Picture of Pawan

Pawan Shrivastava

Pawan Shrivastava is a Partner Solution Architect at AWS in the WWPS team. He focusses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and designing solutions to meet their specific needs. Pawan is passionate about DevOps, automation and CI CD pipelines. He enjoys watching MMA, playing cricket and working out in the gym.

Picture of Brent

Brent Van Wynsberge

Brent Van Wynsberge is a Solutions Architect at AWS supporting enterprise customers. He accelerates the cloud adoption journey for organizations by aligning technical objectives to business outcomes and strategic goals, and defining them where needed. Brent is an IoT enthusiast, specifically in the application of IoT in manufacturing; he is also interested in DevOps, data analytics, and containers.

Picture of Amandeep

Amandeep Bajwa

Amandeep Bajwa is a Senior Solutions Architect at AWS supporting Financial Services enterprises. He helps organizations achieve their business outcomes by identifying the appropriate cloud transformation strategy based on industry trends, and organizational priorities. Some of the areas Amandeep consults on are cloud migration, cloud strategy (including hybrid & multicloud), digital transformation, data & analytics, and technology in general.

Picture of Brian

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Extend your data mesh with Amazon Athena and federated views

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/extend-your-data-mesh-with-amazon-athena-and-federated-views/

Amazon Athena is a serverless, interactive analytics service built on the Trino, PrestoDB, and Apache Spark open-source frameworks. You can use Athena to run SQL queries on petabytes of data stored on Amazon Simple Storage Service (Amazon S3) in widely used formats such as Parquet and open-table formats like Apache Iceberg, Apache Hudi, and Delta Lake. However, Athena also allows you to query data stored in 30 different data sources—in addition to Amazon S3—including relational, non-relational, and object stores running on premises or in other cloud environments.

In Athena, we refer to queries on non-Amazon S3 data sources as federated queries. These queries run on the underlying database, which means you can analyze the data without learning a new query language and without the need for separate extract, transform, and load (ETL) scripts to extract, duplicate, and prepare data for analysis.

Recently, Athena added support for creating and querying views on federated data sources to bring greater flexibility and ease of use to use cases such as interactive analysis and business intelligence reporting. Athena also updated its data connectors with optimizations that improve performance and reduce cost when querying federated data sources. The updated connectors use dynamic filtering and an expanded set of predicate pushdown optimizations to perform more operations in the underlying data source rather than in Athena. As a result, you get faster queries with less data scanned, especially on tables with millions to billions of rows of data.

In this post, we show how to create and query views on federated data sources in a data mesh architecture featuring data producers and consumers.

The term data mesh refers to a data architecture with decentralized data ownership. A data mesh equips domain-oriented teams with the data they need, emphasizes self-service, and promotes the notion of purpose-built data products. In a data mesh, data producers expose datasets to the organization and data consumers subscribe to and consume the data products created by producers. By distributing data ownership to cross-functional teams, a data mesh can foster a culture of collaboration, invention, and agility around data.

Let’s dive into the solution.

Solution overview

For this post, imagine a hypothetical ecommerce company that uses multiple data sources, each playing a different role:

  • In an S3 data lake, ecommerce records are stored in a table named Lineitems
  • Amazon ElastiCache for Redis stores Nations and ActiveOrders data, ensuring ultra-fast reads of operational data by downstream ecommerce systems
  • On Amazon Relational Database Service (Amazon RDS), MySQL is used to store data like email addresses and shipping addresses in the Orders, Customer, and Suppliers tables
  • For flexibility and low-latency reads and writes, an Amazon DynamoDB table holds Part and Partsupp data

We want to query these data sources in a data mesh design. In the following sections, we set up Athena data source connectors for MySQL, DynamoDB, and Redis, and then run queries that perform complex joins across these data sources. The following diagram depicts our data architecture.

Architecture diagram

As you proceed with this solution, note that you will create AWS resources in your account. We have provided you with an AWS CloudFormation template that defines and configures the required resources, including the sample MySQL database, S3 tables, Redis store, and DynamoDB table. The template also creates the AWS Glue database and tables, S3 bucket, Amazon S3 VPC endpoint, AWS Glue VPC endpoint, and other AWS Identity and Access Management (IAM) resources that are used in the solution.

The template is designed to demonstrate how to use federated views in Athena, and is not intended for production use without modification. Additionally, the template uses the us-east-1 Region and will not work in other Regions without modification. The template creates resources that incur costs while they are in use. Follow the cleanup steps at the end of this post to delete the resources and avoid unnecessary charges.

Prerequisites

Before you launch the CloudFormation stack, ensure you have the following prerequisites:

  • An AWS account that provides access to AWS services
  • An IAM user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI), and permissions to create an IAM role, IAM policies, and stacks in AWS CloudFormation

Create resources with AWS CloudFormation

To get started, complete the following steps:

  1. Choose Launch Stack: Cloudformation Launch Stack
  2. Select I acknowledge that this template may create IAM resources.

The CloudFormation stack takes approximately 20–30 minutes to complete. You can monitor its progress on the AWS CloudFormation console. When the status reads CREATE_COMPLETE, your AWS account will have the resources necessary to implement this solution.
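
If you prefer to watch the stack from the AWS CLI instead of the console, the following commands work, assuming the stack uses the name athena-federated-view-blog that the cleanup section of this post refers to:

aws cloudformation wait stack-create-complete --stack-name athena-federated-view-blog
aws cloudformation describe-stacks --stack-name athena-federated-view-blog --query 'Stacks[0].StackStatus'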

Deploy connectors and connect to data sources

With our resources provisioned, we can begin to connect the dots in our data mesh. Let’s start by connecting the data sources created by the CloudFormation stack with Athena.

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Data sources, select MySQL, then choose Next.
  4. For Data source name, enter a name, such as mysql. The Athena connector for MySQL is an AWS Lambda function that was created for you by the CloudFormation template.
  5. For Connection details, choose Select or enter a Lambda function.
  6. Choose mysql, then choose Next.
  7. Review the information and choose Create data source.
  8. Return to the Data sources page and choose mysql.
  9. On the connector details page, choose the link under Lambda function to access the Lambda console and inspect the function associated with this connector.
    mysql Data Source details
  10. Return to the Athena query editor.
  11. For Data source, choose mysql.
  12. For Database, choose the sales database.
  13. For Tables, you should see a listing of MySQL tables that are ready for you to query.
  14. Repeat these steps to set up the connectors for DynamoDB and Redis.

After all four data sources are configured, we can see the data sources on the Data source drop-down menu. All other databases and tables, like the lineitem table, which is stored on Amazon S3, are defined in the AWS Glue Data Catalog and can be accessed by choosing AwsDataCatalog as the data source.

This image shows AwsDataCatalog is being selected as Data Source

Analyze data with Athena

With our data sources configured, we are ready to start running queries and using federated views in a data mesh architecture. Let’s start by trying to find out how much profit was made on a given line of parts, broken out by supplier nation and year.

For such a query, we need to calculate, for each nation and year, the profit for parts ordered in each year that were filled by a supplier in each nation. Profit is defined as the sum of [(l_extendedprice*(1-l_discount)) - (ps_supplycost * l_quantity)] for all line items describing parts in the specified line.

Answering this question requires querying all four data sources—MySQL, DynamoDB, Redis, and Amazon S3—and is accomplished with the following SQL:

SELECT 
    n_name nation,
	year(CAST(o_orderdate AS date)) as o_year,
	((l_extendedprice * (1 - l_discount)) - (CAST(ps_supplycost AS double) * l_quantity)) as amount
FROM
    awsdatacatalog.data_lake.lineitem,
    dynamo.default.part,
    dynamo.default.partsupp,
    mysql.sales.supplier,
    mysql.sales.orders,
    redis.redis.nation
WHERE 
    ((s_suppkey = l_suppkey)
    AND (ps_suppkey = l_suppkey)
	AND (ps_partkey = l_partkey)
	AND (p_partkey = l_partkey)
	AND (o_orderkey = l_orderkey)
	AND (s_nationkey = CAST(Regexp_extract(_key_, '.*-(.*)', 1) AS int)))

Running this query on the Athena console produces the following result.

Result of above query

This query is fairly complex: it involves multiple joins and requires special knowledge of the correct way to calculate profit metrics that other end-users may not possess.

To simplify the analysis experience for those users, we can hide this complexity behind a view. For more information on using views with federated data sources, see Querying federated views.

Use the following query to create the view in the data_lake database under the AwsDataCatalog data source:

CREATE OR REPLACE VIEW "data_lake"."federated_view" AS
SELECT 
    n_name nation,
	year(CAST(o_orderdate AS date)) as o_year,
	((l_extendedprice * (1 - l_discount)) - (CAST(ps_supplycost AS double) * l_quantity)) as amount
FROM
    awsdatacatalog.data_lake.lineitem,
    dynamo.default.part,
    dynamo.default.partsupp,
    mysql.sales.supplier,
    mysql.sales.orders,
    redis.redis.nation
WHERE 
    ((s_suppkey = l_suppkey)
    AND (ps_suppkey = l_suppkey)
	AND (ps_partkey = l_partkey)
	AND (p_partkey = l_partkey)
	AND (o_orderkey = l_orderkey)
	AND (s_nationkey = CAST(Regexp_extract(_key_, '.*-(.*)', 1) AS int)))

Next, run a simple select query to validate the view was created successfully: SELECT * FROM federated_view limit 10

The result should be similar to our previous query.

With our view in place, we can perform new analyses to answer questions that would be challenging without the view due to the complex query syntax that would be required. For example, we can find the total profit by nation:

SELECT nation, sum(amount) AS total
from federated_view
GROUP BY nation 
ORDER BY nation ASC

Your results should resemble the following screenshot.

Result of above query

As you now see, the federated view makes it simpler for end-users to run queries on this data. Users are free to query a view of the data, defined by a knowledgeable data producer, rather than having to first acquire expertise in each underlying data source. Because Athena federated queries are processed where the data is stored, with this approach, we avoid duplicating data from the source system, saving valuable time and cost.

Use federated views in a multi-user model

So far, we have satisfied one of the principles of a data mesh: we created a data product (federated view) that is decoupled from its originating source and is available for on-demand analysis by consumers.

Next, we take our data mesh a step further by using federated views in a multi-user model. To keep it simple, assume we have one producer account (the account we used to create our four data sources and federated view) and one consumer account. Using the producer account, we give the consumer account permission to query the federated view.

The following figure depicts this setup and our simplified data mesh architecture.

Multi-user model setup

Follow these steps to share the connectors and AWS Glue Data Catalog resources from the producer, which includes our federated view, with the consumer account:

  1. Share the data sources mysql, redis, dynamo, and data_lake with the consumer account. For instructions, refer to Sharing a data source in Account A with Account B. Note that Account A represents the producer and Account B represents the consumer. Make sure you use the same data source names from earlier when sharing data. This is necessary for the federated view to work in a cross-account model.
  2. Next, share the producer account’s AWS Glue Data Catalog with the consumer account by following the steps in Cross-account access to AWS Glue data catalogs. For the data source name, use shared_federated_catalog.
  3. Switch to the consumer account, navigate to the Athena console, and verify that you see federated_view listed under Views in the shared_federated_catalog Data Catalog and data_lake database.
  4. Next, run a sample query on the shared view to see the query results.

Result of sample query

Clean up

To clean up the resources created for this post, complete the following steps:

  1. On the Amazon S3 console, empty the bucket athena-federation-workshop-<account-id>.
  2. If you’re using the AWS CLI, delete the objects in the athena-federation-workshop-<account-id> bucket with the following code. Make sure you run this command on the correct bucket.
    aws s3 rm s3://athena-federation-workshop-<account-id> --recursive
  3. On the AWS CloudFormation console or the AWS CLI, delete the stack athena-federated-view-blog.
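
For the AWS CLI route mentioned in the last step, a minimal sketch is:

aws cloudformation delete-stack --stack-name athena-federated-view-blog
aws cloudformation wait stack-delete-complete --stack-name athena-federated-view-blog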

Summary

In this post, we demonstrated the functionality of Athena federated views. We created a view spanning four different federated data sources and ran queries against it. We also saw how federated views could be extended to a multi-user data mesh and ran queries from a consumer account.

To take advantage of federated views, ensure you are using Athena engine version 3 and upgrade your data source connectors to the latest version available. For information on how to upgrade a connector, see Updating a data source connector.


About the Authors

Saurabh Bhutyani is a Principal Big Data Specialist Solutions Architect at AWS. He is passionate about new technologies. He joined AWS in 2019 and works with customers to provide architectural guidance for running scalable analytics solutions and data mesh architectures using AWS analytics services like Amazon EMR, Amazon Athena, AWS Glue, AWS Lake Formation, and Amazon DataZone.

Pathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Find the best Amazon Redshift configuration for your workload using Redshift Test Drive

Post Syndicated from Sathiish Kumar original https://aws.amazon.com/blogs/big-data/find-the-best-amazon-redshift-configuration-for-your-workload-using-redshift-test-drive/

Amazon Redshift is a widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. With the launch of Amazon Redshift Serverless and the various deployment options Amazon Redshift provides (such as instance types and cluster sizes), customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Redshift workload.

In this post, we answer that question by using Redshift Test Drive, an open-source tool that lets you evaluate which data warehouse configuration options are best suited for your workload. We created Redshift Test Drive from SimpleReplay and redshift-config-compare (see Compare different node types for your workload using Amazon Redshift for more details) to provide a single entry point for finding the best Amazon Redshift configuration for your workload. Redshift Test Drive also provides additional features such as a self-hosted analysis UI and the ability to replicate external objects that a Redshift workload may interact with.

Amazon Redshift RA3 with managed storage is the newest instance type for Provisioned clusters. It allows you to scale and pay for compute and storage independently, as well as use advanced features such as cross-cluster data sharing and cross-Availability Zone cluster relocation. Many customers using previous generation instance types want to upgrade their clusters to RA3 instance types. In this post, we show you how to use Redshift Test Drive to evaluate the performance of an RA3 cluster configuration for your Redshift workloads.

Solution overview

At its core, Redshift Test Drive replicates a workload by extracting queries from the source Redshift data warehouse logs (shown as Workload Extractor in the following figure) and replays the extracted workload against the target Redshift data warehouses (Workload Replayer).

If these workloads interact with external objects via Amazon Redshift Spectrum (such as the AWS Glue Data Catalog) or COPY commands, Redshift Test Drive offers an external object replicator utility to clone these objects to facilitate replay.

Workload replicator architecture

Redshift Test Drive uses this process of workload replication for two main functionalities: comparing configurations and comparing replays.

Compare Amazon Redshift configurations

Redshift Test Drive’s ConfigCompare utility (based on the redshift-config-compare tool) helps you find the best Redshift data warehouse configuration by using your workload to run performance and functional tests on different configurations in parallel. The utility’s automation starts by creating a new AWS CloudFormation stack based on this CloudFormation template. The CloudFormation stack creates an AWS Step Functions state machine, which internally uses AWS Lambda functions to trigger AWS Batch jobs that run the workload comparison across different Redshift instance types. These jobs extract the workload from the source Redshift data warehouse log location for the specified workload time window (as provided in the config parameters) and then replay the extracted workload against each of the target Redshift data warehouse configurations listed in the configuration file. When the replay is complete, the Step Functions state machine uploads the performance stats for the target configurations to an Amazon Simple Storage Service (Amazon S3) bucket and creates external schemas that can then be queried from any Redshift target to identify a target configuration that meets your performance requirements.

The following diagram illustrates the architecture of this utility.

Architecture of ConfigCompare utility

Compare replay performance

Redshift Test Drive also provides the ability to compare the replay runs visually using a self-hosted UI tool. This tool reads the stats generated by the workload replicator (stored in Amazon S3) and helps compare the replay runs across key performance indicators such as longest running queries, error distribution, queries with most deviation of latency across runs, and more.

The following diagram illustrates the architecture for the UI.

Replay Performance analysis UI architecture

Walkthrough overview

In this post, we provide a step-by-step walkthrough of using Redshift Test Drive to automatically replay your workload against different Amazon Redshift configurations with the ConfigCompare utility. Subsequently, we use the self-hosted analysis UI utility to analyze the output of ConfigCompare for determining the optimal target warehouse configuration to migrate or upgrade. The following diagram illustrates the workflow.

Walkthrough Steps

Prerequisites

The following prerequisites should be addressed before we run the ConfigCompare utility:

  • Enable audit logging and user-activity logging in your source cluster.
  • Take a snapshot of the source Redshift data warehouse.
  • Export your source parameter group and WLM configurations to Amazon S3. The parameter group can be exported using the AWS Command Line Interface (AWS CLI), for example, using CloudShell, by running the following code:
    aws redshift describe-cluster-parameters --parameter-group-name <YOUR-SOURCE-CLUSTER-PARAMETER-GROUP-NAME> --output json >> param_group_src.json
    
    aws s3 cp param_group_src.json s3://<YOUR-BUCKET-NAME>/param_group_src.json

  • The WLM configurations can be copied as JSON in the console, from where you can enter them into a file and upload it to Amazon S3. If you want to test any alternative WLM configurations (such as comparing manual vs. auto WLM or enabling concurrency scaling), you can create a separate file with that target configuration and upload it to Amazon S3 as well.
  • Identify the target configurations you want to test. If you’re upgrading from DC2 to RA3 node types, refer to Upgrading to RA3 node types for recommendations.
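If you want to script these prerequisites, the following AWS CLI sketch covers the first two items plus the user-activity logging parameter. The cluster identifier, bucket name, and parameter group name are placeholders for your own values, and it assumes a provisioned source cluster:

# Enable audit logging to Amazon S3 for the source cluster (identifiers are placeholders).
aws redshift enable-logging \
  --cluster-identifier <YOUR-SOURCE-CLUSTER-ID> \
  --bucket-name <YOUR-AUDIT-LOG-BUCKET> \
  --s3-key-prefix RSLogs/

# Turn on user-activity logging through the cluster's parameter group
# (the change takes effect after the cluster reboots).
aws redshift modify-cluster-parameter-group \
  --parameter-group-name <YOUR-SOURCE-CLUSTER-PARAMETER-GROUP-NAME> \
  --parameters ParameterName=enable_user_activity_logging,ParameterValue=true

# Take a snapshot of the source data warehouse; the snapshot identifier
# is later referenced as SNAPSHOT_ID in the utility's JSON configuration.
aws redshift create-cluster-snapshot \
  --cluster-identifier <YOUR-SOURCE-CLUSTER-ID> \
  --snapshot-identifier redshift-cluster-manual-snapshot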

For this walkthrough, let’s assume you have an existing Redshift data warehouse configuration with a two-node dc2.8xlarge provisioned cluster. You want to validate whether upgrading your current configuration to a decoupled architecture using the RA3 provisioned node type or Redshift Serverless would meet your workload price/performance requirements.

The following table summarizes the Redshift data warehouse configurations that are evaluated as part of this test.

Warehouse Type      | Number of Nodes / Base RPU | Option
dc2.8xlarge         | 2                          | default auto WLM
ra3.4xlarge         | 4                          | default auto WLM
Redshift Serverless | 64                         | auto scaling
Redshift Serverless | 128                        | auto scaling

Run the ConfigCompare utility

Before you run the utility, customize the details of the workload to replay, including the time period and the target warehouse configurations to test, in a JSON file. Upload this file to Amazon S3 and copy the S3 URI path to use as an input parameter for the CloudFormation template that deploys the resources for the remaining orchestration.

You can read more about the individual components and inputs of the JSON file in the Readme.

For our use case, we use the following JSON file as an input to the utility:

{
    "SNAPSHOT_ID": "redshift-cluster-manual-snapshot",
    "SNAPSHOT_ACCOUNT_ID": "123456789012",
    "PARAMETER_GROUP_CONFIG_S3_PATH": "s3://nodeconfig-artifacts/pg_config.json",
    "DDL_AND_COPY_SCRIPT_S3_PATH": "N/A",
    "SQL_SCRIPT_S3_PATH": "N/A",
    "NUMBER_OF_PARALLEL_SESSIONS_LIST": "N/A",
    "SIMPLE_REPLAY_LOG_LOCATION": "s3://redshift-logging-xxxxxxxx/RSLogs/",
    "SIMPLE_REPLAY_EXTRACT_START_TIME": "2023-01-28T15:45:00+00:00",
    "SIMPLE_REPLAY_EXTRACT_END_TIME": "2023-01-28T16:15:00+00:00",
    "SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH": "N/A",
    "SIMPLE_REPLAY_OVERWRITE_S3_PATH": "N/A",
    "SIMPLE_REPLAY_UNLOAD_STATEMENTS": "true",
    "AUTO_PAUSE": true,
    "DATABASE_NAME": "database_name",
    "CONFIGURATIONS": [
        {
            "TYPE": "Provisioned",
            "NODE_TYPE": "dc2.8xlarge",
            "NUMBER_OF_NODES": "6",
            "WLM_CONFIG_S3_PATH": "s3://nodeconfig-artifacts/wlm.json"
        },
        {
            "TYPE": "Provisioned",
            "NODE_TYPE": "ra3.4xlarge",
            "NUMBER_OF_NODES": "12",
            "WLM_CONFIG_S3_PATH": "s3://nodeconfig-artifacts/wlm.json"
        },
        {
            "TYPE": "Serverless",
            "BASE_RPU": "128"
        },
        {
            "TYPE": "Serverless",
            "BASE_RPU": "64"
        }
    ]
}

The utility deploys all the data warehouse configurations included in the CONFIGURATIONS section of the JSON file. A replica of the source configuration is also included to provide a baseline of the existing workload performance.
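If you are working from the command line, you can upload the completed configuration file and note its S3 URI with the AWS CLI; the bucket and file names below are placeholders:

# Upload the completed configuration file; its S3 URI is the value you pass
# to the CloudFormation template in the next step.
aws s3 cp user_config.json s3://<YOUR-CONFIG-BUCKET>/user_config.json
# The resulting URI is s3://<YOUR-CONFIG-BUCKET>/user_config.json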

After this file is fully configured and uploaded to Amazon S3, navigate to the AWS CloudFormation console and create a new stack based on this CloudFormation template, specifying the relevant parameters. For more details on the individual parameters, refer to the GitHub repo. The following screenshot shows the parameters used for this walkthrough.

Configuration parameters for Cloudformation Template

After the parameters are set, proceed through the remaining steps on the AWS CloudFormation console to launch the stack.
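The stack can also be launched from the CLI. This is only a sketch: the template file name and the parameter key ConfigJsonS3Path shown here are hypothetical placeholders, so use the template and parameter names defined in the GitHub repo:

# Launch the ConfigCompare stack from the CLI (template file and parameter key are placeholders).
aws cloudformation create-stack \
  --stack-name redshift-config-compare \
  --template-body file://node-config-compare.yaml \
  --parameters ParameterKey=ConfigJsonS3Path,ParameterValue=s3://<YOUR-CONFIG-BUCKET>/user_config.json \
  --capabilities CAPABILITY_NAMED_IAM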

When the stack is fully created, select the stack and open the Resources tab. Here, you can search for the term StepFunctions and choose the hyperlink next to the RedshiftConfigTestingStepFunction physical ID to open the Step Functions state machine to run the utility.

Searching for ConfigTestingStepFunction

On the Step Functions page that opens, choose Start execution. Leave the default values and choose Start execution to trigger the run. Monitor the progress of the state machine’s run on the graph view of the page. The full run will take approximately the same time as the time window that was specified in the JSON configuration file.
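The same run can be started and monitored from the CLI; the state machine ARN is the one you located on the stack’s Resources tab, and the execution ARN is returned by the start call:

# Start the state machine with its default input.
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:<aws-region>:<aws-account-id>:stateMachine:<RedshiftConfigTestingStepFunction-name>

# Poll the status until it changes from RUNNING to SUCCEEDED.
aws stepfunctions describe-execution \
  --execution-arn <execution-arn-returned-by-start-execution> \
  --query 'status'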

StepFunction Execution example

When the status of the run changes from Running to Succeeded, the run is complete.

Analyze the results

When the Step Functions state machine run is complete, the performance metrics are uploaded to the S3 bucket that the CloudFormation template created earlier. To analyze the performance of the workload across different configurations, you can use the self-hosted UI tool that comes with Redshift Test Drive. Set up this tool for your workload by following the instructions provided in the Readme.

After you point the UI to the S3 location that has the stats from the ConfigCompare run, the Replays section populates with the analysis for the replays found in that location. Select the target configurations you want to compare and choose Analysis to navigate to the comparisons page.

AnalysisUI example list of replays

You can use the Filter Results section to choose which query types, users, and time frame to compare, and the Analysis section expands to show an analysis of all the selected replays. In this example, you can see a comparison of the SELECT queries run by the ad hoc user in each replay.

AnalysisUI filter results

The following screenshot shows an example of the analysis of a replay. These results show the distribution of queries completed over the full run for a given user and query type, allowing us to identify periods of high and low activity. We can also see runtimes of these queries, aggregated as percentiles, average, and standard deviation. For example, the P50 value indicates that 50% of queries ran within 26.564 seconds. The parameters used to filter for specific users, query types, and runtimes can be dynamically updated to allow the results and comparisons to be comprehensively investigated according to the specific performance requirements each individual use case demands.

AnalysisUI compare throughput example

Troubleshooting

As shown in the solution architecture, the main moving parts in the ConfigCompare automation are AWS CloudFormation, Step Functions (internally using Lambda), and AWS Batch.

If any resource in the CloudFormation stack fails to deploy, we recommend troubleshooting the issue based on the error shown on the AWS CloudFormation console.

To troubleshoot errors with the Step Functions state machine, locate the Amazon CloudWatch logs for a step by navigating to the state machine’s latest run on the Step Functions console and choosing CloudWatch Logs for the failed Step Functions step. After resolving the error, you can restart the state machine by choosing New execution.
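If you prefer the CLI, you can list the log groups created by the stack and tail the one for the failing component. The log group name prefix below is a placeholder, because the exact names depend on the stack name you chose:

# Find log groups created for the stack's Lambda functions (prefix is a placeholder).
aws logs describe-log-groups \
  --log-group-name-prefix /aws/lambda/<your-stack-name> \
  --query 'logGroups[].logGroupName'

# Tail a specific log group while re-running the failed step.
aws logs tail <log-group-name> --follow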

Troubleshooting Step Function

For AWS Batch errors, locate the AWS Batch logs by navigating to the AWS CloudFormation console and choosing the Resources tab in the CloudFormation stack. On this tab, search for LogGroup to find the AWS Batch run logs.

Troubleshooting Cloudwatch logs

For more information about common errors and their solutions, refer to the Test Drive Readme.

Clean up

When you have completed the evaluation, we recommend manually deleting the deployed Redshift warehouses to avoid any on-demand charges that could accrue. After this, you can delete the CloudFormation stack to clean up other resources.
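The cleanup can also be done with the AWS CLI; the identifiers below are placeholders for the target warehouses the utility created for your run:

# Delete a provisioned target cluster created for the test.
aws redshift delete-cluster \
  --cluster-identifier <target-cluster-id> \
  --skip-final-cluster-snapshot

# Delete a serverless target workgroup and its namespace.
aws redshift-serverless delete-workgroup --workgroup-name <target-workgroup-name>
aws redshift-serverless delete-namespace --namespace-name <target-namespace-name>

# Finally, delete the CloudFormation stack.
aws cloudformation delete-stack --stack-name <your-stack-name>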

Limitations

Some of the limitations for the WorkloadReplicator (the core utility supporting the ConfigCompare tool) are outlined in the Readme.

Conclusion

In this post, we demonstrated the process of finding the right Redshift data warehouse configuration using Redshift Test Drive. The utility offers an easy-to-use tool to replicate the workload of your choice against customizable data warehouse configurations. It also provides a self-hosted analysis UI to help you dive deeper into the stats generated during the replication process.

Get started with Test Drive today by following the instructions provided in the Readme. For an in-depth overview of the config compare automation, refer to Compare different node types for your workload using Amazon Redshift. If you’re migrating from DC2 or DS2 node types to RA3, refer to our recommendations on node count and type as a benchmark.


About the Authors

Sathiish Kumar is a Software Development Manager at Amazon Redshift and has worked on building end-to-end applications using different database and technology solutions over the last 10 years. He is passionate about helping his customers find the quickest and the most optimized solution to their problems by leveraging open-source technologies.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

Post Syndicated from Praveen Kumar original https://aws.amazon.com/blogs/big-data/implement-tag-based-access-control-for-your-data-lake-and-amazon-redshift-data-sharing-with-aws-lake-formation/

Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Many organizations have distributed tools and infrastructure across various business units. This leads to data residing in many data warehouse and data lake instances, as part of a modern data architecture, in separate AWS accounts.

Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another. Customers want to be able to manage their permissions in a central place across all of their assets. Previously, Redshift datashares could only be managed within Amazon Redshift, which made it difficult to manage your data lake permissions and Amazon Redshift permissions in a single place. For example, you had to navigate to an individual account to view and manage access information for Amazon Redshift and the data lake on Amazon Simple Storage Service (Amazon S3). As an organization grows, administrators want a mechanism to effectively and centrally manage data sharing across data lakes and data warehouses for governance and auditing, and to enforce fine-grained access control.

We recently announced the integration of Amazon Redshift data sharing with AWS Lake Formation. With this feature, Amazon Redshift customers can now manage sharing, apply access policies centrally, and effectively scale permissions by using LF-Tags.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, with Lake Formation support for Amazon Redshift data sharing, it opens up new design patterns and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles. Lake Formation also provides tag-based access control (TBAC), which can be used to simplify and scale governance of data catalog objects such as databases and tables.

In this post, we discuss this new feature and how to implement TBAC for your data lake and Amazon Redshift data sharing on Lake Formation.

Solution overview

Lake Formation tag-based access control (LF-TBAC) allows you to group similar AWS Glue Data Catalog resources together and define grant or revoke permission policies by using an LF-Tag expression. LF-Tags are hierarchical: when a database is tagged with an LF-Tag, all tables in that database inherit the tag, and when an LF-Tag is applied to a table, all the columns within that table inherit the tag. Inherited tags can then be overridden if needed. You can then create access policies within Lake Formation that grant principals access to tagged resources by using an LF-Tag expression. See Managing LF-Tags for metadata access control for more details.
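As a reference, LF-Tags like the ones used in this post can be created with the AWS CLI as shown below. The tag keys and values match the tagging strategy described later; in the walkthrough that follows, the CloudFormation stack takes care of the Lake Formation configuration, so treat this as an illustrative sketch rather than a required step:

# Create the two LF-Tags used by the tagging strategy in this post.
aws lakeformation create-lf-tag \
  --tag-key department \
  --tag-values customer marketing

aws lakeformation create-lf-tag \
  --tag-key classification \
  --tag-values private pii-sensitive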

To demonstrate LF-TBAC with central data access governance capability, we use the scenario where two separate business units own particular datasets and need to share data across teams.

A customer care team manages and owns the customer information database, including customer demographics data. A marketing team owns a customer leads dataset, which includes information on prospective customers and contact leads.

To be able to run effective campaigns, the marketing team needs access to the customer data. In this post, we demonstrate the process of sharing this data that is stored in the data warehouse and giving the marketing team access. Furthermore, there are personally identifiable information (PII) columns within the customer dataset that should only be accessed by a subset of power users on a need-to-know basis. This way, data analysts within marketing can only see non-PII columns to be able to run anonymous customer segment analysis, but a group of power users can access PII columns (for example, customer email address) to be able to run campaigns or surveys for specific groups of customers.

The following diagram shows the structure of the datasets that we work with in this post and a tagging strategy to provide fine-grained column-level access.

Beyond our tagging strategy on the data resources, the following table gives an overview of how we should grant permissions to our two personas via tags.

IAM Role            | Persona                                 | Resource Type  | Permission | LF-Tag Expression
marketing-analyst   | A data analyst in the marketing team    | DB             | describe   | (department:marketing OR department:customer) AND classification:private
                    |                                         | Table          | select     | (department:marketing OR department:customer) AND classification:private
marketing-poweruser | A privileged user in the marketing team | DB             | describe   | (department:marketing OR department:customer) AND classification:private
                    |                                         | Table (Column) | select     | (department:marketing OR department:customer) AND (classification:private OR classification:pii-sensitive)

The following diagram gives a high-level overview of the setup that we deploy in this post.

The following is a high-level overview of how to use Lake Formation to control datashare permissions:

Producer Setup:

  1. In the producer’s AWS account, the Amazon Redshift administrator that owns the customer database creates a Redshift datashare on the producer cluster and grants usage to the AWS Glue Data Catalog in the same account.
  2. The producer cluster administrator authorizes the Lake Formation account to access the datashare.
  3. In Lake Formation, the Lake Formation administrator discovers and registers the datashares. They must discover the AWS Glue ARNs they have access to and associate the datashares with an AWS Glue Data Catalog ARN. If you’re using the AWS Command Line Interface (AWS CLI), you can discover and accept datashares with the Redshift CLI operations describe-data-shares and associate-data-share-consumer (see the discovery sketch after this list). To register a datashare, use the Lake Formation CLI operation register-resource.
  4. The Lake Formation administrator creates a federated database in the AWS Glue Data Catalog; assigns tags to the databases, tables, and columns; and configures Lake Formation permissions to control user access to objects within the datashare. For more information about federated databases in AWS Glue, see Managing permissions for data in an Amazon Redshift datashare.
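As referenced in step 3, the datashare ARNs visible to the account can be listed with a minimal AWS CLI sketch; the JMESPath query simply trims the output to the fields used in later steps:

# List datashares visible to this account, showing the datashare ARN and producer namespace ARN.
aws redshift describe-data-shares \
  --query 'DataShares[].{DataShareArn:DataShareArn,ProducerArn:ProducerArn}'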

Consumer Setup:

  1. On the consumer side (marketing), the Amazon Redshift administrator discovers the AWS Glue database ARNs they have access to, creates an external database in the Redshift consumer cluster using an AWS Glue database ARN, and grants usage to database users authenticated with IAM credentials to start querying the Redshift database.
  2. Database users can use the views SVV_EXTERNAL_TABLES and SVV_EXTERNAL_COLUMNS to find all the tables or columns within the AWS Glue database that they have access to; then they can query the AWS Glue database’s tables.

When the producer cluster administrator decides to no longer share the data with the consumer cluster, the producer cluster administrator can revoke usage, deauthorize, or delete the datashare from Amazon Redshift. The associated permissions and objects in Lake Formation are not automatically deleted.
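A hedged sketch of that teardown from the producer side might look like the following; the ARNs are placeholders, and the Lake Formation permissions and federated database still need to be removed separately:

# Stop sharing the datashare with the Data Catalog consumer.
aws redshift deauthorize-data-share \
  --data-share-arn 'arn:aws:redshift:<aws-region>:<aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds' \
  --consumer-identifier DataCatalog/<aws-account-id>

# Remove the datashare registration from Lake Formation.
aws lakeformation deregister-resource \
  --resource-arn arn:aws:redshift:<aws-region>:<aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds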

Prerequisites:

To follow the steps in this post, you must satisfy the following prerequisites:

Deploy environment including producer and consumer Redshift clusters

To follow along with the steps outlined in this post, deploy the following AWS CloudFormation stack, which includes the necessary resources to demonstrate the subject of this post:

  1. Choose Launch stack to deploy a CloudFormation template.
  2. Provide an IAM role that you have already configured as a Lake Formation administrator.
  3. Complete the steps to deploy the template and leave all settings as default.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.

This CloudFormation stack creates the following resources:

  • Producer Redshift cluster – Owned by the customer care team and has customer and demographic data on it.
  • Consumer Redshift cluster – Owned by the marketing team and is used to analyze data across data warehouses and data lakes.
  • S3 data lake – Contains the web activity and leads datasets.
  • Other necessary resources to demonstrate the process of sharing data – For example, IAM roles, Lake Formation configuration, and more. For a full list of resources created by the stack, examine the CloudFormation template.

After you deploy this CloudFormation template, resources created will incur cost to your AWS account. At the end of the process, make sure that you clean up resources to avoid unnecessary charges.

After the CloudFormation stack is deployed successfully (status shows as CREATE_COMPLETE), take note of the following items on the Outputs tab (you can also retrieve them with the CLI sketch after this list):

  • Marketing analyst role ARN
  • Marketing power user role ARN
  • URL for Amazon Redshift admin password stored in AWS Secrets Manager
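A quick way to retrieve these values from the command line, assuming <your-stack-name> is the name you gave the stack:

# List the stack outputs (role ARNs and the Secrets Manager URL) in a readable table.
aws cloudformation describe-stacks \
  --stack-name <your-stack-name> \
  --query 'Stacks[0].Outputs[].{Key:OutputKey,Value:OutputValue}' \
  --output table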

Create a Redshift datashare and add relevant tables

On the AWS Management Console, switch to the role that you nominated as Lake Formation admin when deploying the CloudFormation template. Then go to Query Editor v2. If this is the first time using Query Editor V2 in your account, follow these steps to configure your AWS account.

The first step in Query Editor is to log in to the customer Redshift cluster using the database admin credentials to make your IAM admin role a DB admin on the database.

  1. Choose the options menu (three dots) next to the lfunified-customer-dwh cluster and choose Create connection.

  2. Select Database user name and password.
  3. Leave Database as dev.
  4. For User name, enter admin.
  5. For Password, complete the following steps:
    1. Go to the console URL, which is the value of the RedShiftClusterPassword CloudFormation output from the previous step. The URL opens the Secrets Manager console for this password.
    2. Scroll down to the Secret value section and choose Retrieve secret value.
    3. Take note of the password to use later when connecting to the marketing Redshift cluster.
    4. Enter this value for Password.
  6. Choose Create connection.

Create a datashare using a SQL command

Complete the following steps to create a datashare in the data producer cluster (customer care) and share it with Lake Formation:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
  2. Choose (right-click) the cluster name and choose Edit connection or Create connection.
  3. For Authentication, select Temporary credentials using your IAM identity.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

  4. For Database, enter a database name (for this post, dev).
  5. Choose Create connection to connect to the database.
  6. Run the following SQL commands to create the datashare and add the data objects to be shared:
    create datashare customer_ds;
    ALTER DATASHARE customer_ds ADD SCHEMA PUBLIC;
    ALTER DATASHARE customer_ds ADD TABLE customer;

  7. Run the following SQL command to share the customer datashare to the current account via the AWS Glue Data Catalog:
    GRANT USAGE ON DATASHARE customer_ds TO ACCOUNT '<aws-account-id>' via DATA CATALOG;

  8. Verify the datashare was created and objects shared by running the following SQL command:
    DESC DATASHARE customer_ds;

Take note of the datashare producer cluster namespace and account ID, which will be used in the following step. You can complete the following actions on the console, but for simplicity, we use AWS CLI commands.

  9. Go to CloudShell or your AWS CLI and run the following command to authorize the datashare to the Data Catalog so that Lake Formation can manage it:
    aws redshift authorize-data-share \
    --data-share-arn 'arn:aws:redshift:<aws-region>:<aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds' \
    --consumer-identifier DataCatalog/<aws-account-id>

The following is an example output:

{
    "DataShareArn": "arn:aws:redshift:us-east-2:<aws-account-id>:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/customer_ds",
    "ProducerArn": "arn:aws:redshift:us-east-2:<aws-account-id>:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/<aws-account-id>",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

Take note of the datashare ARN that you used in this command; you will need it in the next steps.

Accept the datashare in the Lake Formation catalog

To accept the datashare, complete the following steps:

  1. Run the following AWS CLI command to accept and associate the Amazon Redshift datashare to the AWS Glue Data Catalog:
    aws redshift associate-data-share-consumer --data-share-arn 'arn:aws:redshift:<aws-region>:<aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds' \
    --consumer-arn arn:aws:glue:<aws-region>:<aws-account-id>:catalog

The following is an example output:

{
 "DataShareArn": "arn:aws:redshift:us-east-2:<aws-account-id>:datashare:cfd5fcbd-3492-42b5-9507-dad5d87f7427/customer_ds",
 "ProducerArn": "arn:aws:redshift:us-east-2:<aws-account-id>:namespace:cfd5fcbd-3492-42b5-9507-dad5d87f7427",
 "AllowPubliclyAccessibleConsumers": false,
 "DataShareAssociations": [
 {
 "ConsumerIdentifier": "arn:aws:glue:us-east-2:<aws-account-id>:catalog",
 "Status": "ACTIVE",
 "ConsumerRegion": "us-east-2",
 "CreatedDate": "2023-05-18T12:25:11.178000+00:00",
 "StatusChangeDate": "2023-05-18T12:25:11.178000+00:00"
 }
 ]
}
  2. Register the datashare in Lake Formation:
    aws lakeformation register-resource \
     --resource-arn arn:aws:redshift:<aws-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds

  3. Create the AWS Glue database that points to the accepted Redshift datashare:
    aws glue create-database --region <aws-region> --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "DatabaseInput": {
            "Name": "customer_db_shared",
            "FederatedDatabase": {
                "Identifier": "arn:aws:redshift:<aws-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds",
                "ConnectionName": "aws:redshift"
            }
        }
    }'

  4. To verify, go to the Lake Formation console and check that the database customer_db_shared is created.

Now the data lake administrator can view and grant access on both the database and tables to the data consumer team (marketing) personas using Lake Formation TBAC.

Assign Lake Formation tags to resources

Before we grant appropriate access to the IAM principals of the data analyst and power user within the marketing team, we have to assign LF-Tags to the tables and columns of the customer_db_shared database. We then grant these principals permissions on the appropriate LF-Tags.

To assign LF-tags, follow these steps:

  1. Assign the department and classification LF-tag to customer_db_shared (Redshift datashare) based on the tagging strategy table in the solution overview. You can run the following actions on the console, but for this post, we use the following AWS CLI command:
    aws lakeformation add-lf-tags-to-resource --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "Resource": {
        "Database": {
        "CatalogId": "<aws-account-id>",
        "Name": "customer_db_shared"
        }
        },
        "LFTags": [
        {
        "CatalogId": "<aws-account-id>",
        "TagKey": "department",
        "TagValues": [
        "customer"]
        },
        {
        "CatalogId": "<aws-account-id>",
        "TagKey": "classification",
        "TagValues": [
        "private"]
        }
        ]
        }'

If the command is successful, you should get a response like the following:

{
"Failures": []
}
  2. Assign the appropriate department and classification LF-tags to the marketing data lake database (lfunified_marketing_dl_db) on the S3 data lake:
    aws lakeformation add-lf-tags-to-resource --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "Resource": {
        "Database": {
        "CatalogId": "<aws-account-id>",
        "Name": "lfunified_marketing_dl_db"
        }
        },
        "LFTags": [
        {
        "CatalogId": "<aws-account-id>",
        "TagKey": "department",
        "TagValues": [
        "marketing"]
        },
        {
        "CatalogId": "<aws-account-id>",
        "TagKey": "classification",
        "TagValues": [
        "private"]
        }
        ]
        }'

Note that although you only assign the department and classification tags at the database level, they are inherited by the tables and columns within that database. (You can verify the effective tags with the sketch at the end of this section.)

  3. Assign the classification pii-sensitive LF-tag to PII columns of the customer table to override the inherited value from the database level:
    aws lakeformation add-lf-tags-to-resource --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "Resource": {
        "TableWithColumns": {
        "CatalogId": "<aws-account-id>",
        "DatabaseName": "customer_db_shared",
        "Name": "public.customer",
        "ColumnNames":["c_first_name","c_last_name","c_email_address"]
        }
        },
        "LFTags": [
        {
        "CatalogId": "<aws-account-id>",
        "TagKey": "classification",
        "TagValues": [
        "pii-sensitive"]
        }
        ]
        }'
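To confirm that the database-level tags were inherited and that the column-level override took effect, you can inspect the effective tags on the shared table. This is a verification sketch using the same placeholders as above:

# Show the LF-Tags assigned to (and inherited by) the shared customer table and its columns.
aws lakeformation get-resource-lf-tags --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Resource": {
        "Table": {
            "CatalogId": "<aws-account-id>",
            "DatabaseName": "customer_db_shared",
            "Name": "public.customer"
        }
    },
    "ShowAssignedLFTags": true
}'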

Grant permission based on LF-tag association

Run the following two AWS CLI commands to allow the marketing data analyst access to the customer table excluding the pii-sensitive (PII) columns. Replace the value for DataLakePrincipalIdentifier with the MarketingAnalystRoleARN that you noted from the outputs of the CloudFormation stack:

aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingAnalystRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
}'
aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingAnalystRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
}'

We have now granted the marketing analyst access to the customer database and to the table columns that are not tagged pii-sensitive.

To allow marketing power users access to table columns with restricted LF-tag (PII columns), run the following AWS CLI command:

aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingPowerUserRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
}'
aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingPowerUserRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private", "pii-sensitive"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
}'

We can combine the grants into a single batch grant permissions call:

aws lakeformation batch-grant-permissions --region us-east-1 --cli-input-json '{
    "CatalogId": "<aws-account-id>",
 "Entries": [
 {  "Id": "1",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingAnalystRole-1CYV6JSNN14E3"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
    },
    {  "Id": "2",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingAnalystRole-1CYV6JSNN14E3"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
    },
     {  "Id": "3",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingPoweruserRole-RKKM0TWQBP0W"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private", "pii-sensitive"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
    },
    {  "Id": "4",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingPoweruserRole-RKKM0TWQBP0W"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private", "pii-sensitive"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
    }
    ]
 }'

Validate the solution

In this section, we go through the steps to test the scenario.

Consume the datashare in the consumer (marketing) data warehouse

To enable the consumers (the marketing team) to access the customer data shared with them via the datashare, we first have to configure Query Editor v2 to use IAM credentials as the principal for Lake Formation permissions. Complete the following steps:

  1. Sign in to the console using the admin role you nominated in running the CloudFormation template step.
  2. On the Amazon Redshift console, go to Query Editor v2.
  3. Choose the gear icon in the navigation pane, then choose Account settings.
  4. Under Connection settings, select Authenticate with IAM credentials.
  5. Choose Save.

Now let’s connect to the marketing Redshift cluster and make the customer database available to the marketing team.

  1. Choose the options menu (three dots) next to the Serverless:lfunified-marketing-wg cluster and choose Create connection.
  2. Select Database user name and password.
  3. Leave Database as dev.
  4. For User name, enter admin.
  5. For Password, enter the same password you retrieved from Secrets Manager in an earlier step.
  6. Choose Create connection.
  7. Once successfully connected, choose the plus sign and choose Editor to open a new Query Editor tab.
  8. Make sure that you specify the Serverless: lfunified-marketing-wg workgroup and dev database.
  9. To create the Redshift database from the shared catalog database, run the following SQL command on the new tab:
    CREATE DATABASE ext_customerdb_shared FROM ARN 'arn:aws:glue:<aws-region>:<aws-account-id>:database/customer_db_shared' WITH DATA CATALOG SCHEMA "customer_db_shared"

  10. Run the following SQL commands to create and grant usage on the Redshift database to the IAM roles for the power users and data analyst. You can get the IAM role names from the CloudFormation stack outputs:
    CREATE USER IAMR:"lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX" password disable;
    GRANT USAGE ON DATABASE ext_customerdb_shared to IAMR:"lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX";
    
    CREATE USER IAMR:"lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY" password disable;
    GRANT USAGE ON DATABASE ext_customerdb_shared to IAMR:"lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY";

Create the data lake schema in AWS Glue and allow the marketing power role to query the lead and web activity data

Run the following SQL commands to make the lead data in the S3 data lake available to the marketing team:

create external schema datalake from data catalog
database 'lfunified_marketing_dl_db' 
iam_role 'SESSION'
catalog_id '<aws-account-id>';
GRANT USAGE ON SCHEMA datalake TO IAMR:"lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX";
GRANT USAGE ON SCHEMA datalake TO IAMR:"lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY";

Query the shared dataset as a marketing analyst user

To validate that the marketing team analysts (IAM role marketing-analyst-role) have access to the shared database, perform the following steps:

  1. Sign in to the console (for convenience, you can use a different browser) and switch your role to lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX.
  2. On the Amazon Redshift console, go to Query Editor v2.
  3. To connect to the consumer cluster, choose the Serverless: lfunified-marketing-wg consumer data warehouse in the navigation pane.
  4. When prompted, for Authentication, select Federated user.
  5. For Database, enter the database name (for this post, dev).
  6. Choose Save.
  7. Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
    select current_user;

  8. To find the federated databases created on the consumer account, run the following SQL command:
    SHOW DATABASES FROM DATA CATALOG ACCOUNT '<aws-account-id>';

  9. To validate permissions for the marketing analyst role, run the following SQL command:
    select * from ext_customerdb_shared.public.customer limit 10;

As you can see in the following screenshot, the marketing analyst is able to successfully access the customer data but only the non-PII attributes, which was our intention.

  10. Now let's validate that the marketing analyst doesn't have access to the PII columns of the same table. The following query should fail with a permission error, because the analyst grant excludes the pii-sensitive tag:
    select c_email_address from ext_customerdb_shared.public.customer limit 10;

Query the shared datasets as a marketing power user

To validate that the marketing power users (IAM role lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY) have access to the pii-sensitive columns in the shared database, perform the following steps:

  1. Sign in to the console (for convenience, you can use a different browser) and switch your role to lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY.
  2. On the Amazon Redshift console, go to Query Editor v2.
  3. To connect to the consumer cluster, choose the Serverless: lfunified-marketing-wg consumer data warehouse in the navigation pane.
  4. When prompted, for Authentication, select Federated user.
  5. For Database, enter the database name (for this post, dev).
  6. Choose Save.
  7. Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
    select current_user;

  8. Now let’s validate that the marketing power role has access to the PII columns of the customer table:
    select c_customer_id, c_first_name, c_last_name, c_email_address from ext_customerdb_shared.public.customer limit 10;

  9. Validate that the power users within the marketing team can now run a query to combine data across different datasets that they have access to in order to run effective campaigns:
    SELECT
        emailaddress as emailAddress,  customer.c_first_name as firstName, customer.c_last_name as lastName, leadsource, contactnotes, usedpromo
    FROM
        "dev"."datalake"."lead" as lead
    JOIN ext_customerdb_shared.public.customer as customer
    ON lead.emailaddress = customer.c_email_address
    WHERE lead.donotreachout = 'false'

Clean up

After you complete the steps in this post, to clean up resources, delete the CloudFormation stack:

  1. On the AWS CloudFormation console, select the stack you deployed in the beginning of this post.
  2. Choose Delete and follow the prompts to delete the stack.

Conclusion

In this post, we showed how you can use Lake Formation tags to manage permissions for your data lake and Amazon Redshift data sharing. Using LF-TBAC for data governance helps you manage your data lake and Amazon Redshift data sharing permissions at scale. It also enables data sharing across business units with fine-grained access control. Managing access to your data lake and Redshift datashares in a single place enables better governance, helping with data security and compliance.

If you have questions or suggestions, submit them in the comments section.

For more information on Lake Formation managed Amazon Redshift data sharing and tag-based access control, refer to Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation and Easily manage your data lake at scale using AWS Lake Formation Tag-based access control.


About the Authors

Praveen Kumar is an Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and ML applications.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Paul Villena is an Analytics Solutions Architect in AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure as code, serverless technologies, and coding in Python.

Mostafa Safipour is a Solutions Architect at AWS based out of Sydney. He works with customers to realize business outcomes using technology and AWS. Over the past decade, he has helped many large organizations in the ANZ region build their data, digital, and enterprise workloads on AWS.