Tag Archives: Technical How-to

Load test your applications in a CI/CD pipeline using CDK pipelines and AWS Distributed Load Testing Solution

Post Syndicated from Krishnakumar Rengarajan original https://aws.amazon.com/blogs/devops/load-test-applications-in-cicd-pipeline/

Load testing is a foundational pillar of building resilient applications. Today, load testing practices across many organizations are often based on desktop tools, where someone must manually run the performance tests and validate the results before a software release can be promoted to production. This leads to increased time to market for new features and products. Load testing applications in automated CI/CD pipelines provides the following benefits:

  • Early and automated feedback on performance thresholds based on clearly defined benchmarks.
  • Consistent and reliable load testing process for every feature release.
  • Reduced overall time to market due to eliminated manual load testing effort.
  • Improved overall resiliency of the production environment.
  • The ability to rapidly identify and document bottlenecks and scaling limits of the production environment.

In this blog post, we demonstrate how to automatically load test your applications in a CI/CD pipeline using the AWS Distributed Load Testing solution and AWS CDK Pipelines.

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to define cloud infrastructure in code and provision it through AWS CloudFormation. AWS CDK Pipelines is a construct library module for continuous delivery of AWS CDK applications, powered by AWS CodePipeline. AWS CDK Pipelines can automatically build, test, and deploy the new version of your CDK app whenever the new source code is checked in.

Distributed Load Testing is an AWS Solution that automates software applications testing at scale to help you identify potential performance issues before their release. It creates and simulates thousands of users generating transactional records at a constant pace without the need to provision servers or instances.

Prerequisites

To deploy and test this solution, you will need:

  • AWS Command Line Interface (AWS CLI): This tutorial assumes that you have configured the AWS CLI on your workstation. Alternatively, you can also use AWS CloudShell.
  • AWS CDK V2: This tutorial assumes that you have installed AWS CDK V2 on your workstation or in the CloudShell environment.

Solution Overview

In this solution, we create a CI/CD pipeline using AWS CDK Pipelines and use it to deploy a sample RESTful CDK application in two environments: development and production. We load test the application using the AWS Distributed Load Testing solution in the development environment. Based on the load test result, we either fail the pipeline or proceed to production deployment. You may consider running the load test in a dedicated testing environment that mimics the production environment.

For demonstration purposes, we use the following metrics to validate the load test results.

  • Average Response Time – the average response time, in seconds, for all the requests generated by the test. In this blog post, we set the threshold for average response time to 1 second.
  • Error Count – the total number of errors. In this blog post, we set the threshold for the total number of errors to 1.

For your application, you may consider using additional metrics from the Distributed Load Testing solution documentation to validate your load test.
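
To make these checks concrete, the following is a minimal Python sketch of how such a validation might be expressed. The result keys (avg_response_time, error_count) are placeholders for illustration and are not the exact field names returned by the solution's API.

# Minimal sketch of a threshold check for load test results.
# The result keys below are illustrative placeholders, not the exact
# field names returned by the Distributed Load Testing solution API.
AVG_RT_THRESHOLD = 1.0     # seconds
ERROR_COUNT_THRESHOLD = 1  # total errors

def validate_results(results: dict) -> None:
    if results["avg_response_time"] > AVG_RT_THRESHOLD:
        raise RuntimeError(
            f"Average response time {results['avg_response_time']}s exceeds {AVG_RT_THRESHOLD}s"
        )
    if results["error_count"] >= ERROR_COUNT_THRESHOLD:
        raise RuntimeError(
            f"Error count {results['error_count']} breaches the threshold of {ERROR_COUNT_THRESHOLD}"
        )

validate_results({"avg_response_time": 0.52, "error_count": 0})  # passes for the sample application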

Architecture diagram

Architecture diagram of the solution to execute load tests in CI/CD pipeline

Solution Components

  • AWS CDK code for the CI/CD pipeline, including AWS Identity and Access Management (IAM) roles and policies. The pipeline has the following stages:
    • Source: fetches the source code for the sample application from the AWS CodeCommit repository.
    • Build: compiles the code and executes cdk synth to generate CloudFormation template for the sample application.
    • UpdatePipeline: updates the pipeline if there are any changes to our code or the pipeline configuration.
    • Assets: prepares and publishes all file assets to Amazon S3 (S3).
    • Development Deployment: deploys application to the development environment and runs a load test.
    • Production Deployment: deploys application to the production environment.
  • AWS CDK code for a sample serverless RESTful application.
    Architecture diagram of the sample RESTful application
    • The AWS Lambda (Lambda) function in the architecture contains a 500 millisecond sleep statement to add latency to the API response.
  • TypeScript code for starting the load test and validating the test results. This code is executed in the ‘Load Test’ step of the ‘Development Deployment’ stage. It starts a load test against the sample RESTful application endpoint and waits for the test to finish. For demonstration purposes, the load test is started with the following parameters (a sketch of this step follows the list):
    • Concurrency: 1
    • Task Count: 1
    • Ramp up time: 0 secs
    • Hold for: 30 sec
    • End point to test: endpoint for the sample RESTful application.
    • HTTP method: GET
  • Load Testing service deployed via the AWS Distributed Load Testing Solution. For costs related to the AWS Distributed Load Testing Solution, see the solution documentation.
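
The repository implements the ‘Load Test’ step in TypeScript; the following Python sketch only illustrates the general shape of that call for readers who want to script it themselves. The /scenarios path and the payload field names are assumptions based on the solution’s API and should be verified against the Distributed Load Testing API documentation.

import json
import urllib.request

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# DLTApiEndpoint output from the Distributed Load Testing stack (placeholder value).
DLT_API_ENDPOINT = "https://abc123.execute-api.us-east-1.amazonaws.com/prod"

# Payload field names are assumptions for illustration; verify them against
# the Distributed Load Testing solution's API documentation before use.
scenario = {
    "testName": "sampleScenario",
    "testDescription": "Load test started from the CI/CD pipeline",
    "taskCount": 1,
    "testType": "simple",
    "testScenario": {
        "execution": [{"concurrency": 1, "ramp-up": "0s", "hold-for": "30s"}],
        "scenarios": {
            "sampleScenario": {"requests": [{"url": "https://<sample-app-endpoint>", "method": "GET"}]}
        },
    },
}

# The DLT API is IAM-authorized, so the request must be SigV4 signed.
session = boto3.Session()
region = session.region_name or "us-east-1"
url = f"{DLT_API_ENDPOINT}/scenarios"
aws_request = AWSRequest(
    method="POST",
    url=url,
    data=json.dumps(scenario).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
SigV4Auth(session.get_credentials(), "execute-api", region).add_auth(aws_request)

http_request = urllib.request.Request(
    url, data=aws_request.body, headers=dict(aws_request.headers.items()), method="POST"
)
with urllib.request.urlopen(http_request) as response:
    print(response.status, response.read().decode("utf-8"))

The same signed-request pattern can be used to poll the test status until it completes and then apply the threshold checks shown earlier.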

Implementation Details

For the purposes of this blog, we deploy the CI/CD pipeline, the RESTful application and the AWS Distributed Load Testing solution into the same AWS account. In your environment, you may consider deploying these stacks into separate AWS accounts based on your security and governance requirements.

To deploy the solution components

  1. Follow the instructions in the AWS Distributed Load Testing solution Automated Deployment guide to deploy the solution. Note down the value of the CloudFormation output parameter ‘DLTApiEndpoint’. We will need this in the next steps. Proceed to the next step once you are able to log in to the user interface of the solution.
  2. Clone the blog Git repository
    git clone https://github.com/aws-samples/aws-automatically-load-test-applications-cicd-pipeline-blog

  3. Update the Distributed Load Testing Solution endpoint URL in loadTestEnvVariables.json.
  4. Deploy the CloudFormation stack for the CI/CD pipeline. This step will also commit the AWS CDK code for the sample RESTful application stack and start the application deployment.
    cd pipeline && cdk bootstrap && cdk deploy --require-approval never
  5. Follow these steps to view the load test results:
      1. Open the AWS CodePipeline console.
      2. Click on the pipeline named “blog-pipeline”.
      3. Observe that one of the stages (named ‘LoadTest’) in the CI/CD pipeline (that was provisioned by the CloudFormation stack in the previous step) executes a load test against the application Development environment.
        Diagram representing CodePipeline highlighting the LoadTest stage passing successfully
      4. Click on the details of the ‘LoadTest’ step to view the test results. Notice that the load test succeeded.
        Diagram showing sample logs when load tests pass successfully

Change the response time threshold

In this step, we will modify the response time threshold from 1 second to 200 milliseconds in order to introduce a load test failure. Remember from the steps earlier that the Lambda function code has a 500 millisecond sleep statement to add latency to the API response time.
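
The sample application's handler ships with the repository; purely as an illustration (not the repository's actual code), a handler with that artificial delay could look like the following.

import json
import time

def handler(event, context):
    # Simulate processing latency so the load test has measurable response times;
    # the 500 ms sleep mirrors the delay described for the sample application.
    time.sleep(0.5)
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello from the sample RESTful application"}),
    }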

  1. From the AWS Console, go to CodeCommit. The source for the pipeline is a CodeCommit repository named “blog-repo”.
  2. Click on the “blog-repo” repository, and then browse to the “pipeline” folder. Click on file ‘loadTestEnvVariables.json’ and then ‘Edit’.
  3. Set the response time threshold to 200 milliseconds by changing the ‘AVG_RT_THRESHOLD’ attribute value to ‘.2’. Click on the commit button. This will start the CI/CD pipeline.
  4. Go to CodePipeline from the AWS console and click on the ‘blog-pipeline’.
  5. Observe that the ‘LoadTest’ step in the ‘Development-Deploy’ stage fails in about five minutes, and the pipeline does not proceed to the ‘Production-Deploy’ stage.
    Diagram representing CodePipeline highlighting the LoadTest stage failing
  6. Click on the details of the ‘LoadTest’ step to view the test results. Notice that the load test failed.
    Diagram showing sample logs when load tests fail
  7. Log into the Distributed Load Testing Service console. You will see two tests named ‘sampleScenario’. Click on each of them to see the test result details.

Cleanup

  1. Delete the CloudFormation stack that deployed the sample application.
    1. From the AWS Console, go to CloudFormation and delete the stacks ‘Production-Deploy-Application’ and ‘Development-Deploy-Application’.
  2. Delete the CI/CD pipeline.
    cd pipeline && cdk destroy
  3. Delete the Distributed Load Testing Service CloudFormation stack.
    1. From CloudFormation console, delete the stack for Distributed Load Testing service that you created earlier.

Conclusion

In this post, we demonstrated how to automatically load test your applications in a CI/CD pipeline using AWS CDK Pipelines and the AWS Distributed Load Testing solution. We defined the performance benchmarks for our application as configuration. We then used these benchmarks to automatically validate the application performance prior to production deployment. Based on the load test results, we either proceeded to production deployment or failed the pipeline.

About the Authors

Usman Umar

Usman Umar

Usman Umar is a Sr. Applications Architect at AWS Professional Services. He is passionate about developing innovative ways to solve hard technical problems for the customers. In his free time, he likes going on biking trails, doing car modifications, and spending time with his family.

Krishnakumar Rengarajan

Krishnakumar Rengarajan

Krishnakumar Rengarajan is a Senior DevOps Consultant with AWS Professional Services. He enjoys working with customers and focuses on building and delivering automated solutions that enable customers on their AWS cloud journey.

How to use AWS Verified Access logs to write and troubleshoot access policies

Post Syndicated from Ankush Goyal original https://aws.amazon.com/blogs/security/how-to-use-aws-verified-access-logs-to-write-and-troubleshoot-access-policies/

On June 19, 2023, AWS Verified Access introduced improved logging functionality; Verified Access now logs more extensive user context information received from the trust providers. This improved logging feature simplifies administration and troubleshooting of application access policies while adhering to zero-trust principles.

In this blog post, we will show you how to manage the Verified Access logging configuration and how to use Verified Access logs to write and troubleshoot access policies faster. We provide an example showing the user context information that was logged before and after the improved logging functionality and how you can use that information to transform a high-level policy into a fine-grained policy.

Overview of AWS Verified Access

AWS Verified Access helps enterprises to provide secure access to their corporate applications without using a virtual private network (VPN). Using Verified Access, you can configure fine-grained access policies to help limit application access only to users who meet the specified security requirements (for example, user identity and device security status). These policies are written in Cedar, a new policy language developed and open-sourced by AWS.

Verified Access validates each request based on access policies that you set. You can use user context—such as user, group, and device risk score—from your existing third-party identity and device security services to define access policies. In addition, Verified Access provides you an option to log every access attempt to help you respond quickly to security incidents and audit requests. These logs also contain user context sent from your identity and device security services and can help you to match the expected outcomes with the actual outcomes of your policies. To capture these logs, you need to enable logging from the Verified Access console.

Figure 1: Overview of AWS Verified Access architecture showing Verified Access connected to an application

Figure 1: Overview of AWS Verified Access architecture showing Verified Access connected to an application

After a Verified Access administrator attaches a trust provider to a Verified Access instance, they can write policies using the user context information from the trust provider. This user context information is custom to an organization, and you need to gather it from different sources when writing or troubleshooting policies that require more extensive user context.

Now, with the improved logging functionality, the Verified Access logs record more extensive user context information from the trust providers. This eliminates the need to gather information from different sources. With the detailed context available in the logs, you have more information to help validate and troubleshoot your policies.

Let’s walk through an example of how this detailed context can help you improve your Verified Access policies. For this example, we set up a Verified Access instance using AWS IAM Identity Center (successor to AWS Single Sign-on) and CrowdStrike as trust providers. To learn more about how to set up a Verified Access instance, see Getting started with Verified Access. To learn how to integrate Verified Access with CrowdStrike, see Integrating AWS Verified Access with device trust providers.

Then we wrote the following simple policy, where users are allowed only if their email matches the corporate domain.

permit(principal,action,resource)
when {
    context.sso.user.email.address like "*@example.com"
};

Before improved logging, Verified Access logged basic information only, as shown in the following example log.

    "identity": {
        "authorizations": [
            {
                "decision": "Allow",
                "policy": {
                    "name": "inline"
                }
            }
        ],
        "idp": {
            "name": "user",
            "uid": "vatp-09bc4cbce2EXAMPLE"
        },
        "user": {
            "email_addr": "[email protected]",
            "name": "Test User Display",
            "uid": "[email protected]",
            "uuid": "00u6wj48lbxTAEXAMPLE"
        }
    }

Modify an existing Verified Access instance

To improve the preceding policy and make it more granular, you can include checks for various user and device details. For example, you can check if the user belongs to a particular group, has a verified email, should be logging in from a device with an OS that has an assessment score greater than 50, and has an overall device score greater than 15.

Modify the Verified Access instance logging configuration

You can modify the instance logging configuration of an existing Verified Access instance by using either the AWS Management Console or AWS Command Line Interface (AWS CLI).

  1. Open the Verified Access console and select Verified Access instances.
  2. Select the instance that you want to modify, and then, on the Verified Access instance logging configuration tab, select Modify Verified Access instance logging configuration.
    Figure 2: Modify Verified Access logging configuration

    Figure 2: Modify Verified Access logging configuration

  3. Under Update log version, select ocsf-1.0.0-rc.2, turn on Include trust context, and select where the logs should be delivered.
    Figure 3: Verified Access log version and trust context

    Figure 3: Verified Access log version and trust context

After you’ve completed the preceding steps, Verified Access will start logging more extensive user context information from the trust providers for every request that Verified Access receives. This context can include sensitive information. To learn more about how to protect it, see Protect Sensitive Data with Amazon CloudWatch Logs.
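
If you prefer to script this change instead of using the console, the following is a minimal sketch using the AWS SDK for Python (boto3). The instance ID and log group are placeholders, and the AccessLogs field names should be verified against the EC2 API reference for your SDK version.

import boto3

ec2 = boto3.client("ec2")

# Placeholder identifiers; replace with your Verified Access instance ID and log group.
ec2.modify_verified_access_instance_logging_configuration(
    VerifiedAccessInstanceId="vai-0123456789abcdef0",
    AccessLogs={
        "LogVersion": "ocsf-1.0.0-rc.2",
        "IncludeTrustContext": True,
        "CloudWatchLogs": {
            "Enabled": True,
            "LogGroup": "verified-access-logs",
        },
    },
)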

The following example log shows information received from the IAM Identity Center identity provider (IdP) and the device provider CrowdStrike.

"data": {
    "context": {
        "crowdstrike": {
            "assessment": {
                "overall": 21,
                "os": 53,
                "sensor_config": 4,
                "version": "3.6.1"
            },
            "cid": "7545bXXXXXXXXXXXXXXX93cf01a19b",
            "exp": 1692046783,
            "iat": 1690837183,
            "jwk_url": "https://assets-public.falcon.crowdstrike.com/zta/jwk.json",
            "platform": "Windows 11",
            "serial_number": "ec2dXXXXb-XXXX-XXXX-XXXX-XXXXXX059f05",
            "sub": "99c185e69XXXXXXXXXX4c34XXXXXX65a",
            "typ": "crowdstrike-zta+jwt"
        },
        "sso": {
            "user": {
                "user_id": "24a80468-XXXX-XXXX-XXXX-6db32c9f68fc",
                "user_name": "XXXX",
                "email": {
                    "address": "[email protected]",
                    "verified": false
                }
            },
            "groups": {
                "04c8d4d8-e0a1-XXXX-383543e07f11": {
                    "group_name": "XXXX"
                }
            }
        },
        "http_request": {
            "hostname": "sales.example.com",
            "http_method": "GET",
            "x_forwarded_for": "52.XX.XX.XXXX",
            "port": 80,
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
            "client_ip": "52.XX.XX.XXXX"
        }
    }
}

The following example log shows the user context information received from the OpenID Connect (OIDC) trust provider Okta. You can see the difference in the information provided by the two different trust providers: IAM Identity Center and Okta.

"data": {
    "context": {
        "http_request": {
            "hostname": "sales.example.com",
            "http_method": "GET",
            "x_forwarded_for": "99.X.XX.XXX",
            "port": 80,
            "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
            "client_ip": "99.X.XX.XXX"
        },
        "okta": {
            "sub": "00uXXXXXXXJNbWyRI5d7",
            "name": "XXXXXX",
            "locale": "en_US",
            "preferred_username": "[email protected]",
            "given_name": "XXXX",
            "family_name": "XXXX",
            "zoneinfo": "America/Los_Angeles",
            "groups": [
                "Everyone",
                "Sales",
                "Finance",
                "HR"
            ],
            "exp": 1690835175,
            "iss": "https://example.okta.com"
        }
    }
}

The following is a sample policy written using the information received from the trust providers.

permit(principal,action,resource)
when {
  context.idcpolicy.groups has "<hr-group-id>" &&
  context.idcpolicy.user.email.address like "*@example.com" &&
  context.idcpolicy.user.email.verified == true &&
  context has "crdstrikepolicy" &&
  context.crdstrikepolicy.assessment.os > 50 &&
  context.crdstrikepolicy.assessment.overall > 15
};

This policy only grants access to users who belong to a particular group, have a verified email address, and have a corporate email domain. Also, users can only access the application from a device with an OS that has an assessment score greater than 50, and has an overall device score greater than 15.

Conclusion

In this post, you learned how to manage Verified Access logging configuration from the Verified Access console and how to use improved logging information to write AWS Verified Access policies. To get started with Verified Access, see the Amazon VPC console.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ankush Goyal

Ankush Goyal

Ankush is an Enterprise Support Lead in AWS Enterprise Support who helps Enterprise Support customers streamline their cloud operations on AWS. He enjoys working with customers to help them design, implement, and support cloud infrastructure. He is a results-driven IT professional with over 18 years of experience.

Anbu Kumar Krishnamurthy

Anbu Kumar Krishnamurthy

Anbu is a Technical Account Manager who specializes in helping clients integrate their business processes with the AWS Cloud to achieve operational excellence and efficient resource utilization. Anbu helps customers design and implement solutions, troubleshoot issues, and optimize their AWS environments. He works with customers to architect solutions aimed at achieving their desired business outcomes.

Perform Amazon Kinesis load testing with Locust

Post Syndicated from Luis Morales original https://aws.amazon.com/blogs/big-data/perform-amazon-kinesis-load-testing-with-locust/

Building a streaming data solution requires thorough testing at the scale it will operate in a production environment. Streaming applications operating at scale often handle data volumes of up to gigabytes per second, and it’s challenging for developers to easily generate such load when simulating high-traffic Amazon Kinesis-based applications.

Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose are capable of capturing and storing terabytes of data per hour from numerous sources. Creating Kinesis data streams or Firehose delivery streams is straightforward through the AWS Management Console, AWS Command Line Interface (AWS CLI), or Kinesis API. However, generating a continuous stream of test data requires a custom process or script to run continuously. Although the Amazon Kinesis Data Generator (KDG) provides a user-friendly UI for this purpose, it has some limitations, such as bandwidth constraints and increased round trip latency. (For more information on the KDG, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.)

To overcome these limitations, this post describes how to use Locust, a modern load testing framework, to conduct large-scale load testing for a more comprehensive evaluation of the streaming data solution.

Overview

This project emits temperature sensor readings via Locust to Kinesis. We set up the Amazon Elastic Compute Cloud (Amazon EC2) Locust instance via the AWS Cloud Development Kit (AWS CDK) to load test Kinesis-based applications. You can access the Locust dashboard to perform and observe the load test and connect via Session Manager, a capability of AWS Systems Manager, for configuration changes. The following diagram illustrates this architecture.

Architecture overview

In our testing with the largest recommended instance (c7g.16xlarge), the setup was capable of emitting over 1 million events per second to Kinesis data streams in on-demand capacity mode, with a batch size (simulated users per Locust user) of 500. You can find more details on what this means and how to configure the load test later in this post.

Locust overview

Locust is an open-source, scriptable, and scalable performance testing tool that allows you to define user behavior using Python code. It offers an easy-to-use interface, making it developer-friendly and highly expandable. With its distributed and scalable design, Locust can simulate millions of simultaneous users to mimic real user behavior during a performance test.

Each Locust user represents a scenario or a specific set of actions that a real user might perform on your system. When you run a performance test with Locust, you can specify the number of concurrent Locust users you want to simulate, and Locust will create an instance for each user, allowing you to assess the performance and behavior of your system under different user loads.

For more information on Locust, refer to the Locust documentation.
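
As a general illustration of these concepts (separate from the Kinesis-specific user class that this project ships), a minimal Locust user might look like the following; the host and path are placeholders.

from locust import HttpUser, task, between

class SensorUser(HttpUser):
    # Placeholder target; normally supplied via the Host field in the Locust UI.
    host = "https://example.com"

    # Each simulated user waits 1-2 seconds between task executions.
    wait_time = between(1, 2)

    @task
    def get_readings(self):
        # A single action that this simulated user performs repeatedly during the test.
        self.client.get("/readings")

Running locust -f against a file containing this class would simulate as many of these users as you configure in the dashboard.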

Prerequisites

To get started, clone or download the code from the GitHub repository.

Test locally

To test Locust locally before deploying it to the cloud, you have to install the necessary Python dependencies. If you’re new to Python, refer to the README for more information on getting started.

Navigate to the load-test directory and run the following code:

pip install -r requirements.txt

To send events to a Kinesis data stream from your local machine, you will need to have AWS credentials. For more information, refer to Configuration and credential file settings.

To perform the test locally, stay in the load-test directory and run the following code:

locust -f locust-load-test.py

You can now access the Locust dashboard via http://0.0.0.0:8089/. Enter the number of Locust users, the spawn rate (users added per second), and the target Amazon Kinesis data stream name for Host. By default, it deploys the Kinesis data stream DemoStream that you can use for testing.

Locust Dashboard - Enter details

To see the generated events logged, run the following command, which filters only Locust and root logs (for example, no Botocore logs):

locust -f locust-load-test.py --loglevel DEBUG 2>&1 | grep -E "(locust|root)"

Set up resources with the AWS CDK

The GitHub repository contains the AWS CDK code to create all the necessary resources for the load test. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

  1. If not already downloaded, clone the GitHub repository to your local computer using the following command:
git clone https://github.com/aws-samples/amazon-kinesis-load-testing-with-locust
  2. Download and install the latest Node.js.
  3. Navigate to the root folder of the project and run the following command to install the latest version of AWS CDK:
npm install -g aws-cdk
  4. Install the necessary dependencies:
npm install
  5. Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command:
cdk bootstrap

To learn more about the bootstrapping process, refer to Bootstrapping.

  6. After the dependencies are installed, you can run the following command to deploy the stack of the AWS CDK template, which sets up the infrastructure within 5 minutes:
cdk deploy

The template sets up the Locust EC2 test instance, which is by default a c7g.xlarge instance, which at the time of publishing costs approximately $0.145 per hour in us-east-1. To find the most accurate pricing information, see Amazon EC2 On-Demand Pricing. You can find more details on how to change your instance size according to your scale of load testing later in this post.

It’s crucial to consider that the expenses incurred during load testing are not solely attributed to EC2 instance costs, but also heavily influenced by data transfer costs.

Accessing the Locust dashboard

You can access the dashboard by using the AWS CDK output KinesisLocustLoadTestingStack.locustdashboardurl to open the dashboard, for example http://1.2.3.4:8089.

The Locust dashboard is password protected. By default, it’s set to user name locust-user and password locust-dashboard-pwd.

With the default configuration, you can achieve up to 15,000 emitted events per second. Enter the number of Locust users (times the batch size), the spawn rate (users added per second), and the target Kinesis data stream name for Host.

Locust Dashboard - Enter details

After you have started the load test, you can look at the load test on the Charts tab.

Locust Dashboard - Charts

You can also monitor the load test on the Kinesis Data Streams console by navigating to the stream that you are load testing. If you used the default settings, navigate to DemoStream. On the detail page, choose the Monitoring tab to see the ingested load.

Kinesis Data Streams - Monitoring

Adapt workloads

By default, this project generates random temperature sensor readings for every sensor with the following format:

{
    "sensorId": "bfbae19c-2f0f-41c2-952b-5d5bc6e001f1_1",
    "temperature": 147.24,
    "status": "OK",
    "timestamp": 1675686126310
}

The project comes packaged with Faker, which you can use to adapt the payload to your needs. You just have to update the generate_sensor_reading function in the locust-load-test.py file:

class SensorAPIUser(KinesisBotoUser):
    # ...

    def generate_sensor_reading(self, sensor_id, sensor_reading):
        current_temperature = round(10 + random.random() * 170, 2)

        if current_temperature > 160:
            status = "ERROR"
        elif current_temperature > 140 or random.randrange(1, 100) > 80:
            status = random.choice(["WARNING", "ERROR"])
        else:
            status = "OK"

        return {
            'sensorId': f"{sensor_id}_{sensor_reading}",
            'temperature': current_temperature,
            'status': status,
            'timestamp': round(time.time()*1000)
        }

    # ...
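
For example, a possible variation that uses Faker to enrich the payload might look like the following. It is shown as a standalone function for readability (in the project it is a method on the Locust user class), and the extra deviceIp and site fields are illustrative additions, not part of the project's schema.

import random
import time

from faker import Faker

fake = Faker()

def generate_sensor_reading(sensor_id, sensor_reading):
    # Same shape as the project's payload, with two illustrative Faker-generated fields added.
    return {
        'sensorId': f"{sensor_id}_{sensor_reading}",
        'temperature': round(10 + random.random() * 170, 2),
        'status': random.choice(["OK", "WARNING", "ERROR"]),
        'deviceIp': fake.ipv4_private(),  # illustrative extra field
        'site': fake.city(),              # illustrative extra field
        'timestamp': round(time.time() * 1000)
    }

print(generate_sensor_reading("bfbae19c", 1))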

Change configurations

After the initial deployment of the load testing tool, you can change configuration in two ways:

  1. Connect to the EC2 instance, make any configuration and code changes, and restart the Locust process
  2. Change the configuration and load testing code locally and redeploy it via cdk deploy

The first option helps you iterate more quickly on the remote instance without a need to redeploy. The latter uses the infrastructure as code (IaC) approach and makes sure that your configuration changes can be committed to your source control system. For a fast development cycle, it’s recommended to test your load test configuration locally first, connect to your instance to apply the changes, and after successful implementation, codify it as part of your IaC repository and then redeploy.

Locust is created on the EC2 instance as a systemd service and can therefore be controlled with systemctl. If you want to change the configuration of Locust as needed without redeploying the stack, you can connect to the instance via Systems Manager, navigate to the project directory on /usr/local/load-test, change the locust.env file, and restart the service by running sudo systemctl restart locust.

Large-scale load testing


To achieve peak performance with Locust and Kinesis, keep the following in mind:

  • Instance size – Your performance is bound by the underlying EC2 instance, so refer to EC2 instance type for more information about scaling. To set the correct instance size, you can configure the instance size in the file kinesis-locust-load-testing.ts.
  • Number of secondaries – Locust benefits from a distributed setup. Therefore, the setup spins up a primary, which does the coordination, and multiple secondaries, which do the actual work. To fully take advantage of the cores, you should specify one secondary per core. You can configure the number in the locust.env file.
  • Batch size – The number of Kinesis data stream events you can send per Locust user is limited due to the resource overhead of switching Locust users and threads. To overcome this, you can configure a batch size to define how many simulated users are represented by each Locust user; their records are sent in a single Kinesis put_records call (see the sketch following this list). You can configure the number in the locust.env file.
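
The following is a minimal boto3 sketch of what such a batched call looks like, separate from the project's own Locust user implementation. The stream name matches the default DemoStream; the record contents are illustrative.

import json

import boto3

kinesis = boto3.client("kinesis")

# PutRecords accepts at most 500 records per call, which is why a batch size of 500
# maps neatly onto a single request. Record contents below are illustrative.
BATCH_SIZE = 500
records = [
    {
        "Data": json.dumps({"sensorId": f"sensor_{i}", "temperature": 20.0, "status": "OK"}).encode("utf-8"),
        "PartitionKey": f"sensor_{i}",
    }
    for i in range(BATCH_SIZE)
]

response = kinesis.put_records(StreamName="DemoStream", Records=records)
print(f"Failed records: {response['FailedRecordCount']}")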

This setup is capable of emitting over 1 million events per second to the Kinesis data stream, with a batch size of 500 and 64 secondaries on a c7g.16xlarge instance.

Locust Dashboard - Large Scale Load Test Charts

You can observe this on the Monitoring tab for the Kinesis data stream as well.

Kinesis Data Stream - Large Scale Load Test Monitoring

Clean up

In order to not incur any unnecessary costs, delete the stack by running the following code:

cdk destroy

Summary

Kinesis is already popular for its ease of use among users building streaming applications. With this load testing capability using Locust, you can now test your workloads in a more straightforward and faster way. Visit the GitHub repo to embark on your testing journey.

The project is licensed under the Apache 2.0 license, providing the freedom to clone and modify it according to your needs. Furthermore, you can contribute to the project by submitting issues or pull requests via GitHub, fostering collaboration and improvement in the testing ecosystem.


About the author

Luis Morales works as a Senior Solutions Architect with digital native businesses to support them in constantly reinventing themselves in the cloud. He is passionate about software engineering, cloud-native distributed systems, test-driven development, and all things code and security.

Monitor data pipelines in a serverless data lake

Post Syndicated from Virendhar Sivaraman original https://aws.amazon.com/blogs/big-data/monitor-data-pipelines-in-a-serverless-data-lake/

AWS serverless services, including but not limited to AWS Lambda, AWS Glue, AWS Fargate, Amazon EventBridge, Amazon Athena, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3), have become the building blocks for any serverless data lake, providing key mechanisms to ingest and transform data without fixed provisioning and the persistent need to patch the underlying servers. Building a data lake in a serverless paradigm brings significant cost and performance benefits. However, the rapid adoption of serverless data lake architectures—with ever-growing datasets that need to be ingested from a variety of sources, followed by complex data transformation and machine learning (ML) pipelines—can present challenges. Similarly, in a serverless paradigm, application logs in Amazon CloudWatch are sourced from a variety of participating services, and traversing the lineage across logs can also present challenges. To successfully manage a serverless data lake, you require mechanisms to perform the following actions:

  • Reinforce data accuracy with every data ingestion
  • Holistically measure and analyze ETL (extract, transform, and load) performance at the individual processing component level
  • Proactively capture log messages and notify failures as they occur in near-real time

In this post, we will walk you through a solution to efficiently track and analyze ETL jobs in a serverless data lake environment. By monitoring application logs, you can gain insights into job execution and troubleshoot issues promptly to ensure the overall health and reliability of your data pipelines.

Overview of solution

The serverless monitoring solution focuses on achieving the following goals:

  • Capture state changes across all steps and tasks in the data lake
  • Measure service reliability across a data lake
  • Quickly notify operations of failures as they happen

To illustrate the solution, we create a serverless data lake with a monitoring solution. For simplicity, we create a serverless data lake with the following components:

  • Storage layer – Amazon S3 is the natural choice, in this case with the following buckets:
    • Landing – Where raw data is stored
    • Processed – Where transformed data is stored
  • Ingestion layer – For this post, we use Lambda and AWS Glue for data ingestion, with the following resources:
    • Lambda functions – Two Lambda functions that run to simulate a success state and failure state, respectively
    • AWS Glue crawlers – Two AWS Glue crawlers that run to simulate a success state and failure state, respectively
    • AWS Glue jobs – Two AWS Glue jobs that run to simulate a success state and failure state, respectively
  • Reporting layer – An Athena database to persist the tables created via the AWS Glue crawlers and AWS Glue jobs
  • Alerting layer – Slack is used to notify stakeholders

The serverless monitoring solution is devised to be loosely coupled as plug-and-play components that complement an existing data lake. State changes for the Lambda-based ETL tasks are tracked using AWS Lambda Destinations. We have used an SNS topic for routing both success and failure states for the Lambda-based tasks. In the case of AWS Glue-based tasks, we have configured EventBridge rules to capture state changes. These events are also routed to the same SNS topic. For demonstration purposes, this post only provides state monitoring for Lambda and AWS Glue, but you can extend the solution to other AWS services.

The following figure illustrates the architecture of the solution.

The architecture contains the following components:

  • EventBridge rules – EventBridge rules that capture the state change for the ETL tasks—in this case AWS Glue tasks. This can be extended to other supported services as the data lake grows.
  • SNS topic – An SNS topic that serves to catch all state events from the data lake.
  • Lambda function – The Lambda function is the subscriber to the SNS topic. It’s responsible for analyzing the state of the task run to do the following:
    • Persist the status of the task run.
    • Notify any failures to a Slack channel (a simplified sketch of this handler follows the list).
  • Athena database – The database where the monitoring metrics are persisted for analysis.
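
The following is a simplified Python sketch of such a subscriber function, not the repository's actual datalake-monitoring-lambda code. The S3 key layout, event fields, and Slack payload are assumptions for illustration; the webhook URL is read from the datalake-monitoring secret described later in this post.

import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")
secrets = boto3.client("secretsmanager")

# Placeholder bucket name; the repository derives this from the stack configuration.
MONITOR_BUCKET = os.environ.get("MONITOR_BUCKET", "my-monitor-bucket")

def handler(event, context):
    # Triggered by the SNS topic that receives ETL state-change events.
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])

        # Persist the raw state-change event for later analysis in Athena.
        key = f"monitor/{record['Sns']['MessageId']}.json"
        s3.put_object(Bucket=MONITOR_BUCKET, Key=key, Body=json.dumps(message))

        # Notify Slack only when a task run has failed.
        state = str(message.get("detail", {}).get("state", "")).upper()
        if state in ("FAILED", "ERROR", "TIMEOUT"):
            notify_slack(message)

def notify_slack(message):
    # The webhook URL is stored under the slack_webhook key of the datalake-monitoring secret.
    secret = secrets.get_secret_value(SecretId="datalake-monitoring")
    webhook_url = json.loads(secret["SecretString"])["slack_webhook"]

    payload = {"text": f"Data lake task failed: {json.dumps(message.get('detail', {}))}"}
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)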

Deploy the solution

The source code to implement this solution uses AWS Cloud Development Kit (AWS CDK) and is available on the GitHub repo monitor-serverless-datalake. This AWS CDK stack provisions required network components and the following:

  • Three S3 buckets (the bucket names are prefixed with the AWS account name and Regions, for example, the landing bucket is <aws-account-number>-<aws-region>-landing):
    • Landing
    • Processed
    • Monitor
  • Three Lambda functions:
    • datalake-monitoring-lambda
    • lambda-success
    • lambda-fail
  • Two AWS Glue crawlers:
    • glue-crawler-success
    • glue-crawler-fail
  • Two AWS Glue jobs:
    • glue-job-success
    • glue-job-fail
  • An SNS topic named datalake-monitor-sns
  • Three EventBridge rules:
    • glue-monitor-rule
    • event-rule-lambda-fail
    • event-rule-lambda-success
  • An AWS Secrets Manager secret named datalake-monitoring
  • Athena artifacts:
    • monitor database
    • monitor-table table

You can also follow the instructions in the GitHub repo to deploy the serverless monitoring solution. It takes about 10 minutes to deploy this solution.

Connect to a Slack channel

We still need a Slack channel to which the alerts are delivered. Complete the following steps:

  1. Set up a workflow automation to route messages to the Slack channel using webhooks.
  2. Note the webhook URL.

The following screenshot shows the field names to use.

The following is a sample message for the preceding template.

  1. On the Secrets Manager console, navigate to the datalake-monitoring secret.
  2. Add the webhook URL to the slack_webhook secret.

Load sample data

The next step is to load some sample data. Copy the sample data files to the landing bucket using the following command:

aws s3 cp --recursive s3://awsglue-datasets/examples/us-legislators s3://<AWS_ACCCOUNT>-<AWS_REGION>-landing/legislators

In the next sections, we show how Lambda functions, AWS Glue crawlers, and AWS Glue jobs work for data ingestion.

Test the Lambda functions

On the EventBridge console, enable the rules that trigger the lambda-success and lambda-fail functions every 5 minutes:

  • event-rule-lambda-fail
  • event-rule-lambda-success

After a few minutes, the failure events are relayed to the Slack channel. The following screenshot shows an example message.

Disable the rules after testing to avoid repeated messages.

Test the AWS Glue crawlers

On the AWS Glue console, navigate to the Crawlers page. Here you can start the following crawlers:

  • glue-crawler-success
  • glue-crawler-fail

In a minute, the glue-crawler-fail crawler’s status changes to Failed, which triggers a notification in Slack in near-real time.

Test the AWS Glue jobs

On the AWS Glue console, navigate to the Jobs page, where you can start the following jobs:

  • glue-job-success
  • glue-job-fail

In a few minutes, the glue-job-fail job status changes to Failed, which triggers a notification in Slack in near-real time.

Analyze the monitoring data

The monitoring metrics are persisted in Amazon S3 and can be used for historical analysis.

On the Athena console, navigate to the monitor database and run the following query to find the service that failed the most often:

SELECT service_type, count(*) as "fail_count"
FROM "monitor"."monitor"
WHERE event_type = 'failed'
group by service_type
order by fail_count desc;
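
If you prefer to run the same analysis programmatically rather than from the Athena console, the following boto3 sketch submits the query and prints the result rows; the output location is a placeholder bucket.

import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT service_type, count(*) AS fail_count
FROM "monitor"."monitor"
WHERE event_type = 'failed'
GROUP BY service_type
ORDER BY fail_count DESC
"""

# OutputLocation is a placeholder; point it at a bucket you own.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "monitor"},
    ResultConfiguration={"OutputLocation": "s3://<monitor-bucket>/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows (header row included).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])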

Over time, as rich observability data accumulates, time series analysis of the monitoring data will yield interesting findings.

Clean up

The overall cost of the solution is less than one dollar, but to avoid future costs, make sure to clean up the resources created as part of this post.

Summary

The post provided an overview of a serverless data lake monitoring solution that you can configure and deploy to integrate with enterprise serverless data lakes in just a few hours. With this solution, you can monitor a serverless data lake, send alerts in near-real time, and analyze performance metrics for all ETL tasks operating in the data lake. The design was intentionally kept simple to demonstrate the idea; you can further extend this solution with Athena and Amazon QuickSight to generate custom visuals and reporting. Check out the GitHub repo for a sample solution and further customize it for your monitoring needs.


About the Authors

Virendhar (Viru) Sivaraman is a strategic Senior Big Data & Analytics Architect with Amazon Web Services. He is passionate about building scalable big data and analytics solutions in the cloud. Besides work, he enjoys spending time with family, hiking & mountain biking.

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finds areas for home automation.

Configure SAML federation for Amazon OpenSearch Serverless with Okta

Post Syndicated from Aish Gunasekar original https://aws.amazon.com/blogs/big-data/configure-saml-federation-for-amazon-opensearch-serverless-with-okta/

Modern applications apply security controls across many systems and their subsystems. Keeping all of these systems in sync would be a major undertaking if you tried to implement it separately. Centralized identity management is the way to maintain a single identity provider (IdP) that can authenticate actors and manage and distribute their rights.

OpenSearch is an open-source search and analytics suite that enables you to ingest, store, analyze, and visualize full text and log data. Amazon OpenSearch Serverless makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud, freeing you from the undifferentiated heavy lifting of sizing, scaling, and operating an OpenSearch cluster. When you use OpenSearch Serverless, you can integrate with your existing Security Assertion Markup Language 2.0 (SAML)-compliant IdP to provide granular access control for your OpenSearch Serverless collections. Our customers use a variety of IdPs, including AWS IAM Identity Center (successor to AWS SSO), Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0.

In this post, you will learn how to use Okta as your IdP and integrate it with OpenSearch Serverless to securely manage your users and groups for secure access to your data.

Solution overview

The flow of access requests is depicted in the following figure.

When you navigate to OpenSearch Dashboards, the workflow steps are as follows:

  1. OpenSearch Serverless generates a SAML authentication request.
  2. OpenSearch Serverless redirects your request back to the browser.
  3. The browser redirects to the Okta URL via the Okta application setup.
  4. Okta parses the SAML request, authenticates the user, and generates a SAML response.
  5. Okta returns the encoded SAML response to the browser.
  6. The browser sends the SAML response back to the OpenSearch Serverless Assertion Consumer Services (ACS) URL.
  7. ACS verifies the SAML response and logs in the user with the permissions defined in the data access policy.

Prerequisites

Complete the following prerequisite steps:

  1. Create an OpenSearch Serverless collection. For instructions, refer to Preview: Amazon OpenSearch Serverless – Run Search and Analytics Workloads without Managing Clusters.
  2. Make a note of your AWS account ID to use while configuring your application in Okta.
  3. Create an Okta account, which you will use as an IdP.
  4. Create users and a group in Okta:
    1. Log in to your Okta account, and in the navigation pane, choose Directory, then choose Groups.
    2. Choose Add Group and name it opensearch-serverless, then choose Save.
    3. Choose Assign People to add users.
    4. You can add users to the opensearch-serverless group by choosing the plus sign next to the user name, or you can choose Add All.
    5. Add your users, then choose Save.
    6. To create new users, choose People in the navigation pane under Directory, then choose Add Person.
    7. Provide your first name, last name, user name (email ID), and primary email address.
    8. For Password, choose Set by admin and First-time password.
    9. To create your user, choose Save.
    10. In the navigation pane, choose Groups, then choose the opensearch-serverless group you created earlier.

The following graphic gives a quick demonstration of setting up a user and group.

Configure an application in Okta

To configure an application in Okta, complete the following steps:

  1. Navigate to the Applications page on the Okta console.
  2. Choose App Integration, select SAML 2.0 web application, then choose Next.
  3. For Name, enter a name for the app (for example, myweblogs), then choose Next.
  4. Under Application ACS URL, enter the URL using the format https://collection.<REGION>.aoss.amazonaws.com/_saml/acs (replace <REGION> with the corresponding Region) to generate the IdP metadata.
  5. Select Use this for Recipient URL and Destination URL to use the same ACS URL as the recipient and destination.
  6. Specify aws:opensearch:<AWS-Account-ID> under Audience URI (SP Entity ID). This specifies who the assertion is intended for within the SAML assertion.
  7. Under Group Attribute Statements, enter a name that is relevant to your application, such as mygroup, and select unspecified as the name format. (Don’t forget this name, you’ll need it later.)
  8. Select equals as the filter and enter opensearch-serverless.
  9. Select I’m a software vendor. I’d like to integrate my app with Okta and choose Finish.
  10. After an app is created, choose the sign-on tab, scroll down to the metadata details, and copy the value for Metadata URL.

The following graphic gives a quick demonstration of setting up an application in Okta via the preceding steps.

Next, you associate the users and groups to the application that you created in the previous step.

  1. On the Applications page, choose the app you created earlier.
  2. On the Assignments tab, choose Assign.
  3. Select Assign To Groups and choose the group you wish to assign to (opensearch-serverless in this case).
  4. Choose Done.

The following graphic gives a quick demonstration of assigning groups to the application via the preceding steps.

Set up SAML on OpenSearch Serverless

In this section, you create a SAML provider that you’ll use for your OpenSearch Serverless collection. Complete the following steps:

  1. Open the OpenSearch Serverless console on a new tab.
  2. In the navigation pane, under Serverless, choose SAML authentication.
  3. Select Add SAML provider.
  4. Provide a recognizable name (for example, okta) and a description.
  5. Open a new tab and enter the copied metadata URL into your browser.

You should see the metadata for the Okta application.

  1. Take note of this metadata and copy it to your clipboard.
  2. On the OpenSearch Service console tab, enter this metadata in the Provide metadata from your IdP section.
  3. Under Additional settings, enter mygroup or the group attribute provided in the Okta configuration.
  4. Choose Create a SAML provider.

The SAML provider has now been created.

The following graphic gives a quick demonstration of setting up the SAML provider in OpenSearch Serverless via the preceding steps.
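
If you want to script this step instead of using the console, the SAML provider can also be created with the AWS SDK for Python (boto3). This is a sketch only; verify the samlOptions field names against the OpenSearch Serverless CreateSecurityConfig API reference.

import boto3

aoss = boto3.client("opensearchserverless")

# Metadata XML previously copied from the Okta application's metadata URL.
with open("okta-metadata.xml") as f:
    idp_metadata = f.read()

response = aoss.create_security_config(
    name="okta",
    type="saml",
    description="Okta SAML provider for OpenSearch Serverless",
    samlOptions={
        "metadata": idp_metadata,
        "groupAttribute": "mygroup",  # the group attribute name configured in Okta
        "sessionTimeout": 60,         # minutes
    },
)
print(response)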

Update the data access policy

You need to configure the right permissions in the data access policies associated with your OpenSearch collection so your Okta group members can access the OpenSearch Dashboards endpoint.

  1. On the OpenSearch Serverless console, open your collection.
  2. Choose the data access policy associated with the collection in the Data Access section.
  3. Choose Edit.
  4. Choose Principals and Add a SAML principal.
  5. Select the SAML provider you created earlier and enter group/opensearch-serverless next to it.
  6. The OpenSearch Dashboards endpoint can be accessed by all group members. You can grant access to collections, indexes, or both.
  7. Choose Save.

Log in to OpenSearch Dashboards

Now that you have set permissions to access the dashboards, choose the Dashboards URL under the general information for the OpenSearch Serverless collection. This should take you to the website
https://collection-endpoint/_dashboards/

You will see a list with all the access options. Choose the SAML provider that you created (okta in this case) and log in using your Okta credentials. You will now be logged into OpenSearch Dashboards with the permissions that are part of the data access policy. You can perform searches or create visualizations from the dashboard.

Clean up

To avoid unwanted charges, delete the OpenSearch Serverless collection, data access policy, and SAML provider created as part of this demonstration.

Summary

In this post, you learned how to set up Okta as an IdP to access OpenSearch Dashboards using SAML. You also learned how to set up users and groups within Okta and configure their access to OpenSearch Dashboards. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to know more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.


About the Authors

Aish Gunasekar is a Specialist Solutions architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Developing with Java and Spring Boot using Amazon CodeWhisperer

Post Syndicated from Rajdeep Banerjee original https://aws.amazon.com/blogs/devops/developing-with-java-and-spring-boot-using-amazon-codewhisperer/

Developers often have to work with multiple programming languages depending on the task at hand. Sometimes, this is a result of choosing the right tool for a specific problem, or it is mandated by adhering to a specific technology adopted by a team. Within a specific programming language, developers may have to work with frameworks, software libraries, and popular cloud services from providers such as Amazon Web Services (AWS). This must be done while adhering to secure programming best practices. Despite these challenges, developers must continue to release code at a sufficiently high velocity.

Amazon CodeWhisperer is a real-time, AI coding companion that provides code suggestions in your IDE code editor. Developers can simply write a comment that outlines a specific task in plain English, such as “method to upload a file to S3.” Based on this, CodeWhisperer automatically determines which cloud services and public libraries are best suited to accomplish the task and recommends multiple code snippets directly in the IDE. The code is generated based on the context of your file, such as comments as well as surrounding source code and import statements. CodeWhisperer is available as part of the AWS Toolkit for Visual Studio Code and the JetBrains family of IDEs. CodeWhisperer is also available for AWS Cloud9, the AWS Lambda console, JupyterLab, Amazon SageMaker Studio, and AWS Glue Studio. CodeWhisperer supports popular programming languages like Java, Python, C#, TypeScript, Go, JavaScript, Rust, PHP, Kotlin, C, C++, shell scripting, SQL, and Scala.

In this post, we will explore how to leverage CodeWhisperer in Java applications, specifically using the Spring Boot framework. Spring Boot is an extension of the Spring framework that makes it easier to develop Java applications and microservices. Using CodeWhisperer, you will spend less time creating boilerplate and repetitive code and more time focusing on business logic. You can generate entire Java Spring Boot functions and logical code blocks without having to search for code snippets from the web and customize them according to your requirements. CodeWhisperer enables you to responsibly use AI to create syntactically correct and secure Java Spring Boot applications. To enable CodeWhisperer in your IDE, please see Setting up CodeWhisperer for VS Code or Setting up Amazon CodeWhisperer for JetBrains, depending on which IDE you are using.

Note: CodeWhisperer uses artificial intelligence to provide code recommendations, and this is non-deterministic. The code you get from Amazon CodeWhisperer might differ from what is shown here.

Creating Data Transfer Objects (DTO)

Amazon CodeWhisperer makes it easier to develop the classes as you include import statements and provide brief comments on the purpose of the class.  Let’s start with the basics and develop a simple DTO or Plain Old Java Object (POJO).  This class will contain properties representing a product.  This DTO will be referenced later as part of a REST controller we generate to serialize the output to JSON.  CodeWhisperer will create a DTO class by using the class name and comments provided in plain language. Detailed and contextual comments will enable CodeWhisperer to generate code suggestions ranging from snippets to full functions in real time. For this use case, you are going to create a product class with id, name, price, description and rating properties.

Type the following or a similar comment in the class:

package com.amazonws.demo.cart.dto;

//create a Product class with id, name, price, description and rating properties. 

Quickly develop a Java class using Amazon Codewhisperer

After entering the comment and pressing ENTER, CodeWhisperer will start providing code suggestions. You can use the Tab key to accept a suggestion based on the context or use the left/right arrow keys to see more suggestions. As shown below, the product class is auto-generated with five properties (id, name, price, rating, and description), default getter/setter methods, and two constructors. If you need more properties, you can either update the comment to include them or manually add them in the file:

package com.amazonws.demo.cart.dto;

//create a Product class with id, name, price, description and rating properties. 

public class Product {
    private String id;
    private String name;
    private Double price;
    private String description;
    private Integer rating;
    
    public Product() {
    }
  
    public Product(String id, String name, Double price) {
      this.id = id;
      this.name = name;
      setPrice(this.price = price);
    }
  
    public String getId() {
      return id;
    }
  
    public void setId(String id) {
      this.id = id;
    }
  
    public String getName() {
      return name;
    }
  
    public void setName(String name) {
      this.name = name;
    }
  
    public Double getPrice() {
      return price;
    }
  
    public void setPrice(Double price) {
      this.price = price;
    }
    
    public String getDescription(){
      return description;
    }

    public void setDescription(String description){
      this.description = description;
    }

    public Integer getRating(){
      return rating;
    }
    
    public void setRating(Integer rating){
      this.rating = rating;
    }
}

Implementing the Data Access Object (DAO) pattern

Next, we implement the DAO pattern, in this case for Amazon DynamoDB. The DAO pattern allows you to decouple the application and business layers from the persistence layer. It contains all the implementation logic for interacting with the persistence layer. We will create an entity class that represents the data to persist in DynamoDB, along with a DAO class that contains the persistence logic.

First, create a ProductDaoEntity class that maps to the Amazon DynamoDB table. Create a blank ProductDaoEntity class and import the DynamoDB packages for annotations, attributes, and partition key, as shown below. Notice that the class has a comment about the class structure and the use of the DynamoDB enhanced client, so that CodeWhisperer can provide meaningful suggestions. The enhanced client allows you to map client-side classes to DynamoDB tables.

package com.amazonws.demo.cart.dao.entity;

import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbAttribute;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbBean;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbPartitionKey;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbSortKey;

//create a dynamo db ProductDaoEntity class with partition Key as id, name, price, description, rating attributes using dynamo db enhanced mapper annotations

CodeWhisperer can now infer from the comment, the context, and the import statements, and start to generate the class implementation. You can accept or reject suggestions based on your requirements. Below, you can see the complete class generated by CodeWhisperer.

@DynamoDbBean
public class ProductDaoEntity {

    private String id;
    private String name;
    private double price;
    private String description;
    private int rating;
    
    public ProductDaoEntity() {
    }

    public ProductDaoEntity(String id, String name, double price, String description, int rating) {
        this.id = id;
        this.name = name;
        this.price = price;
        this.description = description;
        this.rating = rating;
    }

    @DynamoDbPartitionKey
    @DynamoDbAttribute("id")
    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    @DynamoDbSortKey
    @DynamoDbAttribute("name")
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @DynamoDbAttribute("price")
    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }

    @DynamoDbAttribute("description")
    public String getDescription() {
        return description;
    }

    public void setDescription(String description) {
        this.description = description;
    }

    @DynamoDbAttribute("rating")
    public int getRating() {
        return rating;
    }

    public void setRating(int rating) {
        this.rating = rating;
    }
    
    @Override
    public String toString() {
        return "ProductDaoEntity [id=" + id + ", name=" + name + ", price=" + price + ", description=" + description
                + ", rating=" + rating + "]";
    }

}

Notice how CodeWhisperer includes the appropriate DynamoDB-related annotations such as @DynamoDbBean, @DynamoDbPartitionKey, @DynamoDbSortKey, and @DynamoDbAttribute. These annotations are used to generate a TableSchema for mapping classes to tables.

Now that you have the mapper methods completed, you can create the actual persistence logic that is specific to DynamoDB. Create a class named ProductDaoImpl that uses a DynamoDbEnhancedClient object, as shown below. (Note: it’s a best practice for a DAO implementation class to implement a DAO interface; we left that out for brevity.) Using the import statements and comments, CodeWhisperer can auto-generate most of the DynamoDB persistence logic for you.

package com.amazonws.demo.cart.dao;

import javax.annotation.PostConstruct;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.amazonws.demo.cart.dao.Mapper.ProductMapper;
import com.amazonws.demo.cart.dao.entity.ProductDaoEntity;
import com.amazonws.demo.cart.dto.Product;

import software.amazon.awssdk.core.internal.waiters.ResponseOrException;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.Key;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;


@Component
public class ProductDaoImpl{
    private static final Logger logger = LoggerFactory.getLogger(ProductDaoImpl.class);
    private static final String PRODUCT_TABLE_NAME = "Products";
    private final DynamoDbEnhancedClient enhancedClient;

    @Autowired
    public ProductDaoImpl(DynamoDbEnhancedClient enhancedClient){
        this.enhancedClient = enhancedClient;

    }

Rather than providing comments that describe the functionality of the entire class, you can provide comments for each specific method here. You will use CodeWhisperer to generate the implementation details for interacting with DynamoDB. If the Products table doesn’t already exist, you will need to create it. Based on the comment, CodeWhisperer will generate a method to create a Products table if one does not exist. As you can see, you don’t have to memorize or search through the DynamoDB API documentation to implement this logic. CodeWhisperer will save you time and effort by giving contextualized suggestions.

//Create the DynamoDB table through enhancedClient object from ProductDaoEntity. If the table already exists, log the error.
    @PostConstruct
    public void createTable() {
        try {
            DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
            productTable.createTable();
        } catch (Exception e) {
            logger.error("Error creating table: ", e);
        }
    }

Now, you can create the CRUD operations for the Product object. You can start with the createProduct operation to insert a new product entity to the DynamoDB table. Provide a comment about the purpose of the method along with relevant implementation details.

    // Create the createProduct() method 
    // Insert the ProductDaoEntity object into the DynamoDB table
    // Return the Product object

CodeWhisperer will start auto-generating the Create operation as shown below. You can accept or reject the suggestions as needed, or select from alternate suggestions, if available, using the left/right arrow keys.

   // Create the createProduct() method
   // Insert the ProductDaoEntity object into the DynamoDB table
   // Return the Product object
    public ProductDaoEntity createProduct(ProductDaoEntity productDaoEntity) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        productTable.putItem(productDaoEntity);
        return productDaoEntity;
    }  

Similarly, you can generate a method to return a specific product by id. Provide a contextual comment, as shown below.

// Get a particular ProductDaoEntity object from the DynamoDB table using the
 // product id and return the Product object

Below is the auto-generated code. CodeWhisperer has correctly analyzed the comments and generated the method to get a Product by its id.

    //Get a particular ProductDaoEntity object from the DynamoDB table using the
    // product id and return the Product object
    
    public ProductDaoEntity getProduct(String productId) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        ProductDaoEntity productDaoEntity = productTable.getItem(Key.builder().partitionValue(productId).build());
        return productDaoEntity;
    }

Similarly, you can implement the DAO layer to update and delete products in the DynamoDB table.
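
For reference, a minimal sketch of what those methods might look like in ProductDaoImpl is shown below, following the same enhanced client pattern as the generated code above. The method names and signatures here are assumptions for illustration; the suggestions you receive from CodeWhisperer may differ.

    // Update an existing ProductDaoEntity item in the DynamoDB table and return the updated entity
    public ProductDaoEntity updateProduct(ProductDaoEntity productDaoEntity) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        return productTable.updateItem(productDaoEntity);
    }

    // Delete a ProductDaoEntity item from the DynamoDB table using the product id, mirroring the key usage in getProduct()
    public void deleteProduct(String productId) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        productTable.deleteItem(Key.builder().partitionValue(productId).build());
    }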

Creating a Service Object

Next, you will generate the ProductService class, which retrieves products using the ProductDao. In Spring Boot, annotating a class with @Service allows it to be detected through classpath scanning.

Let’s provide a comment to generate the ProductService class:

package com.amazonws.demo.cart.service;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import com.amazonws.demo.cart.dto.Product;
import com.amazonws.demo.cart.dao.ProductDao;

//Create a class called ProductService with methods: getProductById(string id),
//getAllProducts(), updateProduct(Product product), 
//deleteProduct(string id), createProduct(Product product)

CodeWhisperer will create the following class implementation.  Note, you may have to adjust return types or method parameter types as needed.  Notice the @Service annotation for this class along with the productDao property being @Autowired.

@Service
public class ProductService {

   @Autowired
   ProductDao productDao;

   public Product getProductById(String id) {
      return productDao.getProductById(id);
   }

   public List<Product> getAllProducts() {
      return productDao.getAllProducts();
   }

   public Product updateProduct(Product product) {
      return productDao.updateProduct(product);
   }

   public void deleteProduct(String id) {
      productDao.deleteProduct(id);
   }

   public Product createProduct(Product product) {
      return productDao.createProduct(product);
   }

}

Creating a REST Controller

The REST controller handles incoming client HTTP requests, and its output is typically serialized into JSON or XML format. Using annotations, Spring Boot maps HTTP methods such as GET, PUT, POST, and DELETE to the appropriate methods within the controller. It also binds HTTP request data to parameters defined within the controller methods.

Provide a comment as shown below specifying that the class is a REST controller that should support CORS along with the required methods.

package com.amazonws.demo.product.controller;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.CrossOrigin;
import org.springframework.web.bind.annotation.DeleteMapping;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import com.amazonws.demo.product.dto.Product;
import com.amazonws.demo.product.service.ProductService;

//create a RestController called ProductController to get all
//products, get a product by id, create a product, update a product,
//and delete a product. support cross origin requests from all origins.
 

Notice how the appropriate annotations are added to support CORS along with the mapping annotations that correspond with the GET, PUT, POST and DELETE HTTP methods. The @RestController annotation is used to specify that this controller returns an object serialized as XML or JSON rather than a view.

@RestController
@RequestMapping("/product")
@CrossOrigin(origins = "*")
public class ProductController {

    @Autowired
    private ProductService productService;
    
    @GetMapping("/getAllProducts")
    public List<Product> getAllProducts() {
        return productService.getAllProducts();
    }

    @GetMapping("/getProductById/{id}")
    public Product getProductById(@PathVariable String id) {
        return productService.getProductById(id);
    }

    @PostMapping("/createProduct")
    public Product createProduct(@RequestBody Product product) {
        return productService.createProduct(product);
    }

    @PutMapping("/updateProduct")
    public Product updateProduct(@RequestBody Product product) {
        return productService.updateProduct(product);
    }

    @DeleteMapping("/deleteProduct/{id}")
    public void deleteProduct(@PathVariable String id) {
        productService.deleteProduct(id);
    }

}
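
One piece that the generated code above depends on, but that is not shown in this post, is the DynamoDbEnhancedClient bean that ProductDaoImpl autowires. A minimal configuration sketch that provides this bean is shown below; the class name and Region are assumptions, so adjust them for your environment.

package com.amazonws.demo.cart.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

@Configuration
public class DynamoDbConfig {

    // Standard DynamoDB client; credentials are resolved from the default provider chain
    @Bean
    public DynamoDbClient dynamoDbClient() {
        return DynamoDbClient.builder()
                .region(Region.US_EAST_1) // assumption: use the Region where your table lives
                .build();
    }

    // Enhanced client that wraps the standard client and is injected into ProductDaoImpl
    @Bean
    public DynamoDbEnhancedClient dynamoDbEnhancedClient(DynamoDbClient dynamoDbClient) {
        return DynamoDbEnhancedClient.builder()
                .dynamoDbClient(dynamoDbClient)
                .build();
    }
}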

Conclusion

In this post, you have used CodeWhisperer to generate DTOs, controllers, service objects, and persistence classes. By inferring intent from your natural-language comments, CodeWhisperer provides contextual code snippets to accelerate your development. In addition, CodeWhisperer has features like the reference tracker, which detects whether a code suggestion might resemble open-source training data and can flag such suggestions with the open-source project’s repository URL, file reference, and license information for your review before you decide whether to incorporate the suggested code.

Try out Amazon CodeWhisperer today to get a head start on your coding projects.

Rajdeep Banerjee

Rajdeep Banerjee is a Senior Partner Solutions Architect at AWS helping strategic partners and clients in the AWS cloud migration and digital transformation journey. Rajdeep focuses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and design solutions to meet their specific needs. He is a member of the Serverless technical field community. Rajdeep is based out of Richmond, Virginia.

Jason Varghese

Jason is a Senior Solutions Architect at AWS guiding enterprise customers on their cloud migration and modernization journeys. He has served in multiple engineering leadership roles and has over 20 years of experience architecting, designing and building scalable software solutions. Jason holds a bachelor’s degree in computer engineering from the University of Oklahoma and an MBA from the University of Central Oklahoma.

Configure fine-grained access to your resources shared using AWS Resource Access Manager

Post Syndicated from Fabian Labat original https://aws.amazon.com/blogs/security/configure-fine-grained-access-to-your-resources-shared-using-aws-resource-access-manager/

You can use AWS Resource Access Manager (AWS RAM) to securely, simply, and consistently share supported resource types within your organization or organizational units (OUs) and across AWS accounts. This means you can provision your resources once and use AWS RAM to share them with accounts. With AWS RAM, the accounts that receive the shared resources can list those resources alongside the resources they own.

When you share your resources by using AWS RAM, you can specify the actions that an account can perform and the access conditions on the shared resource. AWS RAM provides AWS managed permissions, which are created and maintained by AWS and which grant permissions for common customer scenarios. Now, you can further tailor resource access by authoring and applying fine-grained customer managed permissions in AWS RAM. A customer managed permission is a managed permission that you create to precisely specify who can do what under which conditions for the resource types included in your resource share.

This blog post walks you through how to use customer managed permissions to tailor your resource access to meet your business and security needs. Customer managed permissions help you follow the best practice of least privilege for your resources that are shared using AWS RAM.

Considerations

Before you start, review the considerations for using customer managed permissions for supported resource types in the AWS RAM User Guide.

Solution overview

Many AWS customers share infrastructure services to accounts in an organization from a centralized infrastructure OU. The networking account in the infrastructure OU follows the best practice of least privilege and grants only the permissions that accounts receiving these resources, such as development accounts, require to perform a specific task. The solution in this post demonstrates how you can share an Amazon Virtual Private Cloud (Amazon VPC) IP Address Manager (IPAM) pool with the accounts in a Development OU. IPAM makes it simpler for you to plan, track, and monitor IP addresses for your AWS workloads.

You’ll use a networking account that owns an IPAM pool to share the pool with the accounts in a Development OU. You’ll do this by creating a resource share and a customer managed permission through AWS RAM. In this example, shown in Figure 1, both the networking account and the Development OU are in the same organization. The accounts in the Development OU only need the permissions that are required to allocate a classless inter-domain routing (CIDR) range and not to view the IPAM pool details. You’ll further refine access to the shared IPAM pool so that only AWS Identity and Access Management (IAM) users or roles tagged with team = networking can perform actions on the IPAM pool that’s shared using AWS RAM.

Figure 1: Multi-account diagram for sharing your IPAM pool from a networking account in the Infrastructure OU to accounts in the Development OU

Prerequisites

For this walkthrough, you must have the following prerequisites:

  • An AWS account (the networking account) with an IPAM pool already provisioned. For this example, create an IPAM pool in a networking account named ipam-vpc-pool-use1-dev. Because you share resources across accounts in the same AWS Region using AWS RAM, provision the IPAM pool in the same Region where your development accounts will access the pool.
  • An AWS OU with the associated development accounts to share the IPAM pool with. In this example, these accounts are in your Development OU.
  • An IAM role or user with permissions to perform IPAM and AWS RAM operations in the networking account and the development accounts.

Share your IPAM pool with your Development OU with least privilege permissions

In this section, you share an IPAM pool from your networking account to the accounts in your Development OU and grant least-privilege permissions. To do that, you create a resource share that contains your IPAM pool, your customer managed permission for the IPAM pool, and the OU principal you want to share the IPAM pool with. A resource share contains resources you want to share, the principals you want to share the resources with, and the managed permissions that grant resource access to the account receiving the resources. You can add the IPAM pool to an existing resource share, or you can create a new resource share. Depending on your workflow, you can start creating a resource share either in the Amazon VPC IPAM or in the AWS RAM console.

To initiate a new resource share from the Amazon VPC IPAM console

  1. Sign in to the AWS Management Console as your networking account. For Features, select Amazon VPC IP Address Manager console.
  2. Select ipam-vpc-pool-use1-dev, which was provisioned as part of the prerequisites.
  3. On the IPAM pool detail page, choose the Resource sharing tab.
  4. Choose Create resource share.
     
Figure 2: Create resource share to share your IPAM pool

Alternatively, you can initiate a new resource share from the AWS RAM console.

To initiate a new resource share from the AWS RAM console

  1. Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
  2. Choose Create resource share.

Next, specify the resource share details, including the name, the resource type, and the specific resource you want to share. Note that the steps of the resource share creation process are located on the left side of the AWS RAM console.

To specify the resource share details

  1. For Name, enter ipam-shared-dev-pool.
  2. For Select resource type, choose IPAM pools.
  3. For Resources, select the Amazon Resource Name (ARN) of the IPAM pool you want to share from a list of the IPAM pool ARNs you own.
  4. Choose Next.
     
Figure 3: Specify the resources to share in your resource share

Configure customer managed permissions

In this example, the accounts in the Development OU need the permissions required to allocate a CIDR range, but not the permissions to view the IPAM pool details. The existing AWS managed permission grants both read and write permissions. Therefore, you need to create a customer managed permission to refine the resource access permissions for your accounts in the Development OU. With a customer managed permission, you can select and tailor the actions that the development accounts can perform on the IPAM pool, such as write-only actions.

In this section, you create a customer managed permission, configure the managed permission name, select the resource type, and choose the actions that are allowed with the shared resource.

To create and author a customer managed permission

  1. On the Associate managed permissions page, choose Create customer managed permission. This will bring up a new browser tab with a Create a customer managed permission page.
  2. On the Create a customer managed permission page, enter my-ipam-cmp for the Customer managed permission name.
  3. Confirm the Resource type as ec2:IpamPool.
  4. On the Visual editor tab of the Policy template section, select the Write checkbox only. This will automatically check all the available write actions.
  5. Choose Create customer managed permission.
     
Figure 4: Create a customer managed permission with only write actions

Now that you’ve created your customer managed permission, you must associate it to your resource share.

To associate your customer managed permission

  1. Go back to the previous Associate managed permissions page. This is most likely located in a separate browser tab.
  2. Choose the refresh icon.
  3. Select my-ipam-cmp from the dropdown menu.
  4. Review the policy template, and then choose Next.

Next, select the IAM roles, IAM users, AWS accounts, AWS OUs, or organization you want to share your IPAM pool with. In this example, you share the IPAM pool with an OU in your account.

To grant access to principals

  1. On the Grant access to principals page, select Allow sharing only with your organization.
  2. For Select principal type, choose Organizational unit (OU).
  3. Enter the Development OU’s ID.
  4. Select Add, and then choose Next.
  5. Choose Create resource share to complete creation of your resource share.
     
Figure 5: Grant access to principals in your resource share

Verify the customer managed permissions

Now let’s verify that the customer managed permission is working as expected. In this section, you verify that the development account cannot view the details of the IPAM pool and that you can use that same account to create a VPC with the IPAM pool.

To verify that an account in your Development OU can’t view the IPAM pool details

  1. Sign in to the AWS Management Console as an account in your Development OU. For Features, select Amazon VPC IP Address Manager console.
  2. In the left navigation pane, choose Pools.
  3. Select ipam-shared-dev-pool. You won’t be able to view the IPAM pool details.

To verify that an account in your Development OU can create a new VPC with the IPAM pool

  1. Sign in to the AWS Management Console as an account in your Development OU. For Services, select VPC console.
  2. On the VPC dashboard, choose Create VPC.
  3. On the Create VPC page, select VPC only.
  4. For name, enter my-dev-vpc.
  5. Select IPAM-allocated IPv4 CIDR block.
  6. Choose the ARN of the IPAM pool that’s shared with your development account.
  7. For Netmask, select /24 (256 IPs).
  8. Choose Create VPC. You’ve successfully created a VPC with the IPAM pool shared with your account in your Development OU.
     
Figure 6: Create a VPC
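
The console steps above can also be performed programmatically from the development account. The following is a minimal sketch using the AWS SDK for Java 2.x; the pool ID is a placeholder, and the builder fields mirror the Ipv4IpamPoolId and Ipv4NetmaskLength parameters of the EC2 CreateVpc API.

import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.CreateVpcRequest;
import software.amazon.awssdk.services.ec2.model.CreateVpcResponse;

public class CreateVpcFromSharedIpamPool {
    public static void main(String[] args) {
        // Credentials and Region are resolved from the development account's environment
        try (Ec2Client ec2 = Ec2Client.create()) {
            CreateVpcRequest request = CreateVpcRequest.builder()
                    .ipv4IpamPoolId("ipam-pool-0123456789abcdef0") // placeholder: ID of the shared IPAM pool
                    .ipv4NetmaskLength(24)                         // /24, matching the console example
                    .build();

            CreateVpcResponse response = ec2.createVpc(request);
            System.out.println("Created VPC: " + response.vpc().vpcId());
        }
    }
}

If the calling IAM role or user is not allowed by the customer managed permission (for example, after the tag condition is added in the next section), this call fails with an authorization error instead of creating the VPC.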

Update customer managed permissions

You can create a new version of your customer managed permission to rescope and update the access granularity of your resources that are shared using AWS RAM. For example, you can add a condition in your customer managed permission so that only IAM users or roles tagged with a particular principal tag can access and perform actions on resources shared using AWS RAM. If you need to update your customer managed permission, for example after testing or as your business and security needs evolve, you can create and save a new version of the same customer managed permission rather than creating an entirely new customer managed permission. For example, you might want to adjust your access configuration to read-only actions for your development accounts and rescope to read-write actions for your testing accounts. The new version of the permission won't apply automatically to your existing resource shares; you must explicitly apply it to those shares for it to take effect.

To create a version of your customer managed permission

  1. Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
  2. In the left navigation pane, choose Managed permissions library.
  3. For Filter by text, enter my-ipam-cmp and select my-ipam-cmp. You can also select the Any type dropdown menu and then select Customer managed to narrow the list of managed permissions to only your customer managed permissions.
  4. On the my-ipam-cmp page, choose Create version.
  5. You can make the customer managed permission more fine-grained by adding a condition. On the Create a customer managed permission for my-ipam-cmp page, under the Policy template section, choose JSON editor.
  6. Add a condition with aws:PrincipalTag that allows only the users or roles tagged with team = networking to access the shared IPAM pool.
    "Condition": {
                    "StringEquals": {
                        "aws:PrincipalTag/team": "networking"
                    }
                }

  7. Choose Create version. This new version will be automatically set as the default version of your customer managed permission. As a result, new resource shares that use the customer managed permission will use the new version.
     
Figure 7: Update your customer managed permissions and add a condition statement with aws:PrincipalTag

Note: Now that you have the new version of your customer managed permission, you must explicitly apply it to your existing resource shares for it to take effect.

To apply the new version of the customer managed permission to existing resource shares

  1. On the my-ipam-cmp page, under the Managed permission versions, select Version 1.
  2. Choose the Associated resource shares tab.
  3. Find ipam-shared-dev-pool and next to the current version number, select Update to default version. This will update your ipam-shared-dev-pool resource share with the new version of your my-ipam-cmp customer managed permission.

To verify your updated customer managed permission, see the Verify the customer managed permissions section earlier in this post. Make sure that you sign in with an IAM role or user tagged with team = networking, and then repeat the steps of that section to verify your updated customer managed permission. If you use an IAM role or user that is not tagged with team = networking, you won’t be able to allocate a CIDR from the IPAM pool and you won’t be able to create the VPC.

Cleanup

To remove the resources created by the preceding example:

  1. Delete the resource share from the AWS RAM console.
  2. Deprovision the CIDR from the IPAM pool.
  3. Delete the IPAM pool you created.

Summary

This blog post presented an example of using customer managed permissions in AWS RAM. AWS RAM brings simplicity, consistency, and confidence when sharing your resources across accounts. In the example, you used AWS RAM to share an IPAM pool to accounts in a Development OU, configured fine-grained resource access controls, and followed the best practice of least privilege by granting only the permissions required for the accounts in the Development OU to perform a specific task with the shared IPAM pool. In the example, you also created a new version of your customer managed permission to rescope the access granularity of your resources that are shared using AWS RAM.

To learn more about AWS RAM and customer managed permissions, see the AWS RAM documentation and watch the AWS RAM Introduces Customer Managed Permissions demo.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Fabian Labat

Fabian is a principal solutions architect based in New York, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 25 years of technology experience in system design and IT infrastructure.

Nini Ren

Nini is the product manager for AWS Resource Access Manager (RAM). He enjoys working closely with customers to develop solutions that not only meet their needs, but also create value for their businesses. Nini holds an MBA from The Wharton School, a masters of computer and information technology from the University of Pennsylvania, and an AB in chemistry and physics from Harvard College.

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/

With the rapid growth of technology, more and more data is coming in many different formats: structured, semi-structured, and unstructured. Data analytics on operational data in near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to achieve better scalability and performance. In most real-world use cases, it's important to replicate the data from the relational database source to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and reflect them in other data stores.

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. Although it’s straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this CDC process on your data lakes. Apache Hudi, an open-source data management framework used to simplify incremental data processing and data pipeline development, is a good option to solve this problem.

This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Solution overview

We use an AWS DMS task to capture near-real-time changes in the source RDS instance, and use Amazon Kinesis Data Streams as a destination of the AWS DMS task CDC replication. An AWS Glue streaming job reads and enriches changed records from Kinesis Data Streams and performs an upsert into the S3 data lake in Apache Hudi format. Then we can query the data with Amazon Athena and visualize it in Amazon QuickSight. AWS Glue natively supports continuous write operations for streaming data to Apache Hudi-based tables.

The following diagram illustrates the architecture used for this post, which is deployed through an AWS CloudFormation template.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Source data overview

To illustrate our use case, we assume a data analyst persona who is interested in analyzing near-real-time data for sport events using the table ticket_activity. An example of this table is shown in the following screenshot.

Apache Hudi connector for AWS Glue

For this post, we use AWS Glue 4.0, which already has native support for the Hudi framework. Hudi, an open-source data lake framework, simplifies incremental data processing in data lakes built on Amazon S3. It enables capabilities including time travel queries, ACID (Atomicity, Consistency, Isolation, Durability) transactions, streaming ingestion, CDC, upserts, and deletes.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An RDS database instance (source).
  • An AWS DMS replication instance, used to replicate the data from the source table to Kinesis Data Streams.
  • A Kinesis data stream.
  • Four AWS Glue Python shell jobs:
    • rds-ingest-rds-setup-<CloudFormation Stack name> – Creates one source table called ticket_activity on Amazon RDS.
    • rds-ingest-data-initial-<CloudFormation Stack name> – Generates sample data at random using the Faker library and loads it into the ticket_activity table.
    • rds-ingest-data-incremental-<CloudFormation Stack name> – Ingests new ticket activity data into the source table ticket_activity continuously. This job simulates customer activity.
    • rds-upsert-data-<CloudFormation Stack name> – Upserts specific records in the source table ticket_activity. This job simulates administrator activity.
  • AWS Identity and Access Management (IAM) users and policies.
  • An Amazon VPC, a public subnet, two private subnets, internet gateway, NAT gateway, and route tables.
    • We use private subnets for the RDS database instance and AWS DMS replication instance.
    • We use the NAT gateway to have reachability to pypi.org to use the MySQL connector for Python from the AWS Glue Python shell jobs. It also provides reachability to Kinesis Data Streams and an Amazon S3 API endpoint.

To set up these resources, you must have the following prerequisites:

The following diagram illustrates the architecture of our provisioned resources.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack
  3. Choose Next.
  4. For S3BucketName, enter the name of your new S3 bucket.
  5. For VPCCIDR, enter a CIDR IP address range that doesn’t conflict with your existing networks.
  6. For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  7. For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  8. For SubnetAzA and SubnetAzB, choose the subnets you want to use.
  9. For DatabaseUserName, enter your database user name.
  10. For DatabaseUserPassword, enter your database user password.
  11. Choose Next.
  12. On the next page, choose Next.
  13. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  14. Choose Create stack.

Stack creation can take about 20 minutes.

Set up an initial source table

The AWS Glue job rds-ingest-rds-setup-<CloudFormation stack name> creates a source table called ticket_activity on the RDS database instance. To set up the initial source table in Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-rds-setup-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

This job creates only one table, ticket_activity, in the MySQL instance. See the following DDL code:

CREATE TABLE ticket_activity (
ticketactivity_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
sport_type VARCHAR(256) NOT NULL,
start_date DATETIME NOT NULL,
location VARCHAR(256) NOT NULL,
seat_level VARCHAR(256) NOT NULL,
seat_location VARCHAR(256) NOT NULL,
ticket_price INT NOT NULL,
customer_name VARCHAR(256) NOT NULL,
email_address VARCHAR(256) NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL )

Ingest new records

In this section, we detail the steps to ingest new records. Implement the following steps to start the execution of the jobs.

Start data ingestion to Kinesis Data Streams using AWS DMS

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task rds-to-kinesis-<CloudFormation stack name>.
  3. On the Actions menu, choose Restart/Resume.
  4. Wait for the status to show as Load complete and Replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon S3

Next, to start data ingestion from Kinesis Data Streams to Amazon S3, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose streaming-cdc-kinesis2hudi-<CloudFormation stack name> to open the job.
  3. Choose Run.

Do not stop this job; you can check the run status on the Runs tab and wait for it to show as Running.

Start the data load to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-initial-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

Validate the ingested data

About 2 minutes after starting the job, the data should be ingested into Amazon S3. To validate the ingested data in Athena, complete the following steps:

  1. On the Athena console, complete the following steps if you’re running an Athena query for the first time:
    • On the Settings tab, choose Manage.
    • Specify the stage directory and the S3 path where Athena saves the query results.
    • Choose Save.

  2. On the Editor tab, run the following query against the table to check the data:
SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

Note that AWS CloudFormation creates the database with your account number in the name, as database_<your-account-number>_hudi_cdc_demo.

Update existing records

Before you update the existing records, note down the ticketactivity_id value of a record from the ticket_activity table. Run the following SQL using Athena. For this post, we use ticketactivity_id = 46 as an example:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

To simulate a real-time use case, update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to Amazon S3. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-incremental-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Choose the Runs tab and wait for Run status to show as SUCCEEDED.

To upsert the records in the source table, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job rds-upsert-data-<CloudFormation stack name>.
  3. On the Job details tab, under Advanced properties, for Job parameters, update the following parameters:
    • For Key, enter --ticketactivity_id.
    • For Value, replace 1 with one of the ticket IDs you noted above (for this post, 46).

  4. Choose Save.
  5. Choose Run and wait for the Run status to show as SUCCEEDED.

This AWS Glue Python shell job simulates a customer activity to buy a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticketactivity_id. It will update ticket_price=500 and updated_at with the current timestamp.

To validate the ingested data in Amazon S3, run the same query from Athena and check the record with the ticketactivity_id you noted earlier to observe the ticket_price and updated_at fields:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" where ticketactivity_id = 46 ;

Visualize the data in QuickSight

After you have the output file generated by the AWS Glue streaming job in the S3 bucket, you can use QuickSight to visualize the Hudi data files. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

Build a QuickSight dashboard

To build a QuickSight dashboard, complete the following steps:

  1. Open the QuickSight console.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

  1. On the QuickSight console, choose your user name and choose Manage QuickSight.
  2. Choose Security & permissions, then choose Manage.
  3. Select Amazon S3 and select the buckets that you created earlier with AWS CloudFormation.
  4. Select Amazon Athena.
  5. Choose Save.
  6. If you changed your Region during the first step of this process, change it back to the Region that you used earlier during the AWS Glue jobs.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena.
  4. For Data source name, enter a name (for example, hudi-blog).
  5. Choose Validate.
  6. After the validation is successful, choose Create data source.
  7. For Database, choose database_<your-account-number>_hudi_cdc_demo.
  8. For Tables, select ticket_activity.
  9. Choose Select.
  10. Choose Visualize.
  11. Choose hour and then ticketactivity_id to get the count of ticketactivity_id by hour.

Clean up

To clean up your resources, complete the following steps:

  1. Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
  2. Navigate to the RDS database and choose Modify.
  3. Deselect Enable deletion protection, then choose Continue.
  4. Stop the AWS Glue streaming job streaming-cdc-kinesis2hudi-<CloudFormation stack name>.
  5. Delete the CloudFormation stack.
  6. On the QuickSight dashboard, choose your user name, then choose Manage QuickSight.
  7. Choose Account settings, then choose Delete account.
  8. Choose Delete account to confirm.
  9. Enter confirm and choose Delete account.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon S3 using an AWS Glue streaming job to create an Apache Hudi-based near-real-time transactional data lake. With this approach, you can easily achieve upsert use cases on Amazon S3. We also showcased how to visualize the Apache Hudi table using QuickSight and Athena. As a next step, refer to the Apache Hudi performance tuning guide for a high-volume dataset. To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.


About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Sundeep Kumar is a Sr. Data Architect, Data Lake at AWS, helping customers build data lakes and analytics platforms and solutions. When not building and designing data lakes, Sundeep enjoys listening to music and playing guitar.

Estimating Scope 1 Carbon Footprint with Amazon Athena

Post Syndicated from Thomas Burns original https://aws.amazon.com/blogs/big-data/estimating-scope-1-carbon-footprint-with-amazon-athena/

Today, more than 400 organizations have signed The Climate Pledge, a commitment to reach net-zero carbon by 2040. Some of the drivers that lead to setting explicit climate goals include customer demand, current and anticipated government relations, employee demand, investor demand, and sustainability as a competitive advantage. AWS customers are increasingly interested in ways to drive sustainability actions. In this blog, we will walk through how we can apply existing enterprise data to better understand and estimate Scope 1 carbon footprint using Amazon Simple Storage Service (S3) and Amazon Athena, a serverless interactive analytics service that makes it easy to analyze data using standard SQL.

The Greenhouse Gas Protocol

The Greenhouse Gas Protocol (GHGP) provides standards for measuring and managing global warming impacts from an organization’s operations and value chain.

The greenhouse gases covered by the GHGP are the seven gases required by the UNFCCC/Kyoto Protocol (which is often called the “Kyoto Basket”). These gases are carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), the so-called F-gases (hydrofluorocarbons and perfluorocarbons), sulfur hexafluoride (SF6), and nitrogen trifluoride (NF3). Each greenhouse gas is characterized by its global warming potential (GWP), which is determined by the gas’s greenhouse effect and its lifetime in the atmosphere. Since carbon dioxide (CO2) accounts for about 76 percent of total man-made greenhouse gas emissions, the global warming potentials of greenhouse gases are measured relative to CO2, and are thus expressed as CO2-equivalent (CO2e).

The GHGP divides an organization’s emissions into three primary scopes:

  • Scope 1 – Direct greenhouse gas emissions (for example from burning fossil fuels)
  • Scope 2 – Indirect emissions from purchased energy (typically electricity)
  • Scope 3 – Indirect emissions from the value chain, including suppliers and customers

How do we estimate greenhouse gas emissions?

There are different methods for estimating GHG emissions, including the Continuous Emissions Monitoring System (CEMS) Method, the Spend-Based Method, and the Consumption-Based Method.

Direct Measurement – CEMS Method

An organization can estimate its carbon footprint from stationary combustion sources by performing a direct measurement of carbon emissions using the CEMS method. This method requires continuously measuring the pollutants emitted in exhaust gases from each emissions source using equipment such as gas analyzers, gas samplers, gas conditioning equipment (to remove particulate matter, water vapor and other contaminants), plumbing, actuated valves, Programmable Logic Controllers (PLCs) and other controlling software and hardware. Although this approach may yield useful results, CEMS requires specific sensing equipment for each greenhouse gas to be measured, requires supporting hardware and software, and is typically more suitable for Environment Health and Safety applications of centralized emission sources. More information on CEMS is available here.

Spend-Based Method

Because the financial accounting function is mature and often already audited, many organizations choose to use financial controls as a foundation for their carbon footprint accounting. The Economic Input-Output Life Cycle Assessment (EIO LCA) method is a spend-based method that combines expenditure data with monetary-based emission factors to estimate the emissions produced. The emission factors are published by the U.S. Environment Protection Agency (EPA) and other peer-reviewed academic and government sources. With this method, you can multiply the amount of money spent on a business activity by the emission factor to produce the estimated carbon footprint of the activity.

For example, you can convert the amount your company spends on truck transport to estimated kilograms (KG) of carbon dioxide equivalent (CO₂e) emitted as shown below.

Estimated Carbon Footprint = Amount of money spent on truck transport * Emission Factor [1]

Although these computations are very easy to make from general ledgers or other financial records, they are most valuable for initial estimates or for reporting minor sources of greenhouse gases. As the only user-provided input is the amount spent on an activity, EIO LCA methods aren’t useful for modeling improved efficiency. This is because the only way to reduce EIO-calculated emissions is to reduce spending. Therefore, as a company continues to improve its carbon footprint efficiency, other methods of estimating carbon footprint are often more desirable.

Consumption-Based Method

From either Enterprise Resource Planning (ERP) systems or electronic copies of fuel bills, it’s straightforward to determine the amount of fuel an organization procures during a reporting period. Fuel-based emission factors are available from a variety of sources such as the US Environmental Protection Agency and commercially-licensed databases. Multiplying the amount of fuel procured by the emission factor yields an estimate of the CO2e emitted through combustion. This method is often used for estimating the carbon footprint of stationary emissions (for instance backup generators for data centers or fossil fuel ovens for industrial processes).

If for a particular month an enterprise consumed a known amount of motor gasoline for stationary combustion, the Scope 1 CO2e footprint of the stationary gasoline combustion can be estimated in the following manner:

Estimated Carbon Footprint = Amount of Fuel Consumed * Stationary Combustion Emission Factor[2]

Organizations may estimate their carbon emissions by using existing data found in fuel and electricity bills, ERP data, and relevant emission factors, which are then consolidated into a data lake. Using existing analytics tools such as Amazon Athena and Amazon QuickSight, an organization can gain insight into its estimated carbon footprint.

The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint.

Analytics Architecture

Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case. For example, in the data ingestion phase, depending on the existing data requirements, there are many options to ingest data into the data lake such as using the AWS Command Line Interface (CLI), AWS DataSync, or AWS Database Migration Service.

Example of calculating a Scope 1 stationary emissions footprint with AWS services

Let’s assume you burned 100 standard cubic feet (scf) of natural gas in an oven. Using the US EPA emission factors for stationary emissions, we can estimate the carbon footprint associated with the burning. In this case, the emission factor is 0.05449555 kg CO2e/scf.[3]
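
Applying the consumption-based formula above to this example:

Estimated Carbon Footprint = 100 scf * 0.05449555 kg CO2e/scf ≈ 5.45 kg CO2e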

Amazon S3 is ideal for building a data lake on AWS to store disparate data sources in a single repository, due to its virtually unlimited scalability and high durability. Athena, a serverless interactive query service, allows the analysis of data directly from Amazon S3 using standard SQL without having to load the data into Athena or run complex extract, transform, and load (ETL) processes. Amazon QuickSight supports creating visualizations of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to extract a subset of the data. QuickSight dashboards can provide you with insights (such as your company’s estimated carbon footprint) quickly, and also provide the ability to generate standardized reports for your business and sustainability users.

In this example, the sample data is stored in a file system and uploaded to Amazon S3 using the AWS Command Line Interface (CLI) as shown in the following architecture diagram. AWS recommends creating AWS resources and managing CLI access in accordance with the Best Practices for Security, Identity, & Compliance guidance.

The AWS CLI command below demonstrates how to upload the sample data folders into the S3 target location; run the command for each folder, using the --recursive flag to copy the folder contents.

aws s3 cp /path/to/local/folder s3://bucket-name/path/to/destination --recursive

The snapshot of the S3 console shows two newly added folders that contain the files.

S3 Bucket Overview of Files

To create new table schemas, we start by running the following script for the gas utilization table in the Athena query editor using Hive DDL. The script defines the data format, column details, table properties, and the location of the data in S3.

CREATE EXTERNAL TABLE `gasutilization`(
`fuel_id` int,
`month` string,
`year` int,
`usage_therms` float,
`usage_scf` float,
`g-nr1_schedule_charge` float,
`accountfee` float,
`gas_ppps` float,
`netcharge` float,
`taxpercentage` float,
`totalcharge` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gasutilization'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

Athena Hive DDL

The script below shows another example of using Hive DDL to generate the table schema for the gas emission factor data.

CREATE EXTERNAL TABLE `gas_emission_factor`(
`fuel_id` int,
`gas_name` string,
`emission_factor` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gas_emission_factor'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

After creating the table schema in Athena, we run the following query against the gas utilization table, which includes details from gas bills, to show the gas utilization and the associated charges, such as the gas public purpose program surcharge (PPPS) and total charges after taxes, for the year 2020:

SELECT * FROM "gasutilization" where year = 2020;

Athena gas utilization overview by month

We are also able to analyze the emission factor data showing the different fuel types and their corresponding CO2e emission as shown in the screenshot.

athena co2e emission factor

With the emission factor and the gas utilization data, we can run the following query to get an estimated Scope 1 carbon footprint alongside other details. The query joins the gas utilization table and the gas emission factor table on fuel ID and multiplies the gas usage in standard cubic feet (scf) by the emission factor to get the estimated CO2e impact. It also selects the month, year, total charge, and gas usage measured in therms and scf, because these are attributes that are often of interest to customers.

SELECT "gasutilization"."usage_scf" * "gas_emission_factor"."emission_factor" 
AS "estimated_CO2e_impact", 
"gasutilization"."month", 
"gasutilization"."year", 
"gasutilization"."totalcharge", 
"gasutilization"."usage_therms", 
"gasutilization"."usage_scf" 
FROM "gasutilization" 
JOIN "gas_emission_factor" 
on "gasutilization"."fuel_id"="gas_emission_factor"."fuel_id";

athena join

Lastly, Amazon QuickSight allows visualization of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to get a subset of the data. The following is an example of a QuickSight dashboard showing the gas utilization, gas charges, and estimated carbon footprint across different years.

QuickSight sample dashboard

We have just estimated the Scope 1 carbon footprint for one source of stationary combustion. If we were to do the same process for all sources of stationary and mobile emissions (with different emissions factors) and add the results together, we could roll up an accurate estimate of our Scope 1 carbon emissions for the entire business by only utilizing native AWS services and our own data. A similar process will yield an estimate of Scope 2 emissions, with grid carbon intensity in the place of Scope 1 emission factors.

Summary

This blog discusses how organizations can use existing data in disparate sources to build a data architecture to gain better visibility into Scope 1 greenhouse gas emissions. With Athena, S3, and QuickSight, organizations can now estimate their stationary emissions carbon footprint in a repeatable way by applying the consumption-based method to convert fuel utilization into an estimated carbon footprint.

Other approaches available on AWS include Carbon Accounting on AWS, Sustainability Insights Framework, Carbon Data Lake on AWS, and general guidance detailed at the AWS Carbon Accounting Page.

If you are interested in information on estimating your organization’s carbon footprint with AWS, please reach out to your AWS account team and check out AWS Sustainability Solutions.

References

  1. An example from page four of Amazon’s Carbon Methodology document illustrates this concept.
    Amount spent on truck transport: $100,000
    EPA Emission Factor: 1.556 kg CO2e/dollar of truck transport
    Estimated CO2e emission: $100,000 * 1.556 kg CO2e/dollar of truck transport = 155,600 kg of CO2e
  2. For example,
    Gasoline consumed: 1,000 US gallons
    EPA Emission Factor: 8.81 kg of CO2e/gallon of gasoline combusted
    Estimated CO2e emission = 1,000 US gallons * 8.81 kg of CO2e per gallon of gasoline combusted = 8,810 kg of CO2e.
    The EPA emission factor for stationary emissions of motor gasoline is 8.78 kg of CO2 plus 0.38 g of CH4 plus 0.08 g of N2O.
    Combining these emission factors using the 100-year global warming potential of each gas (CH4: 25 and N2O: 298) gives us a combined emission factor of 8.78 kg + 25 * 0.00038 kg + 298 * 0.00008 kg = 8.81 kg of CO2e per gallon.
  3. The emission factor per scf is 0.05444 kg of CO2 plus 0.00103 g of CH4 plus 0.0001 g of N2O. To get this in terms of CO2e, we need to multiply the emission factors of the other two gases by their global warming potentials (GWP). The 100-year GWPs for CH4 and N2O are 25 and 298, respectively. Emission factors and GWPs come from the US EPA website.


About the Authors


Thomas Burns, SCR, CISSP, is a Principal Sustainability Strategist and Principal Solutions Architect at Amazon Web Services. Thomas supports manufacturing and industrial customers worldwide. Thomas’s focus is using the cloud to help companies reduce their environmental impact both inside and outside of IT.

Aileen Zheng is a Solutions Architect supporting US Federal Civilian Sciences customers at Amazon Web Services (AWS). She partners with customers to provide technical guidance on enterprise cloud adoption and strategy and helps with building well-architected solutions. She is also very passionate about data analytics and machine learning. In her free time, you’ll find Aileen doing pilates, taking her dog Mumu out for a hike, or hunting down another good spot for food! You’ll also see her contributing to projects to support diversity and women in technology.

How FIS ingests and searches vector data for quick ticket resolution with Amazon OpenSearch Service

Post Syndicated from Rupesh Tiwari original https://aws.amazon.com/blogs/big-data/how-fis-ingests-and-searches-vector-data-for-quick-ticket-resolution-with-amazon-opensearch-service/

This post was co-written by Sheel Saket, Senior Data Science Manager at FIS, and Rupesh Tiwari, Senior Architect at Amazon Web Services.

Do you ever find yourself grappling with multiple defect logging mechanisms, scattered project management tools, and fragmented software development platforms? Have you experienced the frustration of lacking a unified view, hindering your ability to efficiently manage and identify common trending issues within your enterprise? Are you constantly facing challenges when it comes to addressing defects and their impact, causing disruptions in your production cycles?

If these questions resonate with you, then you’re not alone. FIS, a leading technology and services provider, has encountered these very challenges. In their quest for a solution, they teamed up with AWS to tackle these obstacles head-on. In this post, we take you on a journey through their collaborative project, exploring how they used Amazon OpenSearch Service to transform their operations, enhance efficiency, and gain valuable insights.

This post shares FIS’s journey in overcoming challenges and provides step-by-step instructions for provisioning the solution architecture in your AWS account. You’ll learn how to implement a transformative solution that empowers your organization with near-real-time data indexing and visualization capabilities.

In the following sections, we dive into the details of FIS’s journey and discover how they overcame these challenges, revolutionizing their approach to defect management and software development.

Challenges for near-real-time ticket visualization and search

FIS faced several challenges in achieving near-real-time ticket visualization and search capabilities, including the following:

  • Integrating ticket data from tens of different third-party systems
  • Overcoming API call thresholds and limitations from various systems
  • Implementing an efficient KNN vector search algorithm for resolving issues and performing trend analysis
  • Establishing a robust data ingestion and indexing process for real-time updates from 15,000 tickets per day
  • Ensuring unified access to ticket information across 20 development teams
  • Providing secure and scalable access to ticket data for up to 250 teams

Despite these challenges, FIS successfully enhanced their operational efficiency, enabled quick ticket resolution, and gained valuable insights through the integration of OpenSearch Service.

Let’s delve into the technical walkthrough of the architecture diagram and mechanisms. The following section provides step-by-step instructions for provisioning and implementing the solution on your AWS Management Console, along with a helpful video tutorial.

Solution overview

The architecture diagram of FIS’s near-real-time data indexing and visualization solution incorporates various AWS services for specific functions. The solution uses GitHub as the data source, employs Amazon Simple Storage Service (Amazon S3) for scalable storage, manages APIs with Amazon API Gateway, performs serverless computing using AWS Lambda, and facilitates data streaming and ETL (extract, transform, and load) processes through Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose. OpenSearch Service is employed for analytics and application monitoring. This architecture ensures a robust and scalable solution, enabling FIS to efficiently index and visualize data in near-real time. With these AWS services, FIS effectively manages their data pipeline and gains valuable insights for their business processes.

The following diagram illustrates the solution architecture.

Architecture Diagram

The workflow includes the following steps:

  1. GitHub webhook events stream data to both Amazon S3 and OpenSearch
    Service, facilitating real-time data analysis.
  2. A Lambda function connects to an API Gateway REST API, processing and structuring the received payloads.
  3. The Lambda function adds the structured data to a Kinesis data stream, enabling immediate data streaming and quick ticket insights.
  4. Kinesis Data Firehose streams the records from the Kinesis data stream to an S3 bucket, simultaneously creating an index in OpenSearch Service.
  5. OpenSearch Service uses the indexed data to provide near-real-time visualization and enable efficient ticket analysis through K-Nearest Neighbor (KNN) search, enhancing productivity and optimizing data operations.
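
The KNN lookup described in step 5 is not published by FIS, but a minimal Python sketch using the opensearch-py client might look like the following. The domain endpoint, credentials, index name (git-tickets), vector field (ticket_embedding), and query embedding are all placeholders, and the index is assumed to already have a knn_vector mapping.

from opensearchpy import OpenSearch

# Connect to the OpenSearch Service domain (endpoint and credentials are placeholders).
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
    verify_certs=True,
)

# Embedding of the new ticket text, produced by whatever model generates the vectors.
query_vector = [0.12, 0.48, 0.33]  # hypothetical and truncated for brevity

# k-NN query: return the 5 historical tickets most similar to the new one.
response = client.search(
    index="git-tickets",  # hypothetical index name
    body={
        "size": 5,
        "query": {
            "knn": {
                "ticket_embedding": {  # hypothetical vector field
                    "vector": query_vector,
                    "k": 5,
                }
            }
        },
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))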

The following sections provide step-by-step instructions for setting up the solution. Additionally, we have created a video guide that demonstrates each step in detail. You are welcome to watch the video and follow along with this post if you prefer.

Prerequisites

You should have the following prerequisites:

Implement the solution

Complete the following steps to implement the solution:

  1. Create an OpenSearch Service domain.
  2. Create an S3 bucket named git-data.
  3. Create a Kinesis data stream named git-data-stream.
  4. Create a Firehose delivery stream named git-data-delivery-stream with
    git-data-stream as the source and git-data as the destination, and a buffer interval of 60 seconds.
  5. Create a Lambda function named git-webhook-handler with a timeout of 5 minutes. Add code that writes the incoming webhook payload to the Kinesis data stream (a minimal handler sketch follows this list).
  6. Grant the Lambda function’s execution role permission to put_record on the Kinesis data stream.
  7. Create a REST API in API Gateway named git-webhook-handler-api. Create a resource named
    git-data with a POST method, integrate it with the Lambda function git-webhook-handler created in the previous step, and deploy the REST API.
  8. Create a delivery stream with the Kinesis data stream as the source and OpenSearch Service as the destination. Grant the AWS Identity and Access Management (IAM) role for Kinesis Data Firehose the necessary permissions to create an index in OpenSearch Service. Finally, add the IAM role as a backend role in OpenSearch Service.
  9. Navigate to your GitHub repository and create a webhook to enable seamless integration with the solution. Copy the REST API URL and enter this newly created webhook.
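
The code for step 5 isn’t published in this post, but a minimal sketch of the git-webhook-handler function might look like the following Python handler. It assumes the API Gateway proxy integration passes the GitHub webhook payload in event["body"], and it uses the git-data-stream stream created in step 3; the fields pulled out of the payload are only examples.

import json

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "git-data-stream"  # the stream created in step 3

def lambda_handler(event, context):
    # The API Gateway proxy integration delivers the GitHub webhook payload as a JSON string.
    payload = json.loads(event.get("body") or "{}")

    # Structure the payload before streaming it; the selected fields are illustrative.
    record = {
        "repository": payload.get("repository", {}).get("full_name"),
        "action": payload.get("action"),
        "event": payload,
    }

    # Add the structured record to the Kinesis data stream.
    response = kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record.get("repository") or "unknown",
    )

    # Returning ShardId and SequenceNumber makes them visible in the webhook's recent deliveries.
    return {
        "statusCode": 200,
        "body": json.dumps({
            "ShardId": response["ShardId"],
            "SequenceNumber": response["SequenceNumber"],
        }),
    }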

Test the solution

To test the solution, complete the following steps:

  1. Go to your GitHub repository and choose the Star button, and verify that you receive a response with a status code of 200.
  2. Also, check for the ShardId and SequenceNumber in the recent deliveries to confirm successful event addition to the Kinesis data stream.

Kinesis data stream

  3. On the Kinesis console, use the Data Viewer to confirm the arrival of data records.

kinesis record data

  4. Navigate to OpenSearch Dashboards and open Dev Tools.
  5. Search for the records and observe that all the Git events are displayed in the result pane.

opensearch devtool

  6. On the Amazon S3 console, open the bucket and view the data records.

s3 bucket records

Security

We adhere to IAM best practices to uphold security:

  1. Craft a Lambda execution role for read/write operations on the Kinesis data stream.
  2. Generate an IAM role for Kinesis Data Firehose to manage Amazon S3 and OpenSearch
    Service access.
  3. Link this IAM role in OpenSearch Service security to confer backend user privileges.

Clean up

To avoid incurring future charges, delete all the resources you created.

Benefits of near-real-time ticket visualization and search

During our demonstration, we showcased the utilization of GitHub as the streaming data source. However, it’s important to note that the solution we presented has the flexibility to scale and incorporate multiple data sources from various services. This allows for the consolidation and visualization of diverse data in near-real time, using the capabilities of OpenSearch Service.

With the implementation of the solution described in this post, FIS effectively overcame all the challenges they faced.

In this section, we delve into the details of the challenges and benefits they achieved:

  • Integrating ticket data from multiple third-party systems – Near-real-time data streaming ensures an up-to-date information flow from third-party providers for timely insights
  • Overcoming API call thresholds and limitations imposed by different systems – Unrestricted data flow with no threshold or rate limiting enables seamless integration and continuous updates
  • Accommodating scalability requirements for up to 250 teams – The asynchronous, serverless architecture effortlessly scales more than 250 times larger without infrastructure modifications
  • Efficiently resolving tickets and performing trend analysis – OpenSearch Service semantic KNN search identifies duplicates and defects, and optimizes operations for improved efficiency
  • Gaining valuable insights for business processes – Artificial intelligence (AI) and machine
    learning (ML) analytics use the data stored in the S3 bucket, empowering deeper insights and informed decision-making
  • Ensuring secure access to ticket data and regulatory compliance – Secure data access and compliance with data protection regulations ensure data privacy and regulatory compliance

Conclusion

FIS, in collaboration with AWS, successfully addressed several challenges to achieve near-real-time ticket visualization and search capabilities. With OpenSearch Service, FIS enhanced operational efficiency by resolving tickets quickly and performing trend analysis. With their data ingestion and indexing process, FIS processed 15,000 tickets per day in near real time. The solution provided secure and scalable access to ticket data for more than 250 teams, enabling unified collaboration. FIS experienced a remarkable 30% reduction in ticket resolution time, empowering teams to quickly address issues.

As Sheel Saket, Senior Data Science Manager at FIS, states, “Our near-real-time solution transformed how we identify and resolve tickets, improving our overall productivity.”

Furthermore, organizations can further improve the solution by adopting Amazon OpenSearch Ingestion for data ingestion, which offers cost savings and out-of-the-box data processing capabilities. By embracing this transformative solution, organizations can optimize their ticket management, drive productivity, and deliver exceptional experiences to customers.

Want to know more? You can reach out to FIS from their official FIS contact page, follow FIS Twitter, and visit the FIS LinkedIn page.


About the Author

Rupesh Tiwari is a Senior Solutions Architect at AWS in New York City, with a focus on Financial Services. He has over 18 years of IT experience in the finance, insurance, and education domains, and specializes in architecting large-scale applications and cloud-native big data workloads. In his spare time, Rupesh enjoys singing karaoke, watching comedy TV series, and creating joyful moments with his family.

Sheel Saket is a Senior Data Science Manager at FIS in Chicago, Illinois. He has over 11 years of IT experience in the finance, insurance, and e-commerce domains, and specializes in architecting large-scale AI solutions and cloud MLOps. In his spare time, Sheel enjoys listening to audiobooks, podcasts, and watching movies with his family.

Amazon Kinesis Data Streams on-demand capacity mode now scales up to 1 GB/second ingest capacity

Post Syndicated from Nihar Sheth original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-streams-on-demand-capacity-mode-now-scales-up-to-1-gb-second-ingest-capacity/

Amazon Kinesis Data Streams is a serverless data streaming service that makes it easy to capture, process, and store streaming data at any scale. As customers collect and stream more types of data, they have asked for simpler, elastic data streams that can handle variable and unpredictable data traffic. In November 2021, Amazon Web Services launched the on-demand capacity mode for Kinesis Data Streams, which is capable of serving gigabytes of write and read throughput per minute and helps reduce the operational pain point of manually updating data stream capacity. You can create a new on-demand data stream or convert an existing data stream to on-demand mode with a single click and never have to provision and manage servers, storage, or throughput. By default, on-demand capacity mode can automatically scale up to 200 MB/s of write throughput.

We were encouraged by customers’ adoption of on-demand capacity mode, but as customers scaled their workloads, some ran into the 200 MB/s data ingestion limit and asked for a solution. The team worked backward from customer feedback to raise that limit. As of March 2023, Kinesis Data Streams supports an increased on-demand write throughput limit of 1 GB/s, a five-times increase from the previous limit of 200 MB/s. The result is a truly serverless and elastic data streaming service that works for all your use cases. If you require an increase in capacity, you can contact AWS Support to enable on-demand streams to scale up to 1 GB/s write throughput for each requested account. You pay for throughput consumed rather than for provisioned resources, making it easier to balance costs and performance. Overall, if your data volume can spike unpredictably or you don’t want to manage the number of shards, use on-demand streams.

In this post, we explore how to use Kinesis Data Streams on-demand scaling and best practices to build an efficient data-streaming solution. We discuss different scenarios to avoid write throughput exceptions and scale ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode.

Kinesis Data Streams on-demand scaling

A shard serves as a base throughput unit of Kinesis Data Streams. A shard supports 1 MB/s and 1,000 records/s for writes and 2 MB/s for reads. The shard limits ensure predictable performance, making it easy to design and operate a highly reliable data streaming workflow. In on-demand capacity mode, scaling happens at the individual shard level. When the average ingest shard utilization reaches 50% (0.5 MB/s or 500 records/s) in 1 minute, then a shard is split into two shards. If you use random values as a partition key, all shards of the stream will have even traffic, and they will be scaled at the same time. If you use a business-specific key as a partition key, the shards will have uneven traffic. In that scenario, only the shards exceeding an average of 50% utilization will be scaled. Depending upon the number of shards being scaled, it will take up to 15 minutes to split the shards.

When we create a new Kinesis data stream in on-demand capacity mode, by default, Kinesis Data Streams provisions four shards, which provides 4 MB/s write and 8 MB/s read throughput. As the workload ramps up, Kinesis Data Streams increases the number of shards in the stream by monitoring ingest throughput at the shard level. The 4 MB/s default ingest throughput and scaling at shard level in on-demand capacity mode works for most use cases. However, in some specific scenarios, producers may face WriteThroughputExceeded and Rate Exceeded errors, even in on-demand capacity mode. We discuss a few of these scenarios in the following sections and strategies to avoid these errors.

You can create and save record templates and easily send data to Kinesis Data Streams using the Amazon Kinesis Data Generator (KDG) to test the streaming data solution. Alternatively, you can also use the modern load testing framework Locust to run large-scale Kinesis Data Streams load testing. For this post, we use the Locust tool to produce and ingest messages in Kinesis Data Streams for our different use cases.
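
The exact Locust scripts used for these tests aren’t shown in this post, but a minimal Locust (2.x API assumed) user that writes records to a data stream could look like the following sketch. The stream name, record size, and wait time are placeholders that you would tune, together with the number of simulated users, to reach the target MB/s.

import json
import os
import time
import uuid

import boto3
from locust import User, constant, task

STREAM_NAME = os.environ.get("STREAM_NAME", "kds-od-default-shards")  # placeholder
RECORD_BYTES = 1024  # ~1 KB per record; tune with the user count to hit the target MB/s

class KinesisProducer(User):
    wait_time = constant(0)  # send as fast as each simulated user can

    def on_start(self):
        self.kinesis = boto3.client("kinesis")
        self.payload = json.dumps({"data": "x" * RECORD_BYTES}).encode("utf-8")

    @task
    def put_record(self):
        start = time.time()
        exception = None
        try:
            self.kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=self.payload,
                PartitionKey=str(uuid.uuid4()),  # random key spreads traffic across shards
            )
        except Exception as err:  # throttles surface here as write throughput exceptions
            exception = err
        # Report the call to Locust so throughput and failures show up in its statistics.
        self.environment.events.request.fire(
            request_type="kinesis",
            name="put_record",
            response_time=(time.time() - start) * 1000,
            response_length=len(self.payload),
            exception=exception,
            context={},
        )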

Scenario 1: A baseline ingest throughput greater than 4 MB/s is needed

To simulate this scenario, run the following AWS Command Line Interface (AWS CLI) command to create the kds-od-default-shards data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-default-shards data stream is active, run the following AWS CLI command to check the number of shards in the data stream:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

You can observe that the OpenShardCount value is 4, which means the kds-od-default-shards data stream has an ingest capacity of 4 MB/s.

Next, we use the Locust tool to set the baseline to approximately 25 MB/s of incoming records. As displayed in the following Amazon CloudWatch metrics graph, records are getting throttled for the first couple of minutes. Then the kds-od-default-shards data stream scales the number of shards to support 25 MB/s ingest throughput, and records stop getting throttled. You can also rerun the describe-stream-summary AWS CLI command to check the increased number of shards in the data stream.

BDB-3047-scenario-1-incoming-data

BDB-3047-scenario-1-record-throttle

In a scenario where we know our ingest throughput baseline (25 MB/s) ahead of time and we don’t want to observe any write throttles, we can create a stream in provisioned mode by specifying the number of shards (30), as shown in the following AWS CLI command (make sure to delete the kds-od-default-shards stream manually from the Kinesis Data Streams console before running this command):

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=PROVISIONED --shard-count 30 --region us-east-1

When the kds-od-default-shards data stream is active, run the following AWS CLI command to convert the data stream’s capacity mode to on-demand:

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

Next, we send 25 MB/s records to the kds-od-default-shards data stream. As displayed in the following CloudWatch metrics graph, we can observe no write throttles, and the kds-od-default-shards data stream scales the number of shards to handle the increase in ingest volume.

BDB-3047-scenario-1-incoming-data1

BDB-3047-scenario-1-record-throttle1

After we send 25 MB/s traffic to the data stream for some time, we can run the following AWS CLI command to see that the OpenShardCount value has now increased to more than 30:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

Scenario 2: A significant ingestion spike is expected, which needs ingest throughput greater than the number of shards in the stream

To simulate the scenario, run the following AWS CLI command to create the kds-od-significant-spike data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-significant-spike --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the kds-od-significant-spike data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 4 MB/s ingest throughput initially and grow the ingest throughput by 30–50% every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-significant-spike data stream scales the number of shards to handle the increase in ingest volume.

After approximately 15 minutes, run the following AWS CLI command to find the OpenShardCount value (x) of the kds-od-significant-spike data stream. Then send (x * 2) MB/s ingest throughput to the data stream for 2–3 minutes and reduce the ingest throughput to the prior level:

aws kinesis describe-stream-summary --stream-name kds-od-significant-spike --region us-east-1

As displayed in the following CloudWatch metrics graph, the records are getting throttled for a few minutes, and then the throttling goes away.

BDB-3047-scenario-2-incoming-data

BDB-3047-scenario-2-record-throttle

Typically, we face a significant spike scenario when running planned events, such as shopping holidays and product launches. To handle such scenarios, we can proactively change capacity mode from on-demand to provisioned. We can configure the number of shards and pick the ingest capacity we anticipate. After we successfully scale the number of shards to our desired peak capacity in provisioned capacity mode, we can change the capacity mode back to on-demand mode.
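
If you want to script that switch rather than use the console, a boto3 sketch along these lines could work; the stream name, ARN, and peak shard count are placeholders. Note that UpdateShardCount can at most double the current shard count per call, and that you can switch capacity modes only twice per 24-hour rolling period.

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "kds-od-significant-spike"  # placeholder
STREAM_ARN = "arn:aws:kinesis:us-east-1:111122223333:stream/kds-od-significant-spike"  # placeholder
PEAK_SHARDS = 60  # anticipated peak, roughly 60 MB/s of write capacity

waiter = kinesis.get_waiter("stream_exists")  # waits until the stream is ACTIVE again

# Before the planned event: switch to provisioned mode.
kinesis.update_stream_mode(
    StreamARN=STREAM_ARN,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)
waiter.wait(StreamName=STREAM_NAME)

# Scale toward the anticipated peak; each UpdateShardCount call can at most double the count.
current = kinesis.describe_stream_summary(StreamName=STREAM_NAME)[
    "StreamDescriptionSummary"]["OpenShardCount"]
while current < PEAK_SHARDS:
    target = min(current * 2, PEAK_SHARDS)
    kinesis.update_shard_count(
        StreamName=STREAM_NAME,
        TargetShardCount=target,
        ScalingType="UNIFORM_SCALING",
    )
    waiter.wait(StreamName=STREAM_NAME)
    current = target

# After reaching the desired shard count: switch back to on-demand mode.
kinesis.update_stream_mode(
    StreamARN=STREAM_ARN,
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)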

Scenario 3: A single partition key starts pushing more than 1 MB/s

Partition keys are used to segregate and route records to different shards of a stream. A partition key is specified by the data producer while adding data to the data stream. For example, let’s assume we have a stream with two shards (shard 1 and shard 2). We can configure the data producer to use two partition keys (key A and key B) so that all records with key A are added to shard 1 and all records with key B are added to shard 2. Choosing a partition key is a very important decision, and we should carefully pick the partition key to ensure equal distribution of records across all the shards of the stream. Messages tied to a single partition key A will be sent to a single shard (shard 1), and at any given instance, messages tied to a single partition key A cannot be distributed across different shards. As mentioned earlier, by default, one shard supports 1 MB/s and 1,000 records/s for writes, and we may end up with an edge case scenario where we are trying to push more than 1 MB/s for a specific partition key. In this scenario, producers will continue to experience throttles and keep retrying indefinitely.

To simulate the scenario, run the following AWS CLI command to create the kds-od-partition-key-throttle data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-partition-key-throttle --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 1.5 MB/s ingest throughput continuously for the specific partition key A. As displayed in the following CloudWatch metrics graph, we can observe that throttling continues on a single shard even though we are sending only 1.5 MB/s of ingest throughput and the kds-od-partition-key-throttle data stream has an overall ingest capacity of 4 MB/s.

BDB-3047-scenario-3-incoming-data

BDB-3047-scenario-3-record-throttle

To avoid this scenario, we should carefully pick our partition key and ensure that this specific partition key won’t be continuously sending more than 1 MB/s ingest throughput in the data stream.
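
As an illustration of that guidance, the producer sketch below contrasts a fixed "hot" key with a high-cardinality key such as a random UUID; the stream name matches the one created above, and the event payload is hypothetical.

import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

def send_event(event, hot_key=False):
    # A fixed key such as "key-A" pins every record to one shard, capping that traffic
    # at 1 MB/s and 1,000 records/s regardless of how many shards the stream has.
    # A random UUID spreads records evenly across all shards of the stream.
    partition_key = "key-A" if hot_key else str(uuid.uuid4())
    return kinesis.put_record(
        StreamName="kds-od-partition-key-throttle",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )

# Evenly distributed writes let on-demand scaling absorb the load by adding shards.
send_event({"order_id": 1001, "amount": 25.0})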

Scale the ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode

To test, we start with approximately 100 MB/s baseline ingest throughput to Kinesis Data Streams in on-demand capacity mode, then we increase the ingest throughput by 30–50% every 5–10 minutes using the Locust load testing tool.

To set up the scenario, first create the kds-od-1gb-stream data stream in provisioned capacity mode and provide a value of 120 for the provisioned shards field:

aws kinesis create-stream --stream-name kds-od-1gb-stream --stream-mode-details StreamMode=PROVISIONED --shard-count 120 --region us-east-1

When the kds-od-1gb-stream data stream is active, switch its capacity mode to on-demand, as shown in the following code. When we change capacity mode from provisioned to on-demand, the shard count (120) remains the same for the data stream even in on-demand capacity mode.

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-1gb-stream --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-1gb-stream data stream is in on-demand mode, start the experiment. We send approximately 100 MB/s baseline ingest throughput using the Locust tool and increase 30–50% ingest throughput every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-1gb-stream data stream seamlessly scaled to 1 GB/s in on-demand capacity mode. We can also observe that the producers didn’t encounter any write throttles while the data stream was scaling in on-demand capacity mode.

BDB-3047-scale-to-1-GB

Clean up

To avoid ongoing costs, delete all the data streams that you created as part of this post using the Kinesis Data Streams console.

Conclusion

This post demonstrated the on-demand scaling policy of Kinesis Data Streams with a few scenarios using best practices and showed how to scale ingest capacity to 1 GB/s in on-demand capacity mode. You can have an on-demand write throughput limit that is five times larger than the previous limit of 200 MB/s. Choose on-demand mode if you create new data streams with unknown workloads, have unpredictable application traffic, or prefer not to manage capacity. You can switch between on-demand and provisioned capacity modes two times per 24-hour rolling period. Please leave any feedback in the comments section.


About the Authors

Nihar Sheth is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Pratik Patel is Sr. Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively keep customers’ AWS environments operationally healthy.

Nisha Dekhtawala is a Partner Solutions Architect and data analytics specialist. She works with global consulting partners as their trusted advisor, providing technical guidance and support in building Well-Architected innovative industry solutions.

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

Post Syndicated from Tom Romano original https://aws.amazon.com/blogs/big-data/empower-your-jira-data-in-a-data-lake-with-amazon-appflow-and-aws-glue/

In the world of software engineering and development, organizations use project management tools like Atlassian Jira Cloud. Managing projects with Jira leads to rich datasets, which can provide historical and predictive insights about project and development efforts.

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. Companies often take a data lake approach to their analytics, bringing data from many different systems into one place to simplify how the analytics are done.

This post shows you how to use Amazon AppFlow and AWS Glue to create a fully automated data ingestion pipeline that will synchronize your Jira data into your data lake. Amazon AppFlow provides software as a service (SaaS) integration with Jira Cloud to load the data into your AWS account. AWS Glue is a serverless data discovery, load, and transformation service that will prepare data for consumption in BI and AI/ML activities. Additionally, this post strives for a low-code, serverless solution for operational efficiency, and the solution supports incremental loading for cost optimization.

Solution overview

This solution uses Amazon AppFlow to retrieve data from the Jira Cloud. The data is synchronized to an Amazon Simple Storage Service (Amazon S3) bucket using an initial full download and subsequent incremental downloads of changes. When new data arrives in the S3 bucket, an AWS Step Functions workflow is triggered that orchestrates extract, transform, and load (ETL) activities using AWS Glue crawlers and AWS Glue DataBrew. The data is then available in the AWS Glue Data Catalog and can be queried by services such as Amazon Athena, Amazon QuickSight, and Amazon Redshift Spectrum. The solution is completely automated and serverless, resulting in low operational overhead. When this setup is complete, your Jira data will be automatically ingested and kept up to date in your data lake!

The following diagram illustrates the solution architecture.

The Jira AppFlow architecture is shown. The Jira Cloud data is retrieved by Amazon AppFlow and stored in Amazon S3. This triggers an Amazon EventBridge event that runs an AWS Step Functions workflow. The workflow uses AWS Glue to catalog and transform the data, and the data is then queried with QuickSight.

The Step Functions workflow orchestrates the following ETL activities, resulting in two tables:

  • An AWS Glue crawler collects all downloads into a single AWS Glue table named jira_raw. This table is composed of a mix of full and incremental downloads from Jira, with many versions of the same records representing changes over time.
  • A DataBrew job prepares the data for reporting by unpacking key-value pairs in the fields, as well as removing superseded records as they are updated in subsequent change data captures. This reporting-ready data will be available in an AWS Glue table named jira_data.

The following figure shows the Step Functions workflow.

A diagram represents the AWS Step Functions workflow. It contains the steps to run an AWS Glue crawler, wait for its completion, and then run an AWS Glue DataBrew data transformation job.

Prerequisites

This solution requires the following:

  • Administrative access to your Jira Cloud instance, and an associated Jira Cloud developer account.
  • An AWS account and a login with access to the AWS Management Console. Your login will need AWS Identity and Access Management (IAM) permissions to create and access the resources in your AWS account.
  • Basic knowledge of AWS and working knowledge of Jira administration.

Configure the Jira Instance

After logging in to your Jira Cloud instance, you establish a Jira project with associated epics and issues to download into a data lake. If you’re starting with a new Jira instance, it helps to have at least one project with a sampling of epics and issues for the initial data download, because it allows you to create an initial dataset without errors or missing fields. Note that you may have multiple projects as well.

An image shows a Jira Cloud example, with several issues arranged in a Kanban board.

After you have established your Jira project and populated it with epics and issues, ensure you also have access to the Jira developer portal. In later steps, you use this developer portal to establish authentication and permissions for the Amazon AppFlow connection.

Provision resources with AWS CloudFormation

For the initial setup, you launch an AWS CloudFormation stack to create an S3 bucket to store data, IAM roles for data access, and the AWS Glue crawler and Data Catalog components. Complete the following steps:

  1. Sign in to your AWS account.
  2. Click Launch Stack:
  3. For Stack name, enter a name for the stack (the default is aws-blog-jira-datalake-with-AppFlow).
  4. For GlueDatabaseName, enter a unique name for the Data Catalog database to hold the Jira data table metadata (the default is jiralake).
  5. For InitialRunFlag, choose Setup. This mode will scan all data and disable the change data capture (CDC) features of the stack. (Because this is the initial load, the stack needs an initial data load before you configure CDC in later steps.)
  6. Under Capabilities and transforms, select the acknowledgement check boxes to allow IAM resources to be created within your AWS account.
  7. Review the parameters and choose Create stack to deploy the CloudFormation stack. This process will take around 5–10 minutes to complete.
    An image depicts the Amazon CloudFormation configuration steps, including setting a stack name, setting parameters to "jiralake" and "Setup" mode, and checking all IAM capabilities requested.
  8. After the stack is deployed, review the Outputs tab for the stack and collect the following values to use when you set up Amazon AppFlow:
    • Amazon AppFlow destination bucket (o01AppFlowBucket)
    • Amazon AppFlow destination bucket path (o02AppFlowPath)
    • Role for Amazon AppFlow Jira connector (o03AppFlowRole)
      An image demonstrating the Amazon Cloudformation "Outputs" tab, highlighting the values to add to the Amazon AppFlow configuration.

Configure Jira Cloud

Next, you configure your Jira Cloud instance for access by Amazon AppFlow. For full instructions, refer to Jira Cloud connector for Amazon AppFlow. The following steps summarize these instructions and discuss the specific configuration to enable OAuth in the Jira Cloud:

  1. Open the Jira developer portal.
  2. Create the OAuth 2 integration from the developer application console by choosing Create an OAuth 2.0 Integration. This will provide a login mechanism for AppFlow.
  3. Enable fine-grained permissions. See Recommended scopes for the permission settings to grant AppFlow appropriate access to your Jira instance.
  4. Add the following permission scopes to your OAuth app:
    1. manage:jira-configuration
    2. read:field-configuration:jira
  5. Under Authorization, set the Call Back URL to return to Amazon AppFlow with the URL https://us-east-1.console.aws.amazon.com/AppFlow/oauth.
  6. Under Settings, note the client ID and secret to use in later steps to set up authentication from Amazon AppFlow.

Create the Amazon AppFlow Jira Cloud connection

In this step, you configure Amazon AppFlow to run a one-time full data fetch of all your data, establishing the initial data lake:

  1. On the Amazon AppFlow console, choose Connectors in the navigation pane.
  2. Search for the Jira Cloud connector.
  3. Choose Create flow on the connector tile to create the connection to your Jira instance.
An image of Amazon AppFlow, showing the search for the "Jira Cloud" connector.
  4. For Flow name, enter a name for the flow (for example, JiraLakeFlow).
  5. Leave the Data encryption setting as the default.
  6. Choose Next.
    The Amazon AppFlow Jira connector configuration, showing the Flow name set to "JiraLakeFlow" and clicking the "next" button.
  7. For Source name, keep the default of Jira Cloud.
  8. Choose Create new connection under Jira Cloud connection.
  9. In the Connect to Jira Cloud section, enter the values for Client ID, Client secret, and Jira Cloud Site that you collected earlier. This provides the authentication from AppFlow to Jira Cloud.
  10. For Connection Name, enter a connection name (for example, JiraLakeCloudConnection).
  11. Choose Connect. You will be prompted to allow your OAuth app to access your Atlassian account to verify authentication.
An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  12. In the Authorize App window that pops up, choose Accept.
  13. With the connection created, return to the Configure flow section on the Amazon AppFlow console.
  14. For API version, choose V2 to use the latest Jira query API.
  15. For Jira Cloud object, choose Issue to query and download all issues and associated details.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  16. For Destination Name in the Destination Details section, choose Amazon S3.
  17. For Bucket details, choose the S3 bucket name that matches the Amazon AppFlow destination bucket value that you collected from the outputs of the CloudFormation stack.
  18. Enter the Amazon AppFlow destination bucket path to complete the full S3 path. This will send the Jira data to the S3 bucket created by the CloudFormation script.
  19. Leave Catalog your data in the AWS Glue Data Catalog unselected. The CloudFormation script uses an AWS Glue crawler to update the Data Catalog in a different manner, grouping all the downloads into a common table, so we disable the update here.
  20. For File format settings, select Parquet format and select Preserve source data types in Parquet output. Parquet is a columnar format to optimize subsequent querying.
  21. Select Add a timestamp to the file name for Filename preference. This will allow you to easily find data files downloaded at a specific date and time.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  22. For now, select Run on Demand for the Flow trigger to run the full load flow manually. You will schedule downloads in a later step when implementing CDC.
  23. Choose Next.
    An image of the Amazon AppFlow Flow Trigger configuration, reflecting the completion of the prior steps.
  24. On the Map data fields page, select Manually map fields.
  25. For Source to destination field mapping, choose the drop-down box under Source field name and select Map all fields directly. This will bring down all fields as they are received, because we will instead implement data preparation in later steps.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 24 & 25.
  26. Under Partition and aggregation settings, you can set up the partitions in a way that works for your use case. For this example, we use a daily partition, so select Date and time and choose Daily.
  27. For Aggregation settings, leave it as the default of Don’t aggregate.
  28. Choose Next.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 26-28.
  29. On the Add filters page, you can create filters to only download specific data. For this example, you download all the data, so choose Next.
  30. Review and choose Create flow.
  31. When the flow is created, choose Run flow to start the initial data seeding. After some time, you should receive a banner indicating the run finished successfully.
    An image of the Amazon AppFlow configuration, reflecting the completion of step 31.

Review seed data

At this stage in the process, you now have data in your S3 environment. When new data files are created in the S3 bucket, it will automatically run an AWS Glue crawler to catalog the new data. You can see if it’s complete by reviewing the Step Functions state machine for a Succeeded run status. There is a link to the state machine on the CloudFormation stack’s Resources tab, which will redirect you to the Step Functions state machine.

An image showing the CloudFormation Resources tab of the stack, with a link to the AWS Step Functions workflow.

When the state machine is complete, it’s time to review the raw Jira data with Athena. The database is as you specified in the CloudFormation stack (jiralake by default), and the table name is jira_raw. If you kept the default AWS Glue database name of jiralake, the Athena SQL is as follows:

SELECT * FROM "jiralake"."jira_raw" limit 10;

If you explore the data, you’ll notice that most of the data you would want to work with is actually packed into a column called fields. This means the data is not available as columns in your Athena queries, making it harder to select, filter, and sort individual fields within an Athena SQL query. This will be addressed in the next steps.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_raw" limit 10;

Set up CDC and unpack the fields columns

To add the ongoing CDC and reformat the data for analytics, we introduce a DataBrew job to transform the data and filter to the most recent version of each record as changes come in. You can do this by updating the CloudFormation stack with a flag that includes the CDC and data transformation steps.

  1. On the AWS CloudFormation console, return to the stack.
  2. Choose Update.
  3. Select Use current template and choose Next.
    An image showing Amazon CloudFormation, with steps 1-3 complete.
  4. For SetupOrCDC, choose CDC, then choose Next. This will enable both the CDC steps and the data transformation steps for the Jira data.
    An image showing Amazon CloudFormation, with step 4 complete.
  5. Continue choosing Next until you reach the Review section.
  6. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
    An image showing Amazon CloudFormation, with step 5-6 complete.
  7. Return to the Amazon AppFlow console and open your flow.
  8. On the Actions menu, choose Edit flow. We will now edit the flow trigger to run an incremental load on a periodic basis.
  9. Select Run flow on schedule.
  10. Configure the desired repeats, as well as start time and date. For this example, we choose Daily for Repeats and enter 1 so that the flow runs every day. For Starting at, enter 01:00.
  11. Select Incremental transfer for Transfer mode.
  12. Choose Updated on the drop-down menu so that changes will be captured based on when the records were updated.
  13. Choose Save. With these settings in our example, the run will happen nightly at 1:00 AM.
    An image showing the Flow Trigger, with incremental transfer selected.

Review the analytics data

When the next incremental load occurs that results in new data, the Step Functions workflow will start the DataBrew job and populate a new staged analytical data table named jira_data in your Data Catalog database. If you don’t want to wait, you can trigger the Step Functions workflow manually.
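
One way to start the workflow manually outside of the Step Functions console is a short boto3 call; the state machine ARN below is a placeholder, and you can copy the real value from the link on the CloudFormation stack’s Resources tab.

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN: copy the real value from the CloudFormation stack's Resources tab.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:jira-datalake-workflow",
    input="{}",
)
print(response["executionArn"])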

The DataBrew job performs data transformation and filtering tasks. The job unpacks the key-values from the Jira JSON data and the raw Jira data, resulting in a tabular data schema that facilitates use with BI and AI/ML tools. As Jira items are changed, the changed item’s data is resent, resulting in multiple versions of an item in the raw data feed. The DataBrew job filters the raw data feed so that the resulting data table only contains the most recent version of each item. You could enhance this DataBrew job to further customize the data for your needs, such as renaming the generic Jira custom field names to reflect their business meaning.
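
Conceptually, the keep-only-the-latest-version filtering that the DataBrew job applies is similar to the following pandas sketch; the column names (issue key and updated timestamp) are hypothetical, and the real recipe operates on the unpacked Jira fields rather than a hand-built DataFrame.

import pandas as pd

# Hypothetical raw feed: multiple versions of the same Jira issue across data loads.
raw = pd.DataFrame(
    [
        {"issue_key": "PROJ-1", "status": "To Do", "updated": "2023-05-01T10:00:00"},
        {"issue_key": "PROJ-1", "status": "Done", "updated": "2023-05-03T16:30:00"},
        {"issue_key": "PROJ-2", "status": "In Progress", "updated": "2023-05-02T09:15:00"},
    ]
)

# Keep only the most recent version of each issue, mirroring the DataBrew filtering step.
latest = raw.sort_values("updated").drop_duplicates(subset="issue_key", keep="last")
print(latest)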

When the Step Functions workflow is complete, we can query the data in Athena again using the following query:

SELECT * FROM "jiralake"."jira_data" limit 10;

You can see that in our transformed jira_data table, the nested JSON fields are broken out into their own columns for each field. You will also notice that we’ve filtered out obsolete records that have been superseded by more recent record updates in later data loads so the data is fresh. If you want to rename custom fields, remove columns, or restructure what comes out of the nested JSON, you can modify the DataBrew recipe to accomplish this. At this point, the data is ready to be used by your analytics tools, such as Amazon QuickSight.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_data" limit 10;

Clean up

If you would like to discontinue this solution, you can remove it with the following steps:

  1. On the Amazon AppFlow console, deactivate the flow for Jira, and optionally delete it.
  2. On the Amazon S3 console, select the S3 bucket for the stack, and empty the bucket to delete the existing data.
  3. On the AWS CloudFormation console, delete the CloudFormation stack that you deployed.

Conclusion

In this post, we created a serverless incremental data load process for Jira that will synchronize data while handling custom fields using Amazon AppFlow, AWS Glue, and Step Functions. The approach uses Amazon AppFlow to incrementally load the data into Amazon S3. We then use AWS Glue and Step Functions to manage the extraction of the Jira custom fields and load them in a format to be queried by analytics services such as Athena, QuickSight, or Redshift Spectrum, or AI/ML services like Amazon SageMaker.

To learn more about AWS Glue and DataBrew, refer to Getting started with AWS Glue DataBrew. With DataBrew, you can take the sample data transformation in this project and customize the output to meet your specific needs. This could include renaming columns, creating additional fields, and more.

To learn more about Amazon AppFlow, refer to Getting started with Amazon AppFlow. Note that Amazon AppFlow supports integrations with many SaaS applications in addition to the Jira Cloud.

To learn more about orchestrating flows with Step Functions, see Create a Serverless Workflow with AWS Step Functions and AWS Lambda. The workflow could be enhanced to load the data into a data warehouse, such as Amazon Redshift, or trigger a refresh of a QuickSight dataset for analytics and reporting.

In future posts, we will cover how to unnest parent-child relationships within the Jira data using Athena and how to visualize the data using QuickSight.


About the Authors

Tom Romano is a Sr. Solutions Architect for AWS World Wide Public Sector from Tampa, FL, and assists GovTech and EdTech customers as they create new solutions that are cloud native, event driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics, and is an Analytics Specialist. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.

Shane Thompson is a Sr. Solutions Architect based out of San Luis Obispo, California, working with AWS Startups. He works with customers who use AI/ML in their business model and is passionate about democratizing AI/ML so that all customers can benefit from it. In his free time, Shane loves to spend time with his family and travel around the world.

Perform continuous vulnerability scanning of AWS Lambda functions with Amazon Inspector

Post Syndicated from Manjunath Arakere original https://aws.amazon.com/blogs/security/perform-continuous-vulnerability-scanning-of-aws-lambda-functions-with-amazon-inspector/

This blog post demonstrates how you can activate Amazon Inspector within one or more AWS accounts and be notified when a vulnerability is detected in an AWS Lambda function.

Amazon Inspector is an automated vulnerability management service that continually scans workloads for software vulnerabilities and unintended network exposure. Amazon Inspector scans mixed workloads like Amazon Elastic Compute Cloud (Amazon EC2) instances and container images located in Amazon Elastic Container Registry (Amazon ECR). At re:Invent 2022, we announced Amazon Inspector support for Lambda functions and Lambda layers to provide a consolidated solution for compute types.

Only scanning your functions for vulnerabilities before deployment might not be enough since vulnerabilities can appear at any time, like the widespread Apache Log4j vulnerability. So it’s essential that workloads are continuously monitored and rescanned in near real time as new vulnerabilities are published or workloads are changed.

Amazon Inspector scans are intelligently initiated based on the updates to Lambda functions or when new Common Vulnerabilities and Exposures (CVEs) are published that are relevant to your function. No agents are needed for Amazon Inspector to work, which means you don’t need to install a library or agent in your Lambda functions or layers. When Amazon Inspector discovers a software vulnerability or network configuration issue, it creates a finding which describes the vulnerability, identifies the affected resource, rates the severity of the vulnerability, and provides remediation guidance.

In addition, Amazon Inspector integrates with several AWS services, such as Amazon EventBridge and AWS Security Hub. You can use EventBridge to build automation workflows like getting notified for a specific vulnerability finding or performing an automatic remediation with the help of Lambda or AWS Systems Manager.

In this blog post, you will learn how to do the following:

  1. Activate Amazon Inspector in a single AWS account and AWS Region.
  2. See how Amazon Inspector automated discovery and continuous vulnerability scanning works by deploying a new Lambda function with a vulnerable package dependency.
  3. Receive a near real-time notification when a vulnerability with a specific severity is detected in a Lambda function with the help of EventBridge and Amazon Simple Notification Service (Amazon SNS).
  4. Remediate the vulnerability by using the recommendation provided in the Amazon Inspector dashboard.
  5. Activate Amazon Inspector in multiple accounts or Regions through AWS Organizations.

Solution architecture

Figure 1 shows the AWS services used in the solution and how they are integrated.

Figure 1: Solution architecture overview

The workflow for the solution is as follows:

  1. Deploy a new Lambda function by using the AWS Serverless Application Model (AWS SAM); a minimal example of such a function is sketched after this list.
  2. Amazon Inspector scans when a new vulnerability is published or when an update to an existing Lambda function or a new Lambda function is deployed. Vulnerabilities are identified in the deployed Lambda function.
  3. Amazon EventBridge receives the events from Amazon Inspector and checks against the rules for specific events or filter conditions.
  4. In this case, an EventBridge rule exists for the Amazon Inspector findings, and the target is defined as an SNS topic to send an email to the system operations team.
  5. The EventBridge rule invokes the target SNS topic with the event data, and an email is sent to the confirmed subscribers in the SNS topic.
  6. The system operations team receives an email with detailed information on the vulnerability, the fixed package versions, the Amazon Inspector score to prioritize, and the impacted Lambda functions. By using the remediation information from Amazon Inspector, the team can now prioritize actions and remediate.
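
For step 1, the function deployed with AWS SAM can be any small Python function whose requirements.txt pins an intentionally outdated third-party package so that Amazon Inspector has something to find. The handler below is only a sketch; the library (requests) and the idea of pinning an old release are assumptions, and you would pick a package version with published CVEs from a CVE database.

# app.py: a minimal handler whose only purpose is to pull in a third-party dependency.
# requirements.txt (placeholder): pin an intentionally outdated release of the library,
# for example an old version with published CVEs, so Amazon Inspector reports findings.
import requests

def lambda_handler(event, context):
    response = requests.get("https://example.com", timeout=5)
    return {"statusCode": response.status_code}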

Prerequisites

To follow along with this demo, we recommend that you have the following in place:

  • An AWS account.
  • A command line interface: AWS CloudShell or AWS CLI. In this post, we recommend the use of CloudShell because it already has Python and AWS SAM. However, you can also use your CLI with AWS CLI, SAM, and Python.
  • An AWS Region where Amazon Inspector Lambda code scanning is available.
  • An IAM role in that account with administrator privileges.

The solution in this post includes the following AWS services: Amazon Inspector, AWS Lambda, Amazon EventBridge, AWS Identity and Access Management (IAM), Amazon SNS, AWS CloudShell, and AWS Organizations for activating Amazon Inspector at scale (across multiple accounts).

Step 1: Activate Amazon Inspector in a single account in the Region

The first step is to activate Amazon Inspector in your account in the Region you are using.

To activate Amazon Inspector

  1. Sign in to the AWS Management Console.
  2. Open AWS CloudShell. CloudShell inherits the credentials and permissions of the IAM principal who is signed in to the AWS Management Console. CloudShell comes with the CLIs and runtimes that are needed for this demo (AWS CLI, AWS SAM, and Python).
  3. Use the following command in CloudShell to get the status of the Amazon Inspector activation.
    aws inspector2 batch-get-account-status

  4. Use the following command to activate Inspector in the default Region for resource type LAMBDA. Other allowed values for resource types are EC2, ECR, and LAMBDA_CODE.
    aws inspector2 enable --resource-types '["LAMBDA"]'

  5. Use the following command to verify the status of the Amazon Inspector activation.
    aws inspector2 batch-get-account-status

You should see a response that shows that Amazon Inspector is enabled for Lambda resources, as shown in Figure 2.

Figure 2: Amazon Inspector status after you enable Lambda scanning

Step 2: Create an SNS topic and subscription for notification

Next, create the SNS topic and the subscription so that you will be notified of each new Amazon Inspector finding.

To create the SNS topic and subscription

  1. Use the following command in CloudShell to create the SNS topic and its subscription, replacing <REGION_NAME>, <AWS_ACCOUNTID>, and <EMAIL_ADDRESS> with the relevant values.
    aws sns create-topic --name amazon-inspector-findings-notifier; 
    
    aws sns subscribe \
    --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier \
    --protocol email --notification-endpoint <[email protected]>

  2. Check the email inbox you entered for <[email protected]>, and in the email from Amazon SNS, choose Confirm subscription.
  3. In the CloudShell console, use the following command to list the subscriptions, to verify the topic and email subscription.
    aws sns list-subscriptions

    You should see a response that shows subscription details like the email address and ARN, as shown in Figure 3.

    Figure 3: Subscribed email address and SNS topic

  4. Use the following command to send a test message to your subscribed email address and verify that you receive it, replacing <REGION_NAME> and <AWS_ACCOUNTID> with the relevant values.
    aws sns publish \
        --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
        --message "Hello from Amazon Inspector2"

Step 3: Set up Amazon EventBridge with a custom rule and the SNS topic as target

Create an EventBridge rule that will invoke your previously created SNS topic whenever Amazon Inspector finds a new vulnerability with a critical severity.

To set up the EventBridge custom rule

  1. In the CloudShell console, use the following command to create an EventBridge rule named amazon-inspector-findings with filters InspectorScore greater than 8 and severity state set to CRITICAL.
    aws events put-rule \
        --name "amazon-inspector-findings" \
        --event-pattern "{\"source\": [\"aws.inspector2\"],\"detail-type\": [\"Inspector2 Finding\"],\"detail\": {\"inspectorScore\": [ { \"numeric\": [ \">\", 8] } ],\"severity\": [\"CRITICAL\"]}}"

    Refer to the topic Amazon EventBridge event schema for Amazon Inspector events to customize the event pattern for your application needs.

  2. To verify the rule creation, go to the EventBridge console and in the left navigation bar, choose Rules.
  3. Choose the rule with the name amazon-inspector-findings. You should see the event pattern as shown in Figure 4.
    Figure 4: Event pattern for the EventBridge rule to filter on CRITICAL vulnerabilities.

  4. Add the SNS topic you previously created as the target to the EventBridge rule. Replace <REGION_NAME>, <AWS_ACCOUNTID>, and <RANDOM-UNIQUE-IDENTIFIER-VALUE> with the relevant values. For RANDOM-UNIQUE-IDENTIFIER-VALUE, create a memorable and unique string.
    aws events put-targets \
        --rule amazon-inspector-findings \
        --targets "Id"="<RANDOM-UNIQUE-IDENTIFIER-VALUE>","Arn"="arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier"

    Important: Save the target ID. You will need this in order to delete the target in the last step.

  5. Provide permission for Amazon EventBridge to publish to the SNS topic amazon-inspector-findings-notifier. (Optional verification commands are sketched after this procedure.)
    aws sns set-topic-attributes --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
    --attribute-name Policy \
    --attribute-value "{\"Version\":\"2012-10-17\",\"Id\":\"__default_policy_ID\",\"Statement\":[{\"Sid\":\"PublishEventsToMyTopic\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"events.amazonaws.com\"},\"Action\":\"sns:Publish\",\"Resource\":\"arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier\"}]}"
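
Optionally, you can confirm the wiring with two read-only checks before moving on: the first lists the targets attached to the rule, and the second returns the topic attributes (including the access policy you just set). These are sketches; replace the placeholders as before.

aws events list-targets-by-rule --rule amazon-inspector-findings

aws sns get-topic-attributes \
    --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier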

Step 4: Deploy the Lambda function to the AWS account by using AWS SAM

In this step, you will use AWS Serverless Application Model (AWS SAM) quick start templates to build and deploy a Lambda function with a vulnerable library, in order to generate findings. Learn more about AWS SAM.

To deploy the Lambda function with a vulnerable library

  1. In the CloudShell console, use a prebuilt “hello-world” AWS SAM template to deploy the Lambda function.
    sam init --runtime python3.7 --dependency-manager pip --app-template hello-world --name sam-app

  2. Use the following command to add the vulnerable package python-jwt==3.3.3 to the Lambda function.
    cd sam-app;
    echo -e 'requests\npython-jwt==3.3.3' > hello_world/requirements.txt

  3. Use the following command to build the application.
    sam build

  4. Use the following command to deploy the application with the guided option.
    sam deploy --guided

    This command packages and deploys the application to your AWS account. It provides a series of prompts. Respond to the prompts as follows:

    1. Enter the stack name you want.
    2. Accept the default options, except for the following two prompts:
      1. At the HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: prompt, enter y and press Enter.
      2. At the Deploy this changeset? [y/N]: prompt, enter y and press Enter.

    A non-interactive alternative is sketched after this list.
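
If you prefer to skip the interactive prompts (for example, when scripting this walkthrough), the following is a sketch of an equivalent non-guided deployment. The stack name sam-app is an assumption, and the flags shown are standard AWS SAM CLI options; adjust them to your environment.

sam deploy \
    --stack-name sam-app \
    --resolve-s3 \
    --capabilities CAPABILITY_IAM \
    --no-confirm-changeset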

Step 5: View Amazon Inspector findings

Amazon Inspector will automatically generate findings when scanning the Lambda function previously deployed. To view those findings, follow the steps below.

To view Amazon Inspector findings for the vulnerability

  1. Navigate to the Amazon Inspector console.
  2. In the left navigation menu, choose All findings to see all of the Active findings, as shown in Figure 5.

    Due to the custom event pattern rule in Amazon EventBridge, even though there are multiple findings for the vulnerable package python-jwt==3.3.3, you will be notified only for the finding that has InspectorScore greater than 8 and severity CRITICAL.

  3. Choose the title of each finding to see detailed information about the vulnerability. (A CLI alternative for listing these findings is sketched after this procedure.)
    Figure 5: Example of findings from the Amazon Inspector console
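
You can also pull the same findings from the command line. The following sketch assumes the Amazon Inspector filter-criteria format of comparison/value string filters, and lists active CRITICAL findings with their titles and scores.

aws inspector2 list-findings \
    --filter-criteria '{"findingStatus":[{"comparison":"EQUALS","value":"ACTIVE"}],"severity":[{"comparison":"EQUALS","value":"CRITICAL"}]}' \
    --query 'findings[].{title:title,score:inspectorScore}' \
    --output table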

Step 6: Remediate the vulnerability by applying the fixed package version

Now you can remediate the vulnerability by updating the package version as suggested by Amazon Inspector.

To remediate the vulnerability

  1. In the Amazon Inspector console, in the left navigation menu, choose All Findings.
  2. Choose the title of the vulnerability to see the finding details and the remediation recommendations.
    Figure 6: Amazon Inspector finding for python-jwt, with the associated remediation

  3. To remediate, use the following command to update the package version to the fixed version as suggested by Amazon Inspector.
    cd /home/cloudshell-user/sam-app;
    echo -e "requests\npython-jwt==3.3.4" > hello_world/requirements.txt

  4. Use the following command to build the application.
    sam build

  5. Use the following command to deploy the application with the guided option.
    sam deploy --guided

    This command packages and deploys the application to your AWS account. Respond to the prompts as you did in Step 4: enter the stack name you used before, accept the default options, and enter y at both the HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: prompt and the Deploy this changeset? [y/N]: prompt.
  6. Amazon Inspector automatically rescans the function after its deployment and reevaluates the findings. At this point, you can navigate back to the Amazon Inspector console, and in the left navigation menu, choose All findings. In the Findings area, you can see that the vulnerabilities are moved from Active to Closed status.

    Due to the custom event pattern rule in Amazon EventBridge, you will be notified by email with finding status as CLOSED.

    Figure 7: Inspector rescan results, showing no open findings after remediation

(Optional) Step 7: Activate Amazon Inspector in multiple accounts and Regions

To benefit from Amazon Inspector scanning capabilities across the accounts that you have in AWS Organizations and in your selected Regions, use the following steps:

To activate Amazon Inspector in multiple accounts and Regions

  1. In the CloudShell console, use the following command to clone the code from the aws-samples inspector2-enablement-with-cli GitHub repo.
    cd /home/cloudshell-user;
    git clone https://github.com/aws-samples/inspector2-enablement-with-cli.git;
    cd inspector2-enablement-with-cli

  2. Follow the instructions from the README.md file.
  3. Configure the file param_inspector2.json with the relevant values, as follows:
    • inspector2_da: The delegated administrator account ID for Amazon Inspector to manage member accounts.
    • scanning_type: The resource types (EC2, ECR, LAMBDA) to be enabled by Amazon Inspector.
    • auto_enable: The resource types to be enabled on every account that is newly attached to the delegated administrator.
    • regions: Because Amazon Inspector is a regional service, provide the list of AWS Regions to enable.
  4. Select the AWS account that would be used as the delegated administrator account (<DA_ACCOUNT_ID>).
  5. Delegate an account as the admin for Amazon Inspector by using the following command.
    ./inspector2_enablement_with_awscli.sh -a delegate_admin -da <DA_ACCOUNT_ID>

  6. Activate the delegated admin by using the following command:
    ./inspector2_enablement_with_awscli.sh -a activate -t <DA_ACCOUNT_ID> -s all

  7. Associate the member accounts by using the following command:
    ./inspector2_enablement_with_awscli.sh -a associate -t members

  8. Wait five minutes.
  9. Enable the resource types (EC2, ECR, LAMBDA) on your member accounts by using the following command:
    ./inspector2_enablement_with_awscli.sh -a activate -t members

  10. Enable Amazon Inspector on the new member accounts that are associated with the organization by using the following command:
    ./inspector2_enablement_with_awscli.sh -auto_enable

  11. Check the Amazon Inspector status in your accounts and in multiple selected Regions by using the following command:
    ./inspector2_enablement_with_awscli.sh -a get_status

There are other options you can use to enable Amazon Inspector in multiple accounts, like AWS Control Tower and Terraform. For the reference architecture for Control Tower, see the AWS Security Reference Architecture Examples on GitHub. For more information on the Terraform option, see the Terraform aws_inspector2_enabler resource page.

Step 8: Delete the resources created in the previous steps

AWS offers a 15-day free trial for Amazon Inspector so that you can evaluate the service and estimate its cost.

To avoid potential charges, delete the AWS resources that you created in the previous steps of this solution (Lambda function, EventBridge target, EventBridge rule, and SNS topic), and deactivate Amazon Inspector.

To delete resources

  1. In the CloudShell console, enter the sam-app folder.
    cd /home/cloudshell-user/sam-app

  2. Delete the Lambda function and confirm by typing “y” when prompted for confirmation.
    sam delete

  3. Remove the SNS target from the Amazon EventBridge rule.
    aws events remove-targets --rule "amazon-inspector-findings" --ids <RANDOM-UNIQUE-IDENTIFIER-VALUE>

    Note: If you don’t remember the target ID, navigate to the Amazon EventBridge console, and in the left navigation menu, choose Rules. Select the rule that you want to delete. Choose CloudFormation, and copy the ID.

  4. Delete the EventBridge rule.
    aws events delete-rule --name amazon-inspector-findings

  5. Delete the SNS topic.
    aws sns delete-topic --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier

  6. Disable Amazon Inspector.
    aws inspector2 disable --resource-types '["LAMBDA"]'

    Follow the next few steps to roll back changes only if you have performed the activities listed in Step 7: Activate Amazon Inspector in multiple accounts and Regions.

  7. In the CloudShell console, enter the folder inspector2-enablement-with-cli.
    cd /home/cloudshell-user/inspector2-enablement-with-cli

  8. Deactivate the resource types (EC2, ECR, LAMBDA) on your member accounts.
    ./inspector2_enablement_with_awscli.sh -a deactivate -t members -s all

  9. Disassociate the member accounts.
    ./inspector2_enablement_with_awscli.sh -a disassociate -t members

  10. Deactivate the delegated admin account.
    ./inspector2_enablement_with_awscli.sh -a deactivate -t <DA_ACCOUNT_ID> -s all

  11. Remove the delegated account as the admin for Amazon Inspector.
    ./inspector2_enablement_with_awscli.sh -a remove_admin -da <DA_ACCOUNT_ID>

Conclusion

In this blog post, we discussed how you can use Amazon Inspector to continuously scan your Lambda functions, and how to configure an Amazon EventBridge rule and SNS to send out notification of Lambda function vulnerabilities in near real time. You can then perform remediation activities by using AWS Lambda or AWS Systems Manager. We also showed how to enable Amazon Inspector at scale, activating in both single and multiple accounts, in default and multiple Regions.

As of the writing of this post, a new feature to perform code scans for Lambda functions is available. Amazon Inspector can now also scan the custom application code within a Lambda function for code security vulnerabilities such as injection flaws, data leaks, weak cryptography, or missing encryption, based on AWS security best practices. You can use this additional scanning functionality to further protect your workloads.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Inspector forum or contact AWS Support.

 
Want more AWS Security news? Follow us on Twitter.

Manjunath Arakere

Manjunath is a Senior Solutions Architect in the Worldwide Public Sector team at AWS. He works with Public Sector partners to design and scale well-architected solutions, and he supports their cloud migrations and application modernization initiatives. Manjunath specializes in migration, modernization and serverless technology.

Stéphanie Mbappe

Stéphanie is a Security Consultant with Amazon Web Services. She delights in assisting her customers at every step of their security journey. Stéphanie enjoys learning, designing new solutions, and sharing her knowledge with others.

Migrate your existing SQL-based ETL workload to an AWS serverless ETL infrastructure using AWS Glue

Post Syndicated from Mitesh Patel original https://aws.amazon.com/blogs/big-data/migrate-your-existing-sql-based-etl-workload-to-an-aws-serverless-etl-infrastructure-using-aws-glue/

Data has become an integral part of most companies, and the complexity of data processing is increasing rapidly with the exponential growth in the amount and variety of data. Data engineering teams are faced with the following challenges:

  • Manipulating data to make it consumable by business users
  • Building and improving extract, transform, and load (ETL) pipelines
  • Scaling their ETL infrastructure

Many customers migrating data to the cloud are looking for ways to modernize by using native AWS services to further scale and efficiently handle ETL tasks. In the early stages of their cloud journey, customers may need guidance on modernizing their ETL workload with minimal effort and time. Customers often use many SQL scripts to select and transform the data in relational databases hosted either in an on-premises environment or on AWS and use custom workflows to manage their ETL.

AWS Glue is a serverless data integration and ETL service with the ability to scale on demand. In this post, we show how you can migrate your existing SQL-based ETL workload to AWS Glue using Spark SQL, which minimizes the refactoring effort.

Solution overview

The following diagram describes the high-level architecture for our solution. This solution decouples the ETL and analytics workloads from our transactional data source Amazon Aurora, and uses Amazon Redshift as the data warehouse solution to build a data mart. In this solution, we employ AWS Database Migration Service (AWS DMS) for both full load and continuous replication of changes from Aurora. AWS DMS enables us to capture deltas, including deletes from the source database, through its Change Data Capture (CDC) configuration, without writing code and without missing any changes, which is critical for the integrity of the data. Refer to CDC support in AWS DMS to extend the solution for ongoing CDC.

The workflow includes the following steps:

  1. AWS Database Migration Service (AWS DMS) connects to the Aurora data source.
  2. AWS DMS replicates data from Aurora and migrates to the target destination Amazon Simple Storage Service (Amazon S3) bucket.
  3. AWS Glue crawlers automatically infer schema information of the S3 data and integrate into the AWS Glue Data Catalog.
  4. AWS Glue jobs run ETL code to transform and load the data to Amazon Redshift.

For this post, we use the TPCH dataset for sample transactional data. The components of TPCH consist of eight tables. The relationships between columns in these tables are illustrated in the following diagram.

We use Amazon Redshift as the data warehouse to implement the data mart solution. The data mart fact and dimension tables are created in the Amazon Redshift database. The following diagram illustrates the relationships between the fact (ORDER) and dimension tables (DATE, PARTS, and REGION).

Set up the environment

To get started, we set up the environment using AWS CloudFormation. Complete the following steps:

  1. Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.
  2. Choose Launch Stack and open the page on a new tab:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. In the Parameters section, enter the required parameters.
  6. Choose Next.

  1. On the Configure stack options page, leave all values as default and choose Next.
  2. On the Review stack page, select the check boxes to acknowledge the creation of IAM resources.
  3. Choose Submit.

Wait for the stack creation to complete. You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you will see the status CREATE_COMPLETE. The stack takes approximately 25–30 minutes to complete.

This template configures the following resources:

  • The Aurora MySQL instance sales-db.
  • The AWS DMS task dmsreplicationtask-* for full load of data and replicating changes from Aurora (source) to Amazon S3 (destination).
  • AWS Glue crawlers s3-crawler and redshift_crawler.
  • The AWS Glue database salesdb.
  • AWS Glue jobs insert_region_dim_tbl, insert_parts_dim_tbl, and insert_date_dim_tbl. We use these jobs for the use cases covered in this post. We create the insert_orders_fact_tbl AWS Glue job manually using AWS Glue Studio.
  • The Redshift cluster blog_cluster with database sales and fact and dimension tables.
  • An S3 bucket to store the output of the AWS Glue job runs.
  • IAM roles and policies with appropriate permissions.

Replicate data from Aurora to Amazon S3

Now let’s look at the steps to replicate data from Aurora to Amazon S3 using AWS DMS:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task dmsreplicationtask-* and on the Action menu, choose Restart/Resume.

This will start the replication task to replicate the data from Aurora to the S3 bucket. Wait for the task status to change to Full Load Complete. The data from the Aurora tables is now copied to the S3 bucket under a new folder, sales.
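
If you would rather poll the task status from the command line than watch the console, the following sketch (assuming the default response shape of describe-replication-tasks) filters on the task name prefix created by the CloudFormation template and prints each matching task with its status.

aws dms describe-replication-tasks \
    --query 'ReplicationTasks[?starts_with(ReplicationTaskIdentifier, `dmsreplicationtask`)].{task:ReplicationTaskIdentifier,status:Status}' \
    --output table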

Create AWS Glue Data Catalog tables

Now let’s create AWS Glue Data Catalog tables for the S3 data and Amazon Redshift tables:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.
  2. Select RedshiftConnection and on the Actions menu, choose Edit.
  3. Choose Save changes.
  4. Select the connection again and on the Actions menu, choose Test connection.
  5. For IAM role, choose GlueBlogRole.
  6. Choose Confirm.

Testing the connection can take approximately 1 minute. You will see the message “Successfully connected to the data store with connection blog-redshift-connection.” If you have trouble connecting successfully, refer to Troubleshooting connection issues in AWS Glue.

  1. Under Data Catalog in the navigation pane, choose Crawlers.
  2. Select s3_crawler and choose Run.

This will generate eight tables in the AWS Glue Data Catalog. To view the tables created, in the navigation pane, choose Databases under Data Catalog, then choose salesdb.

  1. Repeat the steps to run redshift_crawler and generate four additional tables.

If the crawler fails, refer to Error: Running crawler failed.
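
The crawlers can also be run and monitored from the command line. The following is a sketch; the crawler name s3_crawler and database name salesdb come from the walkthrough above (the CloudFormation description also refers to s3-crawler, so use the name that appears in your account).

aws glue start-crawler --name s3_crawler

aws glue get-crawler --name s3_crawler --query 'Crawler.State' --output text

aws glue get-tables --database-name salesdb --query 'TableList[].Name' --output table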

Create SQL-based AWS Glue jobs

Now let’s look at how the SQL statements are used to create ETL jobs using AWS Glue. AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on virtual resources that it provisions and manages in its own service account. AWS Glue Studio is a graphical interface that makes it simple to create, run, and monitor ETL jobs in AWS Glue. You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, perform a transformation of that data, and save the result set in a data target.

Let’s go through the steps of creating an AWS Glue job for loading the orders fact table using AWS Glue Studio.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Visual with a blank canvas, then choose Create.

  1. Navigate to the Job details tab.
  2. For Name, enter insert_orders_fact_tbl.
  3. For IAM Role, choose GlueBlogRole.
  4. For Job bookmark, choose Enable.
  5. Leave all other parameters as default and choose Save.

  1. Navigate to the Visual tab.
  2. Choose the plus sign.
  3. Under Add nodes, enter Glue in the search bar and choose AWS Glue Data Catalog (Source) to add the Data Catalog as the source.

  1. In the right pane, on the Data source properties – Data Catalog tab, choose salesdb for Database and customer for Table.

  1. On the Node properties tab, for Name, enter Customers.

  1. Repeat these steps for the Orders and LineItem tables.

This concludes creating data sources on the AWS Glue job canvas. Next, we add transformations by combining data from these different tables.

Transform the data

Complete the following steps to add data transformations:

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose SQL Query.
  3. On the Transform tab, for Node parents, select all the three data sources.
  4. On the Transform tab, under SQL query, enter the following query:
SELECT orders.o_orderkey        AS ORDERKEY,
orders.o_orderdate       AS ORDERDATE,
lineitem.l_linenumber    AS LINENUMBER,
lineitem.l_partkey       AS PARTKEY,
lineitem.l_receiptdate   AS RECEIPTDATE,
lineitem.l_quantity      AS QUANTITY,
lineitem.l_extendedprice AS EXTENDEDPRICE,
orders.o_custkey         AS CUSTKEY,
customer.c_nationkey     AS NATIONKEY,
CURRENT_TIMESTAMP        AS UPDATEDATE
FROM   orders orders,
lineitem lineitem,
customer customer
WHERE  orders.o_orderkey = lineitem.l_orderkey
AND orders.o_custkey = customer.c_custkey
  1. Update the SQL alias values as shown in the following screenshot.

  1. On the Data preview tab, choose Start data preview session.
  2. When prompted, choose GlueBlogRole for IAM role and choose Confirm.

The data preview process will take a minute to complete.

  1. On the Output schema tab, choose Use data preview schema.

You will see the output schema similar to the following screenshot.

Now that we have previewed the data, we change a few data types.

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose Change Schema.
  3. Select the node.
  4. On the Transform tab, update the Data type values as shown in the following screenshot.

Now let’s add the target node.

  1. Choose the Change Schema node and choose the plus sign.
  2. In the search bar, enter target.
  3. Choose Amazon Redshift as the target.

  1. Choose the Amazon Redshift node, and on the Data target properties – Amazon Redshift tab, for Redshift access type, select Direct data connection.
  2. Choose RedshiftConnection for Redshift Connection, public for Schema, and order_table for Table.
  3. Select Merge data into target table under Handling of data and target table.
  4. Choose orderkey for Matching keys.

  1. Choose Save.

AWS Glue Studio automatically generates the Spark code for you. You can view it on the Script tab. If you would like to perform transformations beyond the built-in ones, you can modify the Spark code. The AWS Glue job uses an Apache Spark SQL query for the SQL query transformation. To find the available Spark SQL transformations, refer to the Spark SQL documentation.

  1. Choose Run to run the job.

As part of the CloudFormation stack, three other jobs are created to load the dimension tables.

  1. Navigate back to the Jobs page on the AWS Glue console, select the job insert_parts_dim_tbl, and choose Run.

This job uses the following SQL to populate the parts dimension table:

SELECT part.p_partkey,
part.p_type,
part.p_brand
FROM   part part
  1. Select the job insert_region_dim_tbl and choose Run.

This job uses the following SQL to populate the region dimension table:

SELECT nation.n_nationkey,
nation.n_name,
region.r_name
FROM   nation,
region
WHERE  nation.n_regionkey = region.r_regionkey
  1. Select the job insert_date_dim_tbl and choose Run.

This job uses the following SQL to populate the date dimension table:

SELECT DISTINCT( l_receiptdate )        AS DATEKEY,
Dayofweek(l_receiptdate) AS DAYOFWEEK,
Month(l_receiptdate)     AS MONTH,
Year(l_receiptdate)      AS YEAR,
Day(l_receiptdate)       AS DATE
FROM   lineitem lineitem

You can view the status of the running jobs by navigating to the Job run monitoring section on the Jobs page. Wait for all the jobs to complete. These jobs will load the data into the facts and dimension tables in Amazon Redshift.
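
The jobs can also be started and monitored from the command line. The following sketch starts the insert_parts_dim_tbl job created by the CloudFormation template and then polls its run state; the same pattern applies to the other jobs.

RUN_ID=$(aws glue start-job-run --job-name insert_parts_dim_tbl --query 'JobRunId' --output text)

aws glue get-job-run --job-name insert_parts_dim_tbl --run-id "$RUN_ID" \
    --query 'JobRun.JobRunState' --output text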

To help optimize the resources and cost, you can use the AWS Glue Auto Scaling feature.

Verify the Amazon Redshift data load

To verify the data load, complete the following steps:

  1. On the Amazon Redshift console, select the cluster blog-cluster and on the Query Data menu, choose Query in query editor 2.
  2. For Authentication, select Temporary credentials.
  3. For Database, enter sales.
  4. For User name, enter admin.
  5. Choose Save.

  1. Run the following commands in the query editor to verify that the data is loaded into the Amazon Redshift tables:
SELECT *
FROM   sales.PUBLIC.order_table;

SELECT *
FROM   sales.PUBLIC.date_table;

SELECT *
FROM   sales.PUBLIC.parts_table;

SELECT *
FROM   sales.PUBLIC.region_table;

The following screenshot shows the results from one of the SELECT queries.

Now, to demonstrate CDC, update the quantity of a line item for order number 1 in the Aurora database using the following query. (To connect to your Aurora cluster, use AWS Cloud9 or any SQL client tool, such as the MySQL command-line client.)

UPDATE lineitem SET l_quantity = 100 WHERE l_orderkey = 1 AND l_linenumber = 4;

AWS DMS replicates the change to the S3 bucket, as shown in the following screenshot.

Re-running the AWS Glue job insert_orders_fact_tbl applies the change to the ORDER fact table, as shown in the following screenshot.

Clean up

To avoid incurring future charges, delete the resources created for the solution:

  1. On the Amazon S3 console, select the S3 bucket created as part of the CloudFormation stack, then choose Empty.
  2. On the AWS CloudFormation console, select the stack that you created initially and choose Delete to delete all the resources created by the stack.

Conclusion

In this post, we showed how you can migrate existing SQL-based ETL to an AWS serverless ETL infrastructure using AWS Glue jobs. We used AWS DMS to migrate data from Aurora to an S3 bucket, then SQL-based AWS Glue jobs to move the data to fact and dimension tables in Amazon Redshift.

This solution demonstrates a one-time data load from Aurora to Amazon Redshift using AWS Glue jobs. You can extend this solution for moving the data on a scheduled basis by orchestrating and scheduling jobs using AWS Glue workflows. To learn more about the capabilities of AWS Glue, refer to AWS Glue.


About the Authors

Mitesh Patel is a Principal Solutions Architect at AWS with specialization in data analytics and machine learning. He is passionate about helping customers build scalable, secure, and cost-effective cloud-native solutions in AWS to drive business growth. He lives in the DC Metro area with his wife and two kids.

Sumitha AP is a Sr. Solutions Architect at AWS. She works with customers and helps them attain their business objectives by designing secure, scalable, reliable, and cost-effective solutions in the AWS Cloud. She has a focus on data and analytics and provides guidance on building analytics solutions on AWS.

Deepti Venuturumilli is a Sr. Solutions Architect in AWS. She works with commercial segment customers and AWS partners to accelerate customers’ business outcomes by providing expertise in AWS services and modernizing their workloads. She focuses on data analytics workloads and setting up modern data strategy on AWS.

Deepthi Paruchuri is an AWS Solutions Architect based in NYC. She works closely with customers to build cloud adoption strategy and solve their business needs by designing secure, scalable, and cost-effective solutions in the AWS cloud.

Deploy serverless applications in a multicloud environment using Amazon CodeCatalyst

Post Syndicated from Deepak Kovvuri original https://aws.amazon.com/blogs/devops/deploy-serverless-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

Amazon CodeCatalyst is an integrated service for software development teams adopting continuous integration and deployment practices into their software development process. CodeCatalyst puts the tools you need all in one place. You can plan work, collaborate on code, and build, test, and deploy applications by leveraging CodeCatalyst Workflows.

Introduction

In the first post of the blog series, we showed you how organizations can deploy workloads to instances and virtual machines (VMs) across hybrid and multicloud environments. The second post of the series covered deploying containerized applications in a multicloud environment. Finally, in this post, we explore how organizations can deploy modern, cloud-native, serverless applications across multiple cloud platforms. Figure 1 shows the solution that we walk through in this post.

Figure 1 – Architecture diagram

The post walks through how to develop, deploy, and test an HTTP RESTful API on Azure Functions using Amazon CodeCatalyst. The solution covers the following steps:

  • Set up CodeCatalyst development environment and develop your application using the Serverless Framework.
  • Build a CodeCatalyst workflow to test and then deploy to Azure Functions using GitHub Actions in Amazon CodeCatalyst.

An Amazon CodeCatalyst workflow is an automated procedure that describes how to build, test, and deploy your code as part of a continuous integration and continuous delivery (CI/CD) system. You can use GitHub Actions alongside native CodeCatalyst actions in a CodeCatalyst workflow.

Pre-requisites

Walkthrough

In this post, we will create a hello world RESTful API using the Serverless Framework. As we progress through the solution, we will focus on building a CodeCatalyst workflow that deploys and tests the functionality of the application. At the end of the post, the workflow will look similar to the one shown in Figure 2.


Figure 2 – CodeCatalyst CI/CD workflow

Environment Setup

Before we start developing the application, we need to set up a CodeCatalyst project and then link a code repository to the project. The code repository can be a CodeCatalyst repository or GitHub. In this scenario, we use a GitHub repository. Once the solution is developed, the repository should look as shown in Figure 3.


Figure 3 – Files in GitHub repository

In Amazon CodeCatalyst, there’s an option to create Dev Environments, which can be used to work on the code stored in the source repositories of a project. In this post, we create a Dev Environment, associate it with the source repository created above, and work from it. Alternatively, you can skip the Dev Environment, run the following commands locally, and commit the results to the repository. The /projects directory of a Dev Environment stores the files that are pulled from the source repository. In the Dev Environment, install the Serverless Framework using this command:

npm install -g serverless

and then initialize a Serverless Framework project in the source repository folder so that its structure looks like the following (a possible scaffolding command is sketched after the listing):

├── README.md
├── host.json
├── package.json
├── serverless.yml
└── src
    └── handlers
        ├── goodbye.js
        └── hello.js
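
The command used to scaffold this project isn't shown above. One possible scaffolding step, sketched under the assumption that the Serverless Framework's Node.js Azure template is used, is the following; you would still adjust the generated serverless.yml and handlers to match the structure shown.

# Hypothetical scaffolding step (run inside the repository folder);
# the azure-nodejs template is an assumption -- adjust serverless.yml and the handlers afterward.
serverless create --template azure-nodejs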

We can push the code to the CodeCatalyst project using git. Now that we have the code in CodeCatalyst, we can turn our focus to building the workflow using the CodeCatalyst console.

CI/CD Setup in CodeCatalyst

Configure access to the Azure Environment

We’ll use the GitHub action for Serverless to create and manage the Azure Function. For the action to be able to access the Azure environment, it requires credentials associated with a Service Principal, passed to the action as environment variables.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Storing these values in plaintext anywhere in your repository should be avoided, because anyone with access to the repository can see them. Similarly, these values shouldn’t be used directly in any workflow definitions, because they will be visible as files in your repository. With CodeCatalyst, we can protect these values by storing them as secrets within the project, and then reference the secrets in the CI/CD workflow.

We can create a secret by choosing Secrets (1) under CI/CD and then selecting Create Secret (2), as shown in Figure 4. Now, we can enter the secret name and value for each of the identifiers described above.

Figure 4 – CodeCatalyst Secrets

Building the workflow

To create a new workflow, select CI/CD from navigation on the left and then select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.


Figure 5 – Create CI/CD workflow

If the workflow editor opens in YAML mode, select Visual to open the visual designer. Now, we can start adding actions to the workflow.

Configure the Deploy action

We’ll begin by adding a GitHub action for deploying to Azure. Select “+ Actions” to open the actions list and choose GitHub from the dropdown menu. Find the Build action and click “+” to add a new GitHub action to the workflow.

Next, configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Deploy to Azure Functions
  uses: serverless/[email protected]
  with:
    args: -c "serverless plugin install --name serverless-azure-functions && serverless deploy"
    entrypoint: /bin/sh
  env:
    AZURE_SUBSCRIPTION_ID: ${Secrets.SUBSCRIPTION_ID}
    AZURE_TENANT_ID: ${Secrets.TENANT_ID}
    AZURE_CLIENT_ID: ${Secrets.CLIENT_ID}
    AZURE_CLIENT_SECRET: ${Secrets.CLIENT_SECRET}

The above workflow configuration makes use of the Serverless GitHub Action, which wraps the Serverless Framework to run serverless commands. The action is configured to package and deploy the source code to Azure Functions using the serverless deploy command.

Please note how we were able to pass the secrets to GitHub action by referencing the secret identifiers in the above configuration.

Configure the Test action

Similar to the previous step, we add another GitHub action which will use the serverless framework’s serverless invoke command to test the API deployed on to Azure Functions.

- name: Test Function
  uses: serverless/[email protected]
  with:
    args: |
      -c "serverless plugin install --name serverless-azure-functions && \
          serverless invoke -f hello -d '{\"name\": \"CodeCatalyst\"}' && \
          serverless invoke -f goodbye -d '{\"name\": \"CodeCatalyst\"}'"
    entrypoint: /bin/sh
  env:
    AZURE_SUBSCRIPTION_ID: ${Secrets.SUBSCRIPTION_ID}
    AZURE_TENANT_ID: ${Secrets.TENANT_ID}
    AZURE_CLIENT_ID: ${Secrets.CLIENT_ID}
    AZURE_CLIENT_SECRET: ${Secrets.CLIENT_SECRET}

The workflow is now ready and can be validated by choosing ‘Validate’ and then saved to the repository by choosing ‘Commit’. The workflow should automatically kick off after the commit, and the application is automatically deployed to Azure Functions.

The functionality of the API can now be verified from the logs of the test action of the workflow as shown in Figure 6.


Figure 6 – CI/CD workflow Test action

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the Azure Function App (usually prefixed ‘sls’) using the Azure console. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project. There’s no cost associated with the CodeCatalyst project and you can continue using it.

Conclusion

In summary, this post highlighted how Amazon CodeCatalyst can help organizations deploy cloud-native, serverless workloads into a multicloud environment. The post also walked through the solution in detail, covering the process of setting up Amazon CodeCatalyst to deploy a serverless application to Azure Functions by leveraging GitHub Actions. Though we showed an application deployment to Azure Functions, you can follow a similar process and leverage CodeCatalyst to deploy any type of application to almost any cloud platform. Learn more and get started with your Amazon CodeCatalyst journey!

We would love to hear your thoughts and experiences on deploying serverless applications to multiple cloud platforms. Reach out to us if you have any questions, or provide your feedback in the comments section.

About Authors

Picture of Deepak

Deepak Kovvuri

Deepak Kovvuri is a Senior Solutions Architect supporting Enterprise Customers at AWS in the US East area. He has over 6 years of experience in helping customers architect a DevOps strategy for their cloud workloads. Deepak specializes in CI/CD, Systems Administration, Infrastructure as Code, and Container Services. He holds a Master's in Computer Engineering from University of Illinois at Chicago.

Picture of Amandeep

Amandeep Bajwa

Amandeep Bajwa is a Senior Solutions Architect at AWS supporting Financial Services enterprises. He helps organizations achieve their business outcomes by identifying the appropriate cloud transformation strategy based on industry trends, and organizational priorities. Some of the areas Amandeep consults on are cloud migration, cloud strategy (including hybrid & multicloud), digital transformation, data & analytics, and technology in general.

Picture of Brian

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Picture of Pawan

Pawan Shrivastava

Pawan Shrivastava is a Partner Solution Architect at AWS in the WWPS team. He focusses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and designing solutions to meet their specific needs. Pawan is passionate about DevOps, automation and CI CD pipelines. He enjoys watching MMA, playing cricket and working out in the Gym.

Deploy container applications in a multicloud environment using Amazon CodeCatalyst

Post Syndicated from Pawan Shrivastava original https://aws.amazon.com/blogs/devops/deploy-container-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

In the previous post of this blog series, we saw how organizations can deploy workloads to virtual machines (VMs) in a hybrid and multicloud environment. This post shows how organizations can address the requirement of deploying containers, and containerized applications to hybrid and multicloud platforms using Amazon CodeCatalyst. CodeCatalyst is an integrated DevOps service which enables development teams to collaborate on code, and build, test, and deploy applications with continuous integration and continuous delivery (CI/CD) tools.

One prominent scenario where multicloud container deployment is useful is when organizations want to leverage AWS’ broadest and deepest set of Artificial Intelligence (AI) and Machine Learning (ML) capabilities by developing and training AI/ML models in AWS using Amazon SageMaker, and deploying the model package to a Kubernetes platform on other cloud platforms, such as Azure Kubernetes Service (AKS) for inference. As shown in this workshop for operationalizing the machine learning pipeline, we can train an AI/ML model, push it to Amazon Elastic Container Registry (ECR) as an image, and later deploy the model as a container application.

Scenario description

The solution described in the post covers the following steps:

  • Set up the Amazon CodeCatalyst environment.
  • Create a Dockerfile along with a manifest for the application, and a repository in Amazon ECR.
  • Create an Azure service principal that has permissions to deploy resources to Azure Kubernetes Service (AKS), and store the credentials securely in an Amazon CodeCatalyst secret.
  • Create a CodeCatalyst workflow to build, test, and deploy the containerized application to the AKS cluster using GitHub Actions.

The architecture diagram for the scenario is shown in Figure 1.


Figure 1 – Solution Architecture

Solution Walkthrough

This section shows how to set up the environment, and deploy a HTML application to an AKS cluster.

Set up Amazon ECR and the GitHub code repository

Create a new Amazon ECR repository and a code repository. In this case we’re using GitHub as the repository, but you can create a source repository in CodeCatalyst, or you can link an existing source repository hosted by another service if that service is supported by an installed extension. Then follow the application and Docker image creation steps outlined in Step 1 of the environment creation process in Exposing Multiple Applications on Amazon EKS. Create a file named manifest.yaml as shown, and map the image parameter to the URL of the Amazon ECR repository created above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multicloud-container-deployment-app
  labels:
    app: multicloud-container-deployment-app
spec:
  selector:
    matchLabels:
      app: multicloud-container-deployment-app
  replicas: 2
  template:
    metadata:
      labels:
        app: multicloud-container-deployment-app
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": linux
      containers:
      - name: ecs-web-page-container
        image: <aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository>
        imagePullPolicy: Always
        ports:
            - containerPort: 80
        resources:
          limits:
            memory: "100Mi"
            cpu: "200m"
      imagePullSecrets:
          - name: ecrsecret
---
apiVersion: v1
kind: Service
metadata:
  name: multicloud-container-deployment-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: multicloud-container-deployment-app

Push the files to the GitHub code repository. The multicloud-container-app GitHub repository should look similar to Figure 2 below.


Figure 2 – Files in Github repository

Configure Azure Kubernetes Service (AKS) cluster to pull private images from ECR repository

Pull the Docker images from a private ECR repository to your AKS cluster by running the following command. This setup is required by the azure/k8s-deploy GitHub Action in the CI/CD workflow. Authenticate Docker to the Amazon ECR registry by using aws ecr get-login-password. Run the following command in a shell where the AWS CLI is configured and which is used to connect to the AKS cluster. This creates a secret called ecrsecret, which is used to pull an image from the private ECR repository.

kubectl create secret docker-registry ecrsecret \
 --docker-server=<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository> \
 --docker-username=AWS \
 --docker-password=$(aws ecr get-login-password --region us-west-2)

Provide the ECR URI in the --docker-server parameter.
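
To confirm that the secret was created as expected, you can inspect it with kubectl (a read-only sketch). Note that ECR authorization tokens expire after 12 hours, so if the cluster needs to pull images later than that, refresh the secret by re-running the create command.

kubectl get secret ecrsecret -o jsonpath='{.type}'   # should print kubernetes.io/dockerconfigjson
kubectl describe secret ecrsecret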

CodeCatalyst setup

Follow these steps to set up the CodeCatalyst environment:

Configure access to the AKS cluster

In this solution, we use three GitHub Actions – azure/login, azure/aks-set-context, and azure/k8s-deploy – to log in, set the AKS cluster context, and deploy the manifest file to the AKS cluster, respectively. For the GitHub Actions to access the Azure environment, they require credentials associated with an Azure Service Principal.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Create the Service Principal by running the following command in the Azure Cloud Shell:

az ad sp create-for-rbac \
    --name "ghActionHTMLapplication" \
    --scope /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP> \
    --role Contributor \
    --sdk-auth

The command generates a JSON output (shown in Figure 3), which is stored in a CodeCatalyst secret called AZURE_CREDENTIALS. This credential is used by the azure/login GitHub Action.


Figure 3 – JSON output

Configure secrets inside CodeCatalyst Project

Create three secrets, CLUSTER_NAME (the name of the AKS cluster), RESOURCE_GROUP (the name of the Azure resource group), and AZURE_CREDENTIALS (described in the previous step), as described in the working with secrets documentation. The secrets are shown in Figure 4.


Figure 4 – CodeCatalyst Secrets

CodeCatalyst CI/CD Workflow

To create a new CodeCatalyst workflow, select CI/CD from the navigation on the left and select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.


Figure 5 – Create CodeCatalyst CI/CD workflow

Add “Push to Amazon ECR” Action

Add the Push to Amazon ECR action, and configure the environment where you created the ECR repository as shown in Figure 6. Refer to adding an action to learn how to add CodeCatalyst action.


Figure 6 – Create ‘Push to ECR’ Action

Select the Configuration tab and specify the configurations as shown in Figure 7.


Figure 7 – Configure ‘Push to ECR’ Action

Configure the Deploy action

1. Add a GitHub action for deploying to AKS as shown in Figure 8.


Figure 8 – Github action to deploy to AKS

2. Configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Install Azure CLI
  run: pip install azure-cli
- name: Azure login
  id: login
  uses: azure/[email protected]
  with:
    creds: ${Secrets.AZURE_CREDENTIALS}
- name: Set AKS context
  id: set-context
  uses: azure/aks-set-context@v3
  with:
    resource-group: ${Secrets.RESOURCE_GROUP}
    cluster-name: ${Secrets.CLUSTER_NAME}
- name: Setup kubectl
  id: install-kubectl
  uses: azure/setup-kubectl@v3
- name: Deploy to AKS
  id: deploy-aks
  uses: Azure/k8s-deploy@v4
  with:
    namespace: default
    manifests: manifest.yaml
    pull-images: true


Figure 9 – Github action configuration

3. The workflow is now ready and can be validated by choosing ‘Validate’ and then saved to the repository by choosing ‘Commit’.
We have now implemented an automated CI/CD workflow that builds the container image of the application (refer to Figure 10), pushes the image to ECR, and deploys the application to the AKS cluster. This CI/CD workflow is triggered as application code is pushed to the repository.


Figure 10 – Automated CI/CD workflow

Test the deployment

When the HTML application runs, Kubernetes exposes the application using a public facing load balancer. To find the external IP of the load balancer, connect to the AKS cluster and run the following command:

kubectl get service multicloud-container-deployment-service

The output of the above command should look like the image in Figure 11.


Figure 11 – Output of kubectl get service
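
If you prefer to capture the external IP directly from the command line, the following sketch uses a standard kubectl JSONPath expression; the field path assumes the load balancer reports an IP address rather than a hostname.

kubectl get service multicloud-container-deployment-service \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'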

Paste the External IP into a browser to see the running HTML application as shown in Figure 12.


Figure 12 – Application running in AKS

Cleanup

If you have been following along with the workflow described in the post, you should delete the resources you deployed so you do not continue to incur charges. First, delete the Amazon ECR repository using the AWS console. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project. There’s no cost associated with the CodeCatalyst project and you can continue using it. Finally, if you deployed the application on a new AKS cluster, delete the cluster from the Azure console. If you deployed the application to an existing AKS cluster, run the following commands to delete the application resources.

kubectl delete deployment multicloud-container-deployment-app
kubectl delete services multicloud-container-deployment-service

Conclusion

In summary, this post showed how Amazon CodeCatalyst can help organizations deploy containerized workloads in a hybrid and multicloud environment. It demonstrated in detail how to set up and configure Amazon CodeCatalyst to deploy a containerized application to Azure Kubernetes Service, leveraging a CodeCatalyst workflow and GitHub Actions. Learn more and get started with your Amazon CodeCatalyst journey!

If you have any questions or feedback, leave them in the comments section.

About Authors

Picture of Pawan

Pawan Shrivastava

Pawan Shrivastava is a Partner Solution Architect at AWS in the WWPS team. He focusses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and designing solutions to meet their specific needs. Pawan is passionate about DevOps, automation and CI CD pipelines. He enjoys watching MMA, playing cricket and working out in the gym.

Picture of Brent

Brent Van Wynsberge

Brent Van Wynsberge is a Solutions Architect at AWS supporting enterprise customers. He accelerates the cloud adoption journey for organizations by aligning technical objectives to business outcomes and strategic goals, and defining them where needed. Brent is an IoT enthusiast, specifically in the application of IoT in manufacturing, he is also interested in DevOps, data analytics and containers.

Picture of Amandeep

Amandeep Bajwa

Amandeep Bajwa is a Senior Solutions Architect at AWS supporting Financial Services enterprises. He helps organizations achieve their business outcomes by identifying the appropriate cloud transformation strategy based on industry trends, and organizational priorities. Some of the areas Amandeep consults on are cloud migration, cloud strategy (including hybrid & multicloud), digital transformation, data & analytics, and technology in general.

Picture of Brian

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Extend your data mesh with Amazon Athena and federated views

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/extend-your-data-mesh-with-amazon-athena-and-federated-views/

Amazon Athena is a serverless, interactive analytics service built on the Trino, PrestoDB, and Apache Spark open-source frameworks. You can use Athena to run SQL queries on petabytes of data stored on Amazon Simple Storage Service (Amazon S3) in widely used formats such as Parquet and open-table formats like Apache Iceberg, Apache Hudi, and Delta Lake. However, Athena also allows you to query data stored in 30 different data sources—in addition to Amazon S3—including relational, non-relational, and object stores running on premises or in other cloud environments.

In Athena, we refer to queries on non-Amazon S3 data sources as federated queries. These queries run on the underlying database, which means you can analyze the data without learning a new query language and without the need for separate extract, transform, and load (ETL) scripts to extract, duplicate, and prepare data for analysis.

Recently, Athena added support for creating and querying views on federated data sources to bring greater flexibility and ease of use to use cases such as interactive analysis and business intelligence reporting. Athena also updated its data connectors with optimizations that improve performance and reduce cost when querying federated data sources. The updated connectors use dynamic filtering and an expanded set of predicate pushdown optimizations to perform more operations in the underlying data source rather than in Athena. As a result, you get faster queries with less data scanned, especially on tables with millions to billions of rows of data.

In this post, we show how to create and query views on federated data sources in a data mesh architecture featuring data producers and consumers.

The term data mesh refers to a data architecture with decentralized data ownership. A data mesh enables domain-oriented teams with the data they need, emphasizes self-service, and promotes the notion of purpose-built data products. In a data mesh, data producers expose datasets to the organization and data consumers subscribe to and consume the data products created by producers. By distributing data ownership to cross-functional teams, a data mesh can foster a culture of collaboration, invention, and agility around data.

Let’s dive into the solution.

Solution overview

For this post, imagine a hypothetical ecommerce company that uses multiple data sources, each playing a different role:

  • In an S3 data lake, ecommerce records are stored in a table named Lineitems
  • Amazon ElastiCache for Redis stores Nations and ActiveOrders data, ensuring ultra-fast reads of operational data by downstream ecommerce systems
  • On Amazon Relational Database Service (Amazon RDS), MySQL is used to store data like email addresses and shipping addresses in the Orders, Customer, and Suppliers tables
  • For flexibility and low-latency reads and writes, an Amazon DynamoDB table holds Part and Partsupp data

We want to query these data sources in a data mesh design. In the following sections, we set up Athena data source connectors for MySQL, DynamoDB, and Redis, and then run queries that perform complex joins across these data sources. The following diagram depicts our data architecture.

Architecture diagram

As you proceed with this solution, note that you will create AWS resources in your account. We have provided you with an AWS CloudFormation template that defines and configures the required resources, including the sample MySQL database, S3 tables, Redis store, and DynamoDB table. The template also creates the AWS Glue database and tables, S3 bucket, Amazon S3 VPC endpoint, AWS Glue VPC endpoint, and other AWS Identity and Access Management (IAM) resources that are used in the solution.

The template is designed to demonstrate how to use federated views in Athena, and is not intended for production use without modification. Additionally, the template uses the us-east-1 Region and will not work in other Regions without modification. The template creates resources that incur costs while they are in use. Follow the cleanup steps at the end of this post to delete the resources and avoid unnecessary charges.

Prerequisites

Before you launch the CloudFormation stack, ensure you have the following prerequisites:

  • An AWS account that provides access to AWS services
  • An IAM user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI), and permissions to create an IAM role, IAM policies, and stacks in AWS CloudFormation

Create resources with AWS CloudFormation

To get started, complete the following steps:

  1. Choose Launch Stack: Cloudformation Launch Stack
  2. Select I acknowledge that this template may create IAM resources.

The CloudFormation stack takes approximately 20–30 minutes to complete. You can monitor its progress on the AWS CloudFormation console. When the status reads CREATE_COMPLETE, your AWS account will have the resources necessary to implement this solution.

Deploy connectors and connect to data sources

With our resources provisioned, we can begin to connect the dots in our data mesh. Let’s start by connecting the data sources created by the CloudFormation stack with Athena.

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Data sources, select MySQL, then choose Next.
  4. For Data source name, enter a name, such as mysql. The Athena connector for MySQL is an AWS Lambda function that was created for you by the CloudFormation template.
  5. For Connection details, choose Select or enter a Lambda function.
  6. Choose mysql, then choose Next.
  7. Review the information and choose Create data source.
  8. Return to the Data sources page and choose mysql.
  9. On the connector details page, choose the link under Lambda function to access the Lambda console and inspect the function associated with this connector.
    mysql Data Source details
  10. Return to the Athena query editor.
  11. For Data source, choose mysql.
  12. For Database, choose the sales database.
  13. For Tables, you should see a listing of MySQL tables that are ready for you to query.
  14. Repeat these steps to set up the connectors for DynamoDB and Redis.

After all four data sources are configured, we can see the data sources on the Data source drop-down menu. All other databases and tables, like the lineitem table, which is stored on Amazon S3, are defined in the AWS Glue Data Catalog and can be accessed by choosing AwsDataCatalog as the data source.

This image shows AwsDataCatalog being selected as the data source
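If you prefer to register the connectors programmatically instead of through the console, the Athena CreateDataCatalog API can do the same thing. The following is a minimal Python (boto3) sketch; the Lambda function ARN and Region are placeholders for the connector function that the CloudFormation template created in your account.

import boto3

athena = boto3.client("athena")

# Register the MySQL connector Lambda function as an Athena data source named "mysql".
# The function ARN below is a placeholder; use the ARN of the connector created by the template.
athena.create_data_catalog(
    Name="mysql",
    Type="LAMBDA",
    Description="MySQL federated data source",
    Parameters={"function": "arn:aws:lambda:us-east-1:111122223333:function:mysql-connector"},
)

# Repeat for the dynamo and redis connectors, then confirm what is registered.
print([c["CatalogName"] for c in athena.list_data_catalogs()["DataCatalogsSummary"]])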

Analyze data with Athena

With our data sources configured, we are ready to start running queries and using federated views in a data mesh architecture. Let’s start by trying to find out how much profit was made on a given line of parts, broken out by supplier nation and year.

For such a query, we need to calculate, for each nation and year, the profit for parts ordered in each year that were filled by a supplier in each nation. Profit is defined as the sum of [(l_extendedprice*(1-l_discount)) - (ps_supplycost * l_quantity)] for all line items describing parts in the specified line.

Answering this question requires querying all four data sources—MySQL, DynamoDB, Redis, and Amazon S3—and is accomplished with the following SQL:

SELECT
    n_name nation,
    year(CAST(o_orderdate AS date)) as o_year,
    ((l_extendedprice * (1 - l_discount)) - (CAST(ps_supplycost AS double) * l_quantity)) as amount
FROM
    awsdatacatalog.data_lake.lineitem,
    dynamo.default.part,
    dynamo.default.partsupp,
    mysql.sales.supplier,
    mysql.sales.orders,
    redis.redis.nation
WHERE
    ((s_suppkey = l_suppkey)
    AND (ps_suppkey = l_suppkey)
    AND (ps_partkey = l_partkey)
    AND (p_partkey = l_partkey)
    AND (o_orderkey = l_orderkey)
    AND (s_nationkey = CAST(Regexp_extract(_key_, '.*-(.*)', 1) AS int)))

Running this query on the Athena console produces the following result.

Result of above query

This query is fairly complex: it involves multiple joins and requires special knowledge of the correct way to calculate profit metrics that other end-users may not possess.

To simplify the analysis experience for those users, we can hide this complexity behind a view. For more information on using views with federated data sources, see Querying federated views.

Use the following query to create the view in the data_lake database under the AwsDataCatalog data source:

CREATE OR REPLACE VIEW "data_lake"."federated_view" AS
SELECT
    n_name nation,
    year(CAST(o_orderdate AS date)) as o_year,
    ((l_extendedprice * (1 - l_discount)) - (CAST(ps_supplycost AS double) * l_quantity)) as amount
FROM
    awsdatacatalog.data_lake.lineitem,
    dynamo.default.part,
    dynamo.default.partsupp,
    mysql.sales.supplier,
    mysql.sales.orders,
    redis.redis.nation
WHERE
    ((s_suppkey = l_suppkey)
    AND (ps_suppkey = l_suppkey)
    AND (ps_partkey = l_partkey)
    AND (p_partkey = l_partkey)
    AND (o_orderkey = l_orderkey)
    AND (s_nationkey = CAST(Regexp_extract(_key_, '.*-(.*)', 1) AS int)))

Next, run a simple SELECT query to validate that the view was created successfully:

SELECT * FROM federated_view LIMIT 10

The result should be similar to our previous query.

With our view in place, we can perform new analyses to answer questions that would be challenging without the view due to the complex query syntax that would be required. For example, we can find the total profit by nation:

SELECT nation, sum(amount) AS total
from federated_view
GROUP BY nation 
ORDER BY nation ASC

Your results should resemble the following screenshot.

Result of above query

As you can see, the federated view makes it simpler for end-users to run queries on this data. Users are free to query a view of the data, defined by a knowledgeable data producer, rather than having to first acquire expertise in each underlying data source. Because Athena federated queries are processed where the data is stored, we avoid duplicating data from the source system, which saves time and cost.
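You can also run the same query against the view programmatically, which is useful for scheduled reporting. The following Python (boto3) sketch assumes a workgroup on Athena engine version 3 and a hypothetical results bucket; adjust both to your environment.

import time
import boto3

athena = boto3.client("athena")

query = "SELECT nation, sum(amount) AS total FROM federated_view GROUP BY nation ORDER BY nation ASC"

# The output location is a placeholder; Athena writes query results there.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Catalog": "AwsDataCatalog", "Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/federated-views/"},
)

query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # skip the header row
        print([col.get("VarCharValue") for col in row["Data"]])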

Use federated views in a multi-user model

So far, we have satisfied one of the principles of a data mesh: we created a data product (federated view) that is decoupled from its originating source and is available for on-demand analysis by consumers.

Next, we take our data mesh a step further by using federated views in a multi-user model. To keep it simple, assume we have one producer account, the account we used to create our four data sources and federated view, and one consumer account. Using the producer account, we give the consumer account permission to query the federated view from the consumer account.

The following figure depicts this setup and our simplified data mesh architecture.

Multi-user model setup

Follow these steps to share the connectors and AWS Glue Data Catalog resources from the producer, which includes our federated view, with the consumer account:

  1. Share the data sources mysql, redis, dynamo, and data_lake with the consumer account. For instructions, refer to Sharing a data source in Account A with Account B. Note that Account A represents the producer and Account B represents the consumer. Make sure you use the same data source names from earlier when sharing data. This is necessary for the federated view to work in a cross-account model.
  2. Next, share the producer account’s AWS Glue Data Catalog with the consumer account by following the steps in Cross-account access to AWS Glue data catalogs. For the data source name, use shared_federated_catalog.
  3. Switch to the consumer account, navigate to the Athena console, and verify that you see federated_view listed under Views in the shared_federated_catalog Data Catalog and data_lake database.
  4. Next, run a sample query on the shared view to see the query results.

Result of sample query

Clean up

To clean up the resources created for this post, complete the following steps:

  1. On the Amazon S3 console, empty the bucket athena-federation-workshop-<account-id>.
  2. If you’re using the AWS CLI, delete the objects in the athena-federation-workshop-<account-id> bucket with the following code. Make sure you run this command on the correct bucket.
    aws s3 rm s3://athena-federation-workshop-<account-id> --recursive
  3. On the AWS CloudFormation console or the AWS CLI, delete the stack athena-federated-view-blog.

Summary

In this post, we demonstrated the functionality of Athena federated views. We created a view spanning four different federated data sources and ran queries against it. We also saw how federated views could be extended to a multi-user data mesh and ran queries from a consumer account.

To take advantage of federated views, ensure you are using Athena engine version 3 and upgrade your data source connectors to the latest version available. For information on how to upgrade a connector, see Updating a data source connector.


About the Authors

Saurabh Bhutyani is a Principal Big Data Specialist Solutions Architect at AWS. He is passionate about new technologies. He joined AWS in 2019 and works with customers to provide architectural guidance for running scalable analytics solutions and data mesh architectures using AWS analytics services like Amazon EMR, Amazon Athena, AWS Glue, AWS Lake Formation, and Amazon DataZone.

Pathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Use AWS Glue DataBrew recipes in your AWS Glue Studio visual ETL jobs

Post Syndicated from Gonzalo Herreros original https://aws.amazon.com/blogs/big-data/use-aws-glue-databrew-recipes-in-your-aws-glue-studio-visual-etl-jobs/

AWS Glue Studio is now integrated with AWS Glue DataBrew. AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. DataBrew is a visual data preparation tool that enables you to clean and normalize data without writing any code. The more than 200 transformations it provides are now available for use in AWS Glue Studio visual jobs.

In DataBrew, a recipe is a set of data transformation steps that you can author interactively in its intuitive visual interface. In this post, you’ll see how to build a recipe in DataBrew and then apply it as part of an AWS Glue Studio visual ETL job.

Existing DataBrew users will also benefit from this integration—you can now run your recipes as part of a larger visual workflow with all the other components AWS Glue Studio provides, in addition to being able to use advanced job configuration and the latest AWS Glue engine version.

This integration brings distinct benefits to the existing users of both tools:

  • You have a centralized view in AWS Glue Studio of the overall ETL diagram, end to end
  • You can interactively define a recipe, seeing values, statistics, and distribution on the DataBrew console, then reuse that tested and versioned processing logic in AWS Glue Studio visual jobs
  • You can orchestrate multiple DataBrew recipes in an AWS Glue ETL job or even multiple jobs using AWS Glue workflows
  • DataBrew recipes can now use AWS Glue job features such as bookmarks for incremental data processing, automatic retries, auto scale, or grouping small files for greater efficiency

Solution overview

In our fictitious use case, the requirement is to clean up a synthetic medical claims dataset created for this post, which has some data quality issues introduced on purpose to demonstrate the DataBrew capabilities on data preparation. Then the claims data is ingested into the catalog (so it’s visible to analysts), after enriching it with some relevant details about the corresponding medical providers coming from a separate source.

The solution consists of an AWS Glue Studio visual job that reads two CSV files with claims and providers, respectively. The job applies a recipe to the first dataset to address the quality issues, selects columns from the second one, joins both datasets, and finally stores the result on Amazon Simple Storage Service (Amazon S3), creating a table in the catalog so the output data can be used by other tools like Amazon Athena.

Create a DataBrew recipe

Start by registering the data store for the claims file. This will allow you to build the recipe in its interactive editor using the actual data so you can evaluate the result of the transformations as you define them.

  1. Download the claims CSV file using the following link: alabama_claims_data_Jun2023.csv.
  2. On the DataBrew console, choose Datasets in the navigation pane, then choose Connect new dataset.
  3. Choose the option File upload.
  4. For Dataset name, enter Alabama claims.
  5. For Select a file to upload, choose the file you just downloaded on your computer.
    Add dataset
  6. For Enter S3 destination, enter or browse to a bucket in your account and Region.
  7. Leave the rest of the options by default (CSV separated with comma and with header) and complete the dataset creation.
  8. Choose Project in the navigation pane, then choose Create project.
  9. For Project name, name it ClaimsCleanup.
  10. Under Recipe details, for Attached recipe, choose Create new recipe, name it ClaimsCleanup-recipe, and choose the Alabama claims dataset you just created.Add project
  11. Select a role suitable for DataBrew or create a new one, and complete the project creation.

This will create a session using a configurable subset of the data. After the session has initialized, you will notice that some of the cells have invalid or missing values.

Loaded project

In addition to the missing values in the columns Diagnosis Code, Claim Amount, and Claim Date, some values in the data have some extra characters: Diagnosis Code values are sometimes prefixed with “code ” (space included), and Procedure Code values are sometimes followed by single quotes.
Claim Amount values will likely be used for calculations, so they should be converted to a numeric type, and Claim Date should be converted to a date type.

Now that we identified the data quality issues to address, we need to decide how to deal with each case.
There are multiple ways you can add recipe steps, including using the column context menu, the toolbar on the top, or from the recipe summary. Using the last method, you can search for the indicated step type to replicate the recipe created in this post.

Add step searchbox

Claim Amount is essential for this use case, so the decision is to remove rows with a missing amount.

  1. Add the step Remove missing values.
  2. For Source column, choose Claim Amount.
  3. Leave the default action Delete rows with missing values and choose Apply to save it.
    Preview missing values

The view is now updated to reflect the step application and the rows with missing amounts are no longer there.

Diagnosis Code can be empty so this is accepted, but in the case of Claim Date, we want to have a reasonable estimation. The rows in the data are sorted in chronological order, so you can impute missing dates using the previous valid value from the preceding rows. Assuming every day has claims, the largest error would be assigning a claim to the previous day if the first claim of a day were the one missing the date; for illustration purposes, let’s consider that potential error acceptable.

First, convert the column from string to date type.

  1. Add the step Change type.
  2. Choose Claim Date as the column and date as the type, then choose Apply.
    Change type to date
  3. Now to do the imputation of missing dates, add the step Fill or impute missing values.
  4. Select Fill with last valid value as the action and choose Claim Date as the source.
  5. Choose Preview changes to validate it, then choose Apply to save the step.
    Preview imputation

So far, your recipe should have three steps, as shown in the following screenshot.

Steps so far

  1. Next, add the step Remove quotation marks.
  2. Choose the Procedure Code column and select Leading and trailing quotation marks.
  3. Preview to verify it has the desired effect and apply the new step.
    Preview remove quotes
  4. Add the step Remove special characters.
  5. Choose the Claim Amount column and to be more specific, select Custom special characters and enter $ for Enter custom special characters.
    Preview remove dollar sign
  6. Add a Change type step on the column Claim Amount and choose double as the type.
    Change type to double
  7. As the last step, to remove the superfluous “code ” prefix, add a Replace value or pattern step.
  8. Choose the column Diagnosis Code, and for Enter custom value, enter code (with a space at the end).
    Preview remove code

Now that you have addressed all data quality issues identified on the sample, publish the project as a recipe.

  1. Choose Publish in the Recipe pane, enter an optional description, and complete the publication.
    Recipe steps

Each time you publish, it will create a different version of the recipe. Later, you will be able to choose which version of the recipe to use.
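As a quick illustration, you can list the published versions of a recipe with the DataBrew API. The following Python (boto3) sketch assumes the recipe name used earlier in this post.

import boto3

databrew = boto3.client("databrew")

# List every published version of the recipe created above.
response = databrew.list_recipe_versions(Name="ClaimsCleanup-recipe")
for recipe in response["Recipes"]:
    print(recipe["RecipeVersion"], recipe.get("Description", ""))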

Create a visual ETL job in AWS Glue Studio

Next, you create the job that uses the recipe. Complete the following steps:

  1. On the AWS Glue Studio console, choose Visual ETL in the navigation pane.
  2. Choose Visual with a blank canvas and create the visual job.
  3. At the top of the job, replace “Untitled job” with a name of your choice.
  4. On the Job Details tab, specify a role that the job will use.
    This needs to be an AWS Identity and Access Management (IAM) role suitable for AWS Glue with permissions to Amazon S3 and the AWS Glue Data Catalog. Note that the role used earlier for DataBrew is not usable to run jobs here, so it won’t be listed on the IAM Role drop-down menu.
    Job details
    If you used only DataBrew jobs before, notice that in AWS Glue Studio, you can choose performance and cost settings, including worker size, auto scaling, and Flexible Execution, as well as use the latest AWS Glue 4.0 runtime and benefit from the significant performance improvements it brings. For this job, you can use the default settings, but reduce the requested number of workers in the interest of frugality. For this example, two workers will do.
  5. On the Visual tab, add an S3 source and name it Providers.
  6. For S3 URL, enter s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv.
    S3 Source
  7. Select the format as CSV and choose Infer schema.
    Now the schema is listed on the Output schema tab using the file header.
    Input schema

In this use case, the decision is that not all columns in the providers dataset are needed, so we can discard the rest.

  1. With the Providers node selected, add a Drop Fields transform (if you didn’t select the parent node, it won’t have one; in that case, assign the node parent manually).
  2. Select all the fields after Provider Zip Code.
    Drop fields

Later, this data will be joined with the claims for the state of Alabama using the provider ID; however, that second dataset doesn’t have the state specified. We can use our knowledge of the data to optimize the join by filtering for only the data we really need.

  1. Add a Filter transform as a child of Drop Fields.
  2. Name it Alabama providers and add a condition that the state must match AL.
    Filter providers
  3. Add the second source (a new S3 source) and name it Alabama claims.
  4. To enter the S3 URL, open DataBrew in a separate browser tab, choose Datasets in the navigation pane, and copy the location shown in the table for Alabama claims (copy the text starting with s3://, not the associated http link). Back in the visual job, paste it as the S3 URL; if it is correct, you will see the data fields listed on the Output schema tab.
  5. Select CSV format and infer the schema like you did with the other source.
  6. As a child of this source, search in the Add nodes menu for recipe and choose Data Preparation Recipe.
    Add recipe
  7. In this new node’s properties, give it the name Claim cleanup recipe and choose the recipe and version you published before.
  8. You can review the recipe steps here and use the link to DataBrew to make changes if needed.
    Recipe details
  9. Add a Join node and select both Alabama providers and Claim cleanup recipe as parents.
  10. Add a join condition equaling the provider ID from both sources.
  11. As the last step, add an S3 node as a target (note the first one listed when you search is the source; make sure you select the version that is listed as the target).
  12. In the node configuration, leave the default format JSON and enter an S3 URL on which the job role has permission to write.

In addition, make the data output available as a table in the catalog.

  1. In the Data Catalog update options section, select the second option Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions, then select a database on which you have permission to create tables.
  2. Assign alabama_claims as the name and choose Claim Date as the partition key (this is for illustration purposes; a tiny table like this doesn’t really need partitions if further data won’t be added later).
    Join
  3. Now you can save and run the job.
  4. On the Runs tab, you can keep track of the process and see detailed job metrics using the job ID link.

The job should take a few minutes to complete.
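If you prefer to start and monitor the run programmatically rather than from the Runs tab, you can use the AWS Glue API. The following Python (boto3) sketch uses a hypothetical job name; replace it with the name you gave the visual job.

import boto3

glue = boto3.client("glue")

# "claims-cleanup-job" is a placeholder for the name of your visual job.
run = glue.start_job_run(JobName="claims-cleanup-job")
status = glue.get_job_run(JobName="claims-cleanup-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # for example RUNNING, then SUCCEEDED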

  1. When the job is complete, navigate to the Athena console.
  2. Search for the table alabama_claims in the database you selected and, using the context menu, choose Preview Table, which will run a simple SELECT * SQL statement on the table.

Athena results

You can see in the result of the job that the data was cleaned by the DataBrew recipe and enriched by the AWS Glue Studio join.

Apache Spark is the engine that runs the jobs created in AWS Glue Studio. Using the Spark UI with the event logs the job produces, you can view insights about the job plan and run, which can help you understand how your job is performing and identify potential performance bottlenecks. For instance, for this job on a large dataset, you could use it to compare the impact of explicitly filtering the provider state before doing the join, or identify whether you can benefit from adding an Autobalance transform to improve parallelism.

By default, the job will store the Apache Spark event logs under the path s3://aws-glue-assets-<your account id>-<your region name>/sparkHistoryLogs/. To view the logs, you have to set up a Spark history server using one of the available methods.

SparkUI

Clean up

If you no longer need this solution, you can delete the files generated on Amazon S3, the table created by the job, the DataBrew recipe, and the AWS Glue job.

Conclusion

In this post, we showed how you can use AWS Glue DataBrew to build a recipe using its interactive editor and then use the published recipe as part of an AWS Glue Studio visual ETL job. We included examples of common tasks that are required when preparing data and ingesting it into AWS Glue Data Catalog tables.

This example used a single recipe in the visual job, but it’s possible to use multiple recipes at different stages of the ETL process, as well as to reuse the same recipe in multiple jobs.

These AWS Glue solutions allow you to effectively create advanced ETL pipelines that are straightforward to build and maintain, all without writing any code. You can start creating solutions that combine both tools today.


About the authors

Mikhail Smirnov is a Sr. Software Dev Engineer on the AWS Glue team and part of the AWS Glue DataBrew development team. Outside of work, his interests include learning to play guitar and traveling with his family.

Gonzalo Herreros is a Sr. Big Data Architect on the AWS Glue team. Based in Dublin, Ireland, he helps customers succeed with big data solutions based on AWS Glue. In his spare time, he enjoys board games and cycling.

Content Repository for Unstructured Data with Multilingual Semantic Search: Part 2

Post Syndicated from Patrik Nagel original https://aws.amazon.com/blogs/architecture/content-repository-for-unstructured-data-with-multilingual-semantic-search-part-2/

Leveraging vast unstructured data poses challenges, particularly for global businesses needing cross-language data search. In Part 1 of this blog series, we built the architectural foundation for the content repository. The key component of Part 1 was the dynamic access control-based logic with a web UI to upload documents.

In Part 2, we extend the content repository with multilingual semantic search capabilities while maintaining the access control logic from Part 1. This allows users to ingest documents into the content repository in multiple languages and then run search queries that return references to semantically similar documents.

Solution overview

Building on the architectural foundation from Part 1, we introduce four new building blocks to extend the search functionality.

Optical character recognition (OCR) workflow: To automatically identify, understand, and extract text from ingested documents, we use Amazon Textract and a sample review dataset of .png format documents (Figure 1). We use Amazon Textract synchronous application programming interfaces (APIs) to capture key-value pairs for the reviewid and reviewBody attributes. Based on your specific requirements, you can choose to capture either the complete extracted text or parts of the text.

Figure 1. Sample document for ingestion
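To make the OCR step concrete, the following Python (boto3) sketch shows one way the document transformation function could call the synchronous Amazon Textract API and pull out form key-value pairs. The bucket and object names are placeholders, and the parsing is simplified compared to the solution’s actual Lambda code.

import boto3

textract = boto3.client("textract", region_name="us-east-1")  # Region is an assumption

# Synchronous call; the bucket and key are placeholders for an ingested review document.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-source-bucket", "Name": "reviews/review-0001.png"}},
    FeatureTypes=["FORMS"],
)

blocks = {b["Id"]: b for b in response["Blocks"]}

def block_text(block):
    # Concatenate the WORD children of a block.
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

# Collect KEY blocks and resolve their VALUE blocks into key-value pairs.
pairs = {}
for block in blocks.values():
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        key_text = block_text(block)
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    pairs[key_text] = block_text(blocks[value_id])

print({k: v for k, v in pairs.items() if k in ("reviewid", "reviewBody")})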

Embedding generation: To capture the semantic relationship between the text, we use a machine learning (ML) model that maps words and sentences to high-dimensional vector embeddings. You can use Amazon SageMaker, a fully managed ML service, to build, train, and deploy your ML models to production-ready hosted environments. You can also deploy ready-to-use pre-trained models from multiple avenues such as SageMaker JumpStart. For this blog post, we use the open-source pre-trained universal-sentence-encoder-multilingual model from TensorFlow Hub. The model inference endpoint deployed to a SageMaker endpoint generates embeddings for the document text and the search query. Figure 2 shows an example of the n-dimensional vector that is generated when the reviewBody attribute text is provided to the embeddings model.

Figure 2. Sample embedding representation of the value of reviewBody
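The following Python (boto3) sketch illustrates how an embedding could be requested from the SageMaker endpoint. The endpoint name is hypothetical (in this solution, the real name is stored in an AWS Systems Manager parameter), and the request and response shapes assume the standard TensorFlow Serving JSON format used by the TensorFlow inference container.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; the solution stores the real name in the sagemaker-endpoint SSM parameter.
endpoint_name = "tensorflow-inference-content-repo"

payload = {"instances": ["Die Lieferung kam schnell und das Produkt funktioniert gut."]}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

embedding = json.loads(response["Body"].read())["predictions"][0]
print(len(embedding))  # dimensionality of the sentence embedding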

Embedding ingestion: To make the embeddings searchable for the content repository users, you can use the k-Nearest Neighbor (k-NN) search feature of Amazon OpenSearch Service. The OpenSearch k-NN plugin provides different methods. For this blog post, we use the Approximate k-NN search approach, based on the Hierarchical Navigable Small World (HNSW) algorithm. HNSW uses a hierarchical set of proximity graphs in multiple layers to improve performance when searching large datasets to find the “nearest neighbors” for the search query text embeddings.
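As an illustration of what the embedding ingestion target could look like, the following Python sketch (using the opensearch-py client) creates a k-NN enabled index with an HNSW vector field on the Lucene engine. The index name, field names, domain endpoint, credentials, and the 512 dimension are assumptions; the dimension must match the output size of your encoder model.

from opensearchpy import OpenSearch  # pip install opensearch-py

# Hypothetical domain endpoint and credentials; in the solution these are provisioned by the CDK stack.
client = OpenSearch(
    hosts=[{"host": "search-content-repo-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "example-password"),
    use_ssl=True,
)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "reviewid": {"type": "keyword"},
            "reviewBody": {"type": "text"},
            "department": {"type": "keyword"},  # used later for access control filtering
            "reviewBody_embeddings": {
                "type": "knn_vector",
                "dimension": 512,  # must match the encoder output size
                "method": {"name": "hnsw", "engine": "lucene", "space_type": "cosinesimil"},
            },
        }
    },
}

client.indices.create(index="documents", body=index_body)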

Semantic search: We make the search service accessible as additional backend logic on Amazon API Gateway. Authenticated content repository users send their search query through the frontend and receive the matching documents. The solution maintains end-to-end access control by using the department attribute from the user’s enriched Amazon Cognito provided identity (ID) token claim and comparing it with the corresponding attribute on the ingested documents.

Technical architecture

The technical architecture includes two parts:

  1. Implementing multilingual semantic search functionality: Describes the processing workflow for the document that the user uploads; makes the document searchable.
  2. Running input search query: Covers the search workflow for the input query; finds and returns the nearest neighbors of the input text query to the user.

Part 1. Implementing multilingual semantic search functionality

Our previous blog post discussed blocks A through D (Figure 3), including user authentication, ID token enrichment, Amazon Simple Storage Service (Amazon S3) object tags for dynamic access control, and document upload to the source S3 bucket. In the following section, we cover blocks E through H. The overall workflow describes how an unstructured document is ingested into the content repository, run through the backend OCR and embedding generation process, and finally how the resulting vector embeddings are stored in OpenSearch Service.

Figure 3. Technical architecture for implementing multilingual semantic search functionality

  1. The OCR workflow extracts text from your uploaded documents.
    • The source S3 bucket sends an event notification to Amazon Simple Queue Service (Amazon SQS).
    • The document transformation AWS Lambda function subscribed to the Amazon SQS queue invokes an Amazon Textract API call to extract the text.
  2. The document transformation Lambda function makes an inference request to the encoder model hosted on SageMaker. In this example, the Lambda function submits the reviewBody attribute to the encoder model to generate the embedding.
  3. The document transformation Lambda function writes an output file in the transformed S3 bucket. The text file consists of:
    • The reviewid and reviewBody attributes extracted from Step 1
    • An additional reviewBody_embeddings attribute from Step 2
      Note: The workflow tags the output file with the same S3 object tags as the source document for downstream access control.
  4. The transformed S3 bucket sends an event notification to invoke the indexing Lambda function.
  5. The indexing Lambda function reads the text file content. Then the indexing Lambda function makes an OpenSearch index API call that includes the source document tag as one of the indexing attributes for access control.

Part 2. Running user-initiated search query

Next, we describe how the user’s request produces query results (Figure 4).

Figure 4. Search query lifecycle

  1. The user enters a search string in the web UI to retrieve relevant documents.
  2. Based on the active sign-in session, the UI passes the user’s ID token to the search endpoint of the API Gateway.
  3. The API Gateway uses Amazon Cognito integration to authorize the search API request.
  4. Once validated, the search API endpoint request invokes the search document Lambda function.
  5. The search document function sends the search query string as the inference request to the encoder model to receive the embedding as the inference response.
  6. The search document function uses the embedding response to build an OpenSearch k-NN search query (a sample query body is sketched after this list). The HNSW algorithm is configured with the Lucene engine and its filter option to maintain the access control logic based on the custom department claim from the user’s ID token. For the query embedding, the OpenSearch query returns the following:
    • The top three approximate k-NN matches
    • Other attributes, such as reviewid and reviewBody
  7. The workflow sends the relevant query result attributes back to the UI.
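The following Python sketch shows what the filtered k-NN query body built by the search document function could look like. The index name, field names, and department value are assumptions carried over from the index sketch earlier in this post, and query_embedding is the vector returned by the encoder for the search string.

def build_knn_query(query_embedding, department, k=3):
    # Hypothetical helper: approximate k-NN with a Lucene engine filter on the department attribute.
    return {
        "size": k,
        "_source": ["reviewid", "reviewBody"],
        "query": {
            "knn": {
                "reviewBody_embeddings": {
                    "vector": query_embedding,
                    "k": k,
                    "filter": {"term": {"department": department}},
                }
            }
        },
    }

# Using the opensearch-py client from the earlier sketch:
# results = client.search(index="documents", body=build_knn_query(query_embedding, "sales"))
# for hit in results["hits"]["hits"]:
#     print(hit["_score"], hit["_source"]["reviewid"])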

Prerequisites

You must have the following prerequisites for this solution:

Walkthrough

Setup

The following steps deploy two AWS CDK stacks into your AWS account:

  • content-repo-search-stack (blog-content-repo-search-stack.ts) creates the environment detailed in Figure 3, except for the SageMaker endpoint, which you create in a separate step.
  • demo-data-stack (userpool-demo-data-stack.ts) deploys sample users, groups, and role mappings.

To continue setup, use the following commands:

  1. Clone the project Git repository:
    git clone https://github.com/aws-samples/content-repository-with-multilingual-search content-repository
  2. Install the necessary dependencies:
    cd content-repository/backend-cdk 
    npm install
  3. Configure environment variables:
    export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
    export CDK_DEFAULT_REGION=$(aws configure get region)
  4. Bootstrap your account for AWS CDK usage:
    cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$CDK_DEFAULT_REGION
  5. Deploy the code to your AWS account:
    cdk deploy --all

The complete stack set-up may take up to 20 minutes.

Creation of SageMaker endpoint

Follow these steps to create the SageMaker endpoint in the same AWS Region where you deployed the AWS CDK stack.

    1. Sign in to the SageMaker console.
    2. In the navigation menu, select Notebook, then Notebook instances.
    3. Choose Create notebook instance.
    4. Under the Notebook instance settings, enter content-repo-notebook as the notebook instance name, and leave other defaults as-is.
    5. Under the Permissions and encryption section (Figure 5), you need to set the IAM role section to the role with the prefix content-repo-search-stack. In case you don’t see this role automatically populated, select it from the drop-down. Leave the rest of the defaults, and choose Create notebook instance.

      Figure 5. Notebook permissions

    6. The notebook instance status shows Pending at first; it becomes available for use within 3-4 minutes.
    7. Once the notebook is in the Available status, choose Open Jupyter.
    8. Choose the Upload button and upload the create-sagemaker-endpoint.ipynb file in the backend-cdk folder of the root of the blog repository.
    9. Open the create-sagemaker-endpoint.ipynb notebook. Select the option Run All from the Cell menu (Figure 6). This might take up to 10 minutes.

      Figure 6. Run create-sagemaker-endpoint notebook cells

    10. After all the cells have successfully run, verify that the AWS Systems Manager parameter sagemaker-endpoint is updated with the value of the SageMaker endpoint name. An example value, as the output of the cell, is shown in Figure 7. If you don’t see the output, check whether the preceding steps ran correctly.

      Figure 7. SSM parameter updated with SageMaker endpoint

    11. Verify in the SageMaker console that the inference endpoint with the prefix tensorflow-inference has been deployed and is set to status InService.
    12. Upload sample data to the content repository:
      • Update the S3_BUCKET_NAME variable in the upload_documents_to_S3.sh script in the root folder of the blog repository with the s3SourceBucketName from the AWS CDK output of the content-repo-search-stack.
      • Run the upload_documents_to_S3.sh script to upload 150 sample documents to the content repository. This takes 5-6 minutes. During this process, each uploaded document triggers the workflow described in Implementing multilingual semantic search functionality.

Using the search service

At this stage, you have deployed all the building blocks for the content repository in your AWS account. Next, as part of uploading the sample data to the content repository, you pushed a limited corpus of 150 sample documents (.png format). Each document is in one of four languages: English, German, Spanish, or French. With the added multilingual search capability, you can query in one language and receive semantically similar results across different languages while maintaining the access control logic.

  1. Access the frontend application:
    • Copy the amplifyHostedAppUrl value of the AWS CDK output from the content-repo-search-stack shown in the terminal.
    • Enter the URL in your web browser to access the frontend application.
    • A temporary page displays until the automated build and deployment of the React application completes after 4-5 minutes.
  2. Sign into the application:
    • The content repository provides two demo users with credentials as part of the demo-data-stack in the AWS CDK output. Copy the password from the terminal associated with the sales-user, which belongs to the sales department.
    • Follow the prompts from the React webpage to sign in with the sales-user and change the temporary password.
  3. Enter search queries and verify results. The search action invokes the workflow described in Running input search query. For example:
    • Enter works well as the search query. Note the multilingual output and the semantically similar results (Figure 8).

        Figure 8. Positive sentiment multilingual search result for the sales-user

    • Enter bad quality as the search query. Note the multilingual output and the semantically similar results (Figure 9).

      Figure 9. Negative sentiment multi-lingual search result for the sales-user

  4. Sign out as the sales-user with the Log Out button on the webpage.
  5. Sign in using the marketing-user credentials to verify access control:
    • Follow the sign in procedure in step 2 but with the marketing-user.
    • This time, with works well as the search query, you get different output. This is because the access control only allows the marketing-user to search for documents that belong to the marketing department (Figure 10).

      Figure 10. Positive sentiment multilingual search result for the marketing-user

Cleanup

In the backend-cdk subdirectory of the cloned repository, delete the deployed resources: cdk destroy --all.

Additionally, you need to access the Amazon SageMaker console to delete the SageMaker endpoint and notebook instance created as part of the Walkthrough setup section.

Conclusion

In this blog, we enriched the content repository with multilingual semantic search features while maintaining the access control fundamentals that we implemented in Part 1. The building blocks of semantic search for unstructured documents—Amazon Textract, Amazon SageMaker, and Amazon OpenSearch Service—set a foundation for you to customize and enhance the search capabilities for your specific use case. For example, you can leverage the fast developments in large language models (LLMs) to enhance the semantic search experience. You can replace the encoder model with an LLM capable of generating multilingual embeddings while still using the OpenSearch Service to store and index data and perform vector search.

Migrating your secrets to AWS Secrets Manager, Part 2: Implementation

Post Syndicated from Adesh Gairola original https://aws.amazon.com/blogs/security/migrating-your-secrets-to-aws-secrets-manager-part-2-implementation/

In Part 1 of this series, we provided guidance on how to discover and classify secrets and design a migration solution for customers who plan to migrate secrets to AWS Secrets Manager. We also mentioned steps that you can take to enable preventative and detective controls for Secrets Manager. In this post, we discuss how teams should approach the next phase, which is implementing the migration of secrets to Secrets Manager. We also provide a sample solution to demonstrate migration.

Implement secrets migration

Application teams lead the effort to design the migration strategy for their application secrets. Once you’ve made the decision to migrate your secrets to Secrets Manager, there are two potential options for migration implementation. One option is to move the application to AWS in its current state and then modify the application source code to retrieve secrets from Secrets Manager. Another option is to update the on-premises application to use Secrets Manager for retrieving secrets. You can use features such as AWS Identity and Access Management (IAM) Roles Anywhere to make the application communicate with Secrets Manager even before the migration, which can simplify the migration phase.

If the application code contains hardcoded secrets, the code should be updated so that it references Secrets Manager. A good interim state would be to pass these secrets as environment variables to your application. Using environment variables helps in decoupling the secrets retrieval logic from the application code and allows for a smooth cutover and rollback (if required).
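As a minimal sketch of this interim state, the following Python code prefers Secrets Manager and falls back to environment variables, which keeps the retrieval logic decoupled from the application code. The secret name and variable names are placeholders.

import json
import os
import boto3

def get_database_credentials(secret_id="app/prod/db-credentials"):
    """Prefer Secrets Manager; fall back to environment variables during the interim state."""
    try:
        client = boto3.client("secretsmanager")
        secret = client.get_secret_value(SecretId=secret_id)
        return json.loads(secret["SecretString"])
    except Exception:
        # Interim fallback: credentials injected as environment variables at deploy time
        return {"username": os.environ["DB_USERNAME"], "password": os.environ["DB_PASSWORD"]}

credentials = get_database_credentials()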

Cutover to Secrets Manager should be done in a maintenance window. This minimizes downtime and impacts to production.

Before you perform the cutover procedure, verify the following:

  • Application components can access Secrets Manager APIs. Based on your environment, this connectivity might be provisioned through interface virtual private cloud (VPC) endpoints or over the internet.
  • Secrets exist in Secrets Manager and have the correct tags. This is important if you are using attribute-based access control (ABAC).
  • Applications that integrate with Secrets Manager have the required IAM permissions.
  • Have a well-documented cutover and rollback plan that contains the changes that will be made to the application during cutover. These would include steps like updating the code to use environment variables and updating the application to use IAM roles or instance profiles (for apps that are being migrated to Amazon Elastic Compute Cloud (Amazon EC2)).

After the cutover, verify that Secrets Manager integration was successful. You can use AWS CloudTrail to confirm that application components are using Secrets Manager.
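For example, the following Python (boto3) sketch looks up recent GetSecretValue events in CloudTrail to confirm that the application is now reading from Secrets Manager; the 24-hour window is arbitrary.

from datetime import datetime, timedelta
import boto3

cloudtrail = boto3.client("cloudtrail")

# Look for recent GetSecretValue calls recorded by CloudTrail.
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "GetSecretValue"}],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
)

for event in events["Events"]:
    print(event["EventTime"], event.get("Username"), [r.get("ResourceName") for r in event.get("Resources", [])])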

We recommend that you further optimize your integration by enabling automatic secrets rotation. If your secrets were previously widely accessible (for example, they were stored in your Git repositories), we recommend rotating them as soon as possible after migrating.
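If you want to configure rotation through the API rather than the console, the following Python (boto3) sketch enables a 30-day rotation schedule. The secret name and rotation Lambda ARN are placeholders; the rotation function must already exist (for example, one created from an AWS-provided rotation template).

import boto3

secretsmanager = boto3.client("secretsmanager")

# Placeholders: point these at your secret and your rotation Lambda function.
secretsmanager.rotate_secret(
    SecretId="app/prod/db-credentials",
    RotationLambdaARN="arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRotationFunction",
    RotationRules={"AutomaticallyAfterDays": 30},
)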

Sample application to demo integration with Secrets Manager

In the next sections, we present a sample AWS Cloud Development Kit (AWS CDK) solution that demonstrates the implementation of the previously discussed guardrails, design, and migration strategy. You can use the sample solution as a starting point and expand upon it. It includes components that environment teams may deploy to help provide secure access for application teams as they migrate their secrets to Secrets Manager. The solution uses ABAC, a tagging scheme, and IAM Roles Anywhere to demonstrate regulated access to secrets for application teams. Additionally, the solution contains client-side utilities to assist application and migration teams in updating secrets. Teams with on-premises applications that are seeking integration with Secrets Manager before migration can use the client-side utility for access through IAM Roles Anywhere.
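To give a feel for how ABAC restricts access, here is an illustrative policy document expressed as a Python dict. It is a sketch only; the tag keys app and environment are assumptions, and the actual managed policy in the sample repository may use different keys and actions.

# Illustrative ABAC policy: access is allowed only when the secret's tags match the calling principal's tags.
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue", "secretsmanager:PutSecretValue"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/app": "${aws:PrincipalTag/app}",
                    "secretsmanager:ResourceTag/environment": "${aws:PrincipalTag/environment}",
                }
            },
        }
    ],
}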

The sample solution is hosted on the aws-secrets-manager-abac-authorization-samples GitHub repository and is made up of the following components:

  • A common environment infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • A sample VPC created with Amazon Virtual Private Cloud (Amazon VPC), with PUBLIC, PRIVATE_WITH_NAT, and PRIVATE_ISOLATED subnet types.
    • VPC endpoints for the AWS Key Management Service (AWS KMS) and Secrets Manager services to the sample VPC. The use of VPC endpoints means that calls to AWS KMS and Secrets Manager are not made over the internet and remain internal to the AWS backbone network.
    • An empty shell secret, tagged with the supplied attributes and an IAM managed policy that uses attribute-based access control conditions. This means that the secret is managed in code, but the actual secret value is not visible in version control systems like GitHub or in AWS CloudFormation parameter inputs. 
  • An IAM Roles Anywhere infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • An AWS Certificate Manager Private Certificate Authority (AWS Private CA).
    • An IAM Roles Anywhere public key infrastructure (PKI) trust anchor that uses AWS Private CA.
    • An IAM role for the on-premises application that uses the common environment infrastructure stack.
    • An IAM Roles Anywhere profile.

    Note: You can choose to use your existing CAs as trust anchors. If you do not have a CA, the stack described here provisions a PKI for you. IAM Roles Anywhere allows migration teams to use Secrets Manager before the application is moved to the cloud. Post migration, you could consider updating the applications to use native IAM integration (like instance profiles for EC2 instances) and revoking IAM Roles Anywhere credentials.

  • A client-side utility (primarily used by application or migration teams). This is a shell script that does the following:
    • Assists in provisioning a certificate by using OpenSSL.
    • Uses aws_signing_helper (Credential Helper) to set up AWS CLI profiles by using the credential_process for IAM Roles Anywhere.
    • Assists application teams to access and update their application secrets after assuming an IAM role by using IAM Roles Anywhere.
  • A sample application stack (created and owned by the application/migration team). This is a sample serverless application that demonstrates the use of the solution. It deploys the following components, which indicate that your ABAC-based IAM strategy is working as expected and is effectively restricting access to secrets:
    • The sample application stack uses a VPC-deployed common environment infrastructure stack.
    • It deploys an Amazon Aurora MySQL serverless cluster in the PRIVATE_ISOLATED subnet and uses the secret that is created through a common environment infrastructure stack.
    • It deploys a sample Lambda function in the PRIVATE_WITH_NAT subnet.
    • It deploys two IAM roles for testing:
      • allowedRole (default role): When the application uses this role, it is able to use the GET action to get the secret and open a connection to the Aurora MySQL database.
      • Not allowedRole: When the application uses this role, it is unable to use the GET action to get the secret and open a connection to the Aurora MySQL database.

Prerequisites to deploy the sample solution

The following software packages need to be installed in your development environment before you deploy this solution:

Note: In this section, we provide examples of AWS CLI commands and configuration for Linux or macOS operating systems. For instructions on using AWS CLI on Windows, refer to the AWS CLI documentation.

Before deployment, make sure that the correct AWS credentials are configured in your terminal session. The credentials can be either in the environment variables or in ~/.aws. For more details, see Configuring the AWS CLI.

Next, use the following commands to set your AWS credentials to deploy the stack:

export AWS_ACCESS_KEY_ID=<>
export AWS_SECRET_ACCESS_KEY=<>
export AWS_REGION=<>

You can view the IAM credentials that are being used by your session by running the command aws sts get-caller-identity. If you are running the cdk command for the first time in your AWS account, you will need to run the following cdk bootstrap command to provision a CDK Toolkit stack that will manage the resources necessary to enable deployment of cloud applications with the AWS CDK.

cdk bootstrap aws://<AWS account number>/<Region> # Bootstrap CDK in the specified account and AWS Region

Select the applicable archetype and deploy the solution

This section outlines the design and deployment steps for two archetypes:

Archetype 1: Application is currently on premises

Archetype 1 has the following requirements:

  • The application is currently hosted on premises.
  • The application would consume API keys, stored credentials, and other secrets in Secrets Manager.

The application, environment and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack (as described earlier in this post) to bootstrap the AWS account with secrets and IAM policy by using the supplied tagging requirement.
  2. Additionally, the environment engineer deploys the IAM Roles Anywhere infrastructure stack.
  3. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  4. The application developer uses the client-side utility to update the AWS CLI profile to consume the IAM Roles Anywhere role from the on-premises servers.

    Figure 1 shows the workflow for Archetype 1.

    Figure 1: Application on premises connecting to Secrets Manager

To deploy Archetype 1

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Do not modify the tag/attributes name/key, only modify value.

  2. (Actions by the environment team persona) Run the following command to deploy the common environment infrastructure stack.
    ./helper.sh prepare
    Then, run the following command to deploy the IAM Roles Anywhere infrastructure stack.
    ./helper.sh on-prem
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, by using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it’s still using the dummy value.

    Then, run the following command to set up the client and server on premises.
    ./helper.sh client-profile-setup

    Follow the command prompt. It will help you request a client certificate and update the AWS CLI profile.

    Important: When you request a client certificate, make sure to supply at least one distinguished name, like CommonName.

The sample output should look like the following.


--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile developer

At this point, the client-side utility (helper.sh client-profile-setup) should have updated the AWS CLI configuration file with the following profile.

[profile developer]
region = <aws-region>
credential_process = /Users/<local-laptop-user>/.aws/aws_signing_helper credential-process
    --certificate /Users/<local-laptop-user>/.aws/client_cert.pem
    --private-key /Users/<local-laptop-user>/.aws/my_private_key.clear.key
    --trust-anchor-arn arn:aws:rolesanywhere:<aws-region>:444455556666:trust-anchor/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
    --profile-arn arn:aws:rolesanywhere:<aws-region>:444455556666:profile/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222
    --role-arn arn:aws:iam::444455556666:role/RolesanywhereabacStack-onPremAppRole-1234567890ABC

To test Archetype 1 deployment

  • The application team can verify that the AWS CLI profile has been properly set up and is capable of retrieving secrets from Secrets Manager by running the following client-side utility command.
    ./helper.sh on-prem-test

This client-side utility (helper.sh) command verifies that the AWS CLI profile (for example, developer) has been set up for IAM Roles Anywhere and can run the GetSecretValue API action to retrieve the value of the secret stored in Secrets Manager.

The sample output should look like the following.

--> Checking credentials ...
{
    "UserId": "AKIAIOSFODNN7EXAMPLE:EXAMPLE11111EXAMPLEEXAMPLE111111",
    "Account": "444455556666",
    "Arn": "arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC"
}
--> Assume role worked for:
arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC
--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile $PROFILE_NAME
-------Output-------
{
  "password": "randomuniquepassword",
  "servertype": "testserver1",
  "username": "testuser1"
}
-------Output-------

Archetype 2: Application has migrated to AWS

Archetype 2 has the following requirement:

  • Deploy a sample application to demonstrate how ABAC authorization works for Secrets Manager APIs.

The application, environment, and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack to bootstrap the AWS account with secrets and an IAM policy by using the supplied tagging requirement.
  2. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  3. The application developer tests the sample application to confirm operability of ABAC.

Figure 2 shows the workflow for Archetype 2.

Figure 2: Sample migrated application connecting to Secrets Manager

To deploy Archetype 2

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Don’t modify the tag/attributes name/key, only modify value.

  2. (Actions by the environment team persona) Run the following command to deploy the common platform infrastructure stack.
    ./helper.sh prepare
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it is still using the dummy value.

    Then, run the following command to deploy a sample app stack.
    ./helper.sh on-aws

    Note: If your secrets were migrated from a system that did not have the correct access controls, as a best security practice, you should rotate them at least once manually.

At this point, the client-side utility should have deployed a sample application Lambda function. This function connects to a MySQL database by using credentials stored in Secrets Manager. It retrieves the secret values, validates them, and establishes a connection to the database. The function returns a message that indicates whether the connection to the database is working or not.

To test Archetype 2 deployment

  • The application team can use the following client-side utility (helper.sh) to invoke the Lambda function and verify whether the connection is functional or not.
    ./helper.sh on-aws-test

The sample output should look like the following.

--> Check if AWS CLI is installed
--> AWS CLI found
--> Using tags to create Lambda function name and invoking a test
--> Checking the Lambda invoke response.....
--> The status code is 200
--> Reading response from test function:
"Connection to the DB is working."
--> Response shows database connection is working from Lambda function using secret.

Conclusion

Building an effective secrets management solution requires careful planning and implementation. AWS Secrets Manager can help you effectively manage the lifecycle of your secrets at scale. We encourage you to take an iterative approach to building your secrets management solution, starting by focusing on core functional requirements like managing access, defining audit requirements, and building preventative and detective controls for secrets management. In future iterations, you can improve your solution by implementing more advanced functionalities like automatic rotation or resource policies for secrets.

To read Part 1 of this series, go to Migrating your secrets to AWS, Part I: Discovery and design.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Adesh Gairola

Adesh Gairola is a Senior Security Consultant at Amazon Web Services in Sydney, Australia. Adesh is eager to help customers build robust defenses, and design and implement security solutions that enable business transformations. He is always looking for new ways to help customers improve their security posture.

Eric Swamy

Eric is a Senior Security Consultant working in the Professional Services team in Sydney, Australia. He is passionate about helping customers build the confidence and technical capability to move their most sensitive workloads to cloud. When not at work, he loves to spend time with his family and friends outdoors, listen to music, and go on long walks.