
Use Amazon Inspector to manage your build and deploy pipelines for containerized applications

2022-11-03 Scott Ward

Post Syndicated from Scott Ward original https://aws.amazon.com/blogs/security/use-amazon-inspector-to-manage-your-build-and-deploy-pipelines-for-containerized-applications/

Amazon Inspector is an automated vulnerability management service that continually scans Amazon Web Services (AWS) workloads for software vulnerabilities and unintended network exposure. Amazon Inspector currently supports vulnerability reporting for Amazon Elastic Compute Cloud (Amazon EC2) instances and container images stored in Amazon Elastic Container Registry (Amazon ECR).

With the emergence of Docker in 2013, container technology has quickly moved from the experimentation phase into a viable production tool. Many customers are using containers to modernize their existing applications or as the foundations for new applications or services that they build. In this blog post, we’ll explore the process that Amazon Inspector takes to scan container images. We’ll also show how you can integrate Amazon Inspector into your containerized application build and deployment pipeline, and control pipeline steps based on the results of an Amazon Inspector container image scan.

Solution overview and walkthrough

The solution outlined in this post covers a deployment pipeline modeled in AWS CodePipeline. The source for the pipeline is AWS CodeCommit, and the build of the container image is performed by AWS CodeBuild. The solution uses a collection of AWS Lambda functions and an Amazon DynamoDB table to evaluate the container image status and make an automated decision about deploying the container image. Finally, the pipeline has a deploy stage that will deploy the container image into an Amazon Elastic Container Service (Amazon ECS) cluster. In this section, I’ll outline the key components of the solution and how they work. In the following section, Deploy the solution, I’ll walk you through how to actually implement the solution.

Although this solution uses AWS continuous integration and continuous delivery (CI/CD) services such as CodePipeline and CodeBuild, you can also build similar capabilities by using third-party CI/CD solutions. Instead of CodeCommit, you can use another source for the pipeline, such as a third-party code repository like GitHub or an Amazon Simple Storage Service (Amazon S3) bucket.

Solution architecture

Figure 1 shows the high-level architecture of the solution, which integrates Amazon Inspector into a container build and deploy pipeline.

Figure 1: Overall container build and deploy architecture

The high-level workflow is as follows:

You commit the image definition to a CodeCommit repository.
An Amazon EventBridge rule detects the repository commit and initiates the container pipeline.
The source stage of the pipeline pulls the image definition and build instructions from the CodeCommit repository.
The build stage of the pipeline creates the container image and stores the final image in Amazon ECR.
The ContainerVulnerabilityAssessment stage sends out a request for approval by using an Amazon Simple Notification Service (Amazon SNS) topic. A Lambda function associated with the topic stores the details about the container image and the active pipeline, which will be needed in order to send a response back to the pipeline stage.
Amazon Inspector scans the Amazon ECR image for vulnerabilities.
The Lambda function receives the Amazon Inspector scan summary message, through EventBridge, and makes a decision on allowing the image to be deployed. The function retrieves the pipeline approval details so that the approve or reject message is sent to the correct active pipeline stage.
The Lambda function submits an Approved or Rejected status to the deployment pipeline.
CodePipeline deploys the container image to an Amazon ECS cluster and completes the pipeline successfully if an approval is received. The pipeline status is set to Failed if the image is rejected.

Container image build stage

Let’s now review the build stage of the pipeline that is associated with the Amazon Inspector container solution. When a new commit is made to the CodeCommit repository, an EventBridge rule, which is configured to look for updates to the CodeCommit repository, initiates the CodePipeline source action. The source action then collects files from the source repository and makes them available to the rest of the pipeline stages. The pipeline then moves to the build stage.

In the build stage, CodeBuild extracts the Dockerfile that holds the container definition and the buildspec.yaml file that contains the overall build instructions. CodeBuild creates the final container image and then pushes the container image to the designated Amazon ECR repository. As part of the build, the image digest of the container image is stored as a variable in the build stage so that it can be used by later stages in the pipeline. Additionally, the build process writes the container image URI, and the name of the Amazon ECS task that the container should be associated with, to a file named imagedefinitions.json. This file is stored as an artifact of the build and will be referenced during the deploy phase of the pipeline.
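As a rough illustration of the imagedefinitions.json file described above, the following Python sketch builds the JSON array that the Amazon ECS deploy action reads, with one object per container that holds a name and an image URI. The function name and the example values are hypothetical; in the solution itself, the file is written by the build instructions in buildspec.yaml.

```python
import json


def build_image_definitions(container_name: str, image_uri: str) -> str:
    """Return the JSON text for an imagedefinitions.json file."""
    # The file is a JSON array; each entry names a container and the
    # image URI that the deploy stage should use for that container.
    return json.dumps([{"name": container_name, "imageUri": image_uri}])


# Hypothetical values; substitute the names used in your own pipeline.
example = build_image_definitions(
    "example-container",
    "111122223333.dkr.ecr.us-east-2.amazonaws.com/example-repo:latest",
)
```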

Now that the image is stored in an Amazon ECR repository, Amazon Inspector scanning begins to check the image for vulnerabilities.

The details of the build stage are shown in Figure 2.

Figure 2: The container build stage

Container image approval stage

After the build stage is completed, the ContainerVulnerabilityAssessment stage begins. This stage is lightweight and consists of one stage action that is focused on waiting for an Approved or Rejected message for the container image that was created in the build stage. The ContainerVulnerabilityAssessment stage is configured to send an approval request message to an SNS topic. As part of the approval request message, the container image digest, from the build stage, will be included in the comments section of the message. The image digest is needed so that approval for the correct container image can be submitted later. Figure 3 shows the comments section of the approval action where the container image digest is referenced.

Figure 3: Container image digest reference in approval action configuration

The SNS topic that the pipeline approval message is sent to is configured to invoke a Lambda function. The purpose of this Lambda function is to pull key details from the SNS message. Details retrieved from the SNS message include the pipeline name and stage, stage approval token, and the container image digest. The pipeline name, stage, and approval token are needed so that an approved or rejected response can be sent to the correct pipeline. The container image digest is the unique identifier for the container image and is needed so that it can be associated with the correct active pipeline. This information is stored in a DynamoDB table so that it can be referenced later when the step that assesses the result of an Amazon Inspector scan submits an approved or rejected decision for the container image. Figure 4 illustrates the flow from the approval stage through storing the pipeline approval data in DynamoDB.

Figure 4: Flow to capture container image approval details
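The flow above can be sketched in Python as follows. This is a minimal sketch, not the function that the solution deploys: the function names and the DynamoDB item layout are hypothetical, and it assumes that the approval action's comments (delivered as customData in the approval notification) contain only the image digest.

```python
import json


def extract_approval_details(sns_event: dict) -> dict:
    """Pull the pipeline approval details out of an SNS-invoked Lambda event."""
    # SNS delivers the CodePipeline approval notification as a JSON string.
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    approval = message["approval"]
    return {
        # Assumption: the comments field holds only the image digest.
        "imageDigest": approval["customData"].strip(),
        "pipelineName": approval["pipelineName"],
        "stageName": approval["stageName"],
        "actionName": approval["actionName"],
        "token": approval["token"],
    }


def store_approval_details(table, details: dict) -> None:
    """Save the details; `table` is expected to be a boto3 DynamoDB Table resource."""
    table.put_item(Item=details)
```

In a Lambda handler, you would pass the incoming event to extract_approval_details and then pass the result, together with a table resource such as one obtained from boto3.resource("dynamodb").Table(...), to store_approval_details.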

This approval action will remain in a pending status until it receives an Approved or Rejected message or the timeout limit of seven days is reached. The seven-day timeout for approvals is the default for CodePipeline and cannot be changed. If no response is received in seven days, the stage and pipeline will complete with a Failed status.

Amazon Inspector and container scanning

When the container image is pushed to Amazon ECR, Amazon Inspector scans it for vulnerabilities.

In order to show how you can use the findings from an Amazon Inspector container scan in a build and deploy pipeline, let’s first review the workflow that occurs when Amazon Inspector scans a container image located in Amazon ECR.

Figure 5: Image push, scan, and notification workflow

The workflow diagram in Figure 5 outlines the steps that happen after an image is pushed to Amazon ECR all the way to messaging that the image has been successfully scanned and what the final scan results are. The steps in this workflow are as follows:

The final container image is pushed to Amazon ECR by an individual or as part of a build.
Amazon ECR sends a message indicating that a new image has been pushed.
The message about the new image is received by Amazon Inspector.
Amazon Inspector pulls a copy of the container image from Amazon ECR and performs a vulnerability scan.
When Amazon Inspector is done scanning the image, a message summarizing the severity of vulnerabilities that were identified during the container image scan is sent to Amazon EventBridge. You can create EventBridge rules that match the vulnerability summary message to route the message onto a target for notifications or to enable further action to be taken.

Here’s a sample EventBridge pattern that matches the scan summary message from Amazon Inspector.

{
  "detail-type": ["Inspector2 Scan"],
  "source": ["aws.inspector2"]
}

This entire workflow, from ingesting the initial image to sending out the status on the Amazon Inspector scan, is fully managed. You just focus on how you want to use the Amazon Inspector scan status message to govern the approval and deployment of your container image.

The following is a sample of what the Amazon Inspector vulnerability summary message looks like. Note the container image Amazon Resource Name (ARN), image repository ARN, message detail type, image digest, and the vulnerability summary.

{
    "version": "0",
    "id": "bf67fc08-f522-f598-6946-8e7b372ba426",
    "detail-type": "Inspector2 Scan",
    "source": "aws.inspector2",
    "account": "<account id>",
    "time": "2022-05-25T16:08:17Z",
    "region": "us-east-2",
    "resources":
    [
        "arn:aws:ecr:us-east-2:<account id>:repository/vuln-images/vulhub/rsync"
    ],
    "detail":
    {
        "scan-status": "INITIAL_SCAN_COMPLETE",
        "repository-name": "arn:aws:ecr:us-east-2:<account id>:repository/vuln-images/vulhub/rsync",
        "finding-severity-counts": { "CRITICAL": 3, "HIGH": 16, "MEDIUM": 4, "TOTAL": 24 },
        "image-digest": "sha256:21ae0e3b7b7xxxx",
        "image-tags":
        [
            "latest"
        ]
    }
}
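To make the structure of this message concrete, here is a small Python sketch that pulls out the fields a deployment decision would typically need. The function name and the keys of the returned dictionary are hypothetical; the event keys are the ones shown in the sample message above.

```python
def summarize_scan_event(event: dict) -> dict:
    """Extract the decision-relevant fields from an Inspector2 Scan event."""
    detail = event["detail"]
    return {
        "scan_status": detail["scan-status"],
        "image_digest": detail["image-digest"],
        "repository": detail["repository-name"],
        # Copy the counts so later changes don't mutate the original event.
        "severity_counts": dict(detail["finding-severity-counts"]),
    }
```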

Processing Amazon Inspector scan results

After Amazon Inspector sends out the scan status event, a Lambda function receives and processes that event. This function needs to consume the Amazon Inspector scan status message and make a decision about whether the image can be deployed.

The eval_container_scan_results Lambda function serves two purposes: The first is to extract the findings from the Amazon Inspector scan message that invoked the Lambda function. The second is to evaluate the findings based on thresholds that are defined as parameters in the Lambda function definition. Based on the threshold evaluation, the container image will be flagged as either Approved or Rejected. Figure 6 shows examples of thresholds that are defined for different Amazon Inspector vulnerability severities, as part of the Lambda function.

Figure 6: Vulnerability thresholds defined in Lambda environment variables
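The threshold evaluation can be sketched along the following lines. This is an illustration under stated assumptions rather than the deployed function: the environment variable names are hypothetical, and the sketch assumes that an image is rejected when a severity count is greater than its threshold.

```python
import os


def thresholds_from_env(environ=os.environ) -> dict:
    """Read per-severity thresholds from (hypothetical) environment variables."""
    return {
        "CRITICAL": int(environ.get("CRITICAL_THRESHOLD", "0")),
        "HIGH": int(environ.get("HIGH_THRESHOLD", "0")),
        "MEDIUM": int(environ.get("MEDIUM_THRESHOLD", "0")),
    }


def evaluate_thresholds(severity_counts: dict, thresholds: dict) -> tuple:
    """Return (status, reasons), where status is "Approved" or "Rejected"."""
    reasons = []
    for severity, limit in thresholds.items():
        found = severity_counts.get(severity, 0)
        # Assumption: a count above the threshold rejects the image.
        if found > limit:
            reasons.append(f"{severity}: {found} found, {limit} allowed")
    status = "Rejected" if reasons else "Approved"
    return status, reasons
```

Returning the reasons alongside the status makes it possible to include them in the summary that is sent back to the pipeline approval action.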

Based on the container image vulnerability results, the Lambda function determines whether the image should be approved or rejected for deployment. The function will retrieve the details about the current pipeline that the image is associated with from the DynamoDB table that was populated by the image approval action in the pipeline. After the details about the pipeline are retrieved, an Approved or Rejected message is sent to the pipeline approval action. If the status is Approved, the pipeline continues to the deploy stage, which will deploy the container image into the defined environment for that pipeline stage. If the status is Rejected, the pipeline status is set to Failed and the pipeline will end.

Figure 7 highlights the key steps that occur within the Lambda function that evaluates the Amazon Inspector scan status message.

Figure 7: Amazon Inspector scan results decision
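The final step, sending the decision back to the pipeline, maps to the CodePipeline PutApprovalResult API. The following sketch shows one way to call it from Python. The function name and the layout of the details dictionary are hypothetical, and the client is passed in as an argument (for example, boto3.client("codepipeline")) so that the sketch stays independent of any particular setup.

```python
def submit_decision(codepipeline, details: dict, status: str, summary: str) -> None:
    """Send an Approved or Rejected result to the waiting approval action."""
    if status not in ("Approved", "Rejected"):
        raise ValueError(f"Unsupported approval status: {status}")
    codepipeline.put_approval_result(
        pipelineName=details["pipelineName"],
        stageName=details["stageName"],
        actionName=details["actionName"],
        result={"summary": summary, "status": status},
        # The token ties the result to the specific pending approval request.
        token=details["token"],
    )
```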

Image deployment stage

If the container image is approved, the final image is deployed to an Amazon ECS cluster. The deploy stage of the pipeline is configured with Amazon ECS as the action provider. The deploy action contains the name of the Amazon ECS cluster and service that the container image should be deployed to. The image definition file (imagedefinitions.json) that was created in the build stage is also listed in the deploy configuration. When the deploy stage runs, it will create a revision to the existing Amazon ECS task definition. This task definition contains the name of the Amazon ECR image that has been approved for deployment. The task definition is then deployed to the Amazon ECS cluster and service.

Deploy the solution

Now that you have an understanding of how the container pipeline solution works, you can deploy the solution to your own AWS account. This section will walk you through the steps to deploy the container approval pipeline, and show you how to verify that each of the key steps is working.

Step 1: Activate Amazon Inspector in your AWS account

The sample solution provided by this blog post requires that you activate Amazon Inspector in your AWS account. If this service is not activated in your account, learn more about the free trial and pricing for this service, and follow the steps in Getting started with Amazon Inspector to set up the service and start monitoring your account.

Step 2: Deploy the AWS CloudFormation template

For this next step, make sure you deploy the template within the AWS account and AWS Region where you want to test this solution.

To deploy the CloudFormation stack

Choose the following Launch Stack button to launch a CloudFormation stack in your account. Use the AWS Management Console navigation bar to choose the region you want to deploy the stack in.
Review the stack name and the parameters for the template. The parameters are pre-populated with the necessary values, and there is no need to change them.
Scroll to the bottom of the Quick create stack screen and select the checkbox next to I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack. The deployment of this CloudFormation stack will take 3–5 minutes.

After the CloudFormation stack has deployed successfully, you can proceed to reviewing and interacting with the deployed solution.

Step 3: Review the container pipeline and supporting resources

The CloudFormation stack is designed to deploy a collection of resources that will be used for an initial container build. When the CodePipeline resource is created, it will automatically pull the assets from the CodeCommit repository and start the pipeline for the container image.

To review the pipeline and resources

In the CodePipeline console, navigate to the Region that the stack was deployed in.
Choose the pipeline named ContainerBuildDeployPipeline to show the full pipeline details.
Review the Source and Build stages, which will show a status of Succeeded.
Review the ContainerVulnerabilityAssessment stage, which will show as failed with a Rejected status in the Manual Approval step.
Figure 8 shows the full completed pipeline.

Figure 8: Rejected container pipeline
Choose the Details link in the Manual Approval stage to reveal the reasons for the rejection. An example review summary is shown in Figure 9.

Figure 9: Container pipeline approval rejection

Review findings in Amazon Inspector (Optional)

You can use the Amazon Inspector console to see the full findings detail for this container image, if needed.

To view the findings in Amazon Inspector

In the Amazon Inspector console, under Findings, choose By repository.
From the list of repositories, choose the inspector-blog-images repository.
Choose the Image tag link to bring up a list of the individual vulnerabilities that were found within the container image. Figure 10 shows an example of the vulnerabilities list in the findings details.

Figure 10: Container image findings in Amazon Inspector

Step 4: Adjust the Amazon ECS desired count for the cluster service

Up to this point, you’ve deployed a pipeline to build and validate the container image, and you’ve seen an example of how the pipeline handles a container image that did not meet the defined vulnerability thresholds. Now you’ll deploy a new container image that will pass a vulnerability assessment and complete the pipeline.

The Amazon ECS service that the CloudFormation template deploys is initially created with the number of desired tasks set to 0. In order to allow the container pipeline to successfully deploy a container, you need to update the desired tasks value.

To adjust the task count in Amazon ECS (console)

In the Amazon ECS console, choose the link for the cluster, in this case InspectorBlogCluster.
On the Services tab, choose the link for the service named InspectorBlogService.
Choose the Update button. On the Configure service page, set Number of tasks to 1.
Choose Skip to review, and then choose Update Service.

To adjust the task count in Amazon ECS (AWS CLI)

Alternatively, you can run the following AWS CLI command to update the desired task count to 1. In order to run this command, you need the ARN of the Amazon ECS cluster, which you can retrieve from the Outputs tab of the CloudFormation stack that you created. You can run this command from the command line of an environment of your choosing, or by using AWS CloudShell. Make sure to replace <Cluster ARN> with your own value.

$ aws ecs update-service --cluster <Cluster ARN> --service InspectorBlogService --desired-count 1

Step 5: Build and deploy a new container image

Deploying a new container image will involve pushing an updated Dockerfile to the ContainerComponentsRepo repository in CodeCommit. With CodeCommit you can interact by using standard Git commands from a command line prompt, and there are multiple approaches that you can take to connect to the AWS CodeCommit repository from the command line. For this post, in order to simplify the interactions with CodeCommit, you will be shown how to add an updated file directly through the CodeCommit console.

To add an updated Dockerfile to CodeCommit

In the CodeCommit console, choose the repository named ContainerComponentsRepo.
In the screen listing the repository files, choose the Dockerfile file link and choose Edit.
In the Edit a file form, overwrite the existing file contents with the following instruction:
FROM public.ecr.aws/amazonlinux/amazonlinux:latest
In the Commit changes to main section, fill in the following fields.
1. Author name: your name
2. Email address: your email
3. Commit message: ‘Updated Dockerfile’
Figure 11 shows what the completed form should look like.

Figure 11: Complete CodeCommit entry for an updated Dockerfile
Choose Commit changes to save the new Dockerfile.

This update to the Dockerfile will immediately invoke a new instance of the container pipeline, where the updated container image will be pulled and evaluated by Amazon Inspector.

Step 6: Verify the container image approval and deployment

With a new pipeline initiated through the push of the updated Dockerfile, you can now review the overall pipeline to see that the container image was approved and deployed.

To see the full details in CodePipeline

In the CodePipeline console, choose the container-build-deploy pipeline. You should see the container pipeline in an active status. In about five minutes, you should see the ContainerVulnerabilityAssessment stage move to completed with an Approved status, and the deploy stage should show a Succeeded status.
To confirm that the final image was deployed to the Amazon ECS cluster, from the Deploy stage, choose Details. This will open a new browser tab for the Amazon ECS service.
In the Amazon ECS console, choose the Tasks tab. You should see a task with Last status showing RUNNING. This is confirmation that the image was successfully approved and deployed through the container pipeline. Figure 12 shows where the task definition and status are located.

Figure 12: Task status after deploying the container image
Choose the task definition to bring up the latest task definition revision, which was created by the deploy stage of the container pipeline.
Scroll down in the task definition screen to the Container definitions section. Note that the task is tied to the image you deployed, providing further verification that the approved container image was successfully deployed. Figure 13 shows where the container definition can be found and what you should expect to see.

Figure 13: Container associated with revised task definition

Clean up the solution

When you’re finished deploying and testing the solution, use the following steps to remove the solution stack from your account.

To delete images from the Amazon ECR repository

In the Amazon ECR console, navigate to the AWS account and Region where you deployed the solution.
Choose the link for the repository named inspector-blog-images.
Delete all of the images that are listed in the repository.

To delete objects in the CodePipeline artifact bucket

In the Amazon S3 console in your AWS account, locate the bucket whose name starts with blog-base-setup-codepipelineartifactstorebucket.
Delete the ContainerBuildDeploy folder that is in the bucket.

To delete the CloudFormation stack

In the CloudFormation console, delete the CloudFormation stack that was created to perform the steps in this post.

Conclusion

This post describes a solution that allows you to build your container images, have the images scanned for vulnerabilities by Amazon Inspector, and use the output from Amazon Inspector to determine whether the image should be allowed to be deployed into your environments.

This solution represents a pipeline with very simple build and deploy stages. Your pipeline will vary and may consist of multiple test stages and deployment stages for multiple environments. Additionally, the logic you use to determine whether a container image should be deployed may be different. The contents of this blog post are intended to help serve as a foundation that you can build on as you decide how to use Amazon Inspector for container vulnerability scanning. Feel free to use this guidance, and the example we provided, to extend the solution into your specific deployment pipeline.

If you have questions, contact AWS Support, or start a new thread on the AWS re:Post Amazon Inspector Forum. If you have feedback about this post, submit comments in the Comments section below.


Export historical Security Hub findings to an S3 bucket to enable complex analytics

2022-11-01 Jonathan Nguyen

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/export-historical-security-hub-findings-to-an-s3-bucket-to-enable-complex-analytics/

AWS Security Hub is a cloud security posture management service that you can use to perform security best practice checks, aggregate alerts, and automate remediation. Security Hub has out-of-the-box integrations with many AWS services and over 60 partner products. Security Hub centralizes findings across your AWS accounts and supported AWS Regions into a single delegated administrator account in your aggregation Region of choice, creating a single pane of glass to consolidate and view individual security findings.

Because there are a large number of possible integrations across accounts and Regions, your delegated administrator account in the aggregation Region might have hundreds of thousands of Security Hub findings. To perform complex analytics or machine learning across the existing (historical) findings that are maintained in Security Hub, you can export findings to an Amazon Simple Storage Service (Amazon S3) bucket. To export new findings that have recently been created, you can implement the solution in the aws-security-hub-findings-export GitHub repository. However, Security Hub has data export API rate quotas, which can make exporting a large number of findings challenging.

In this blog post, we provide an example solution to export your historical Security Hub findings to an S3 bucket in your account, even if you have a large number of findings. We walk you through the components of the solution and show you how to use the solution after deployment.

Prerequisites

To deploy the solution, complete the following prerequisites:

Enable Security Hub.
If you want to export Security Hub findings for multiple accounts, designate a Security Hub administrator account.
If you want to export Security Hub findings across multiple Regions, enable cross-Region aggregation.

Solution overview and architecture

In this solution, you use the following AWS services and features:

Security Hub export orchestration
- AWS Step Functions helps you orchestrate automation and long-running jobs, which are integral to this solution. You need the ability to run a workflow for hours due to the Security Hub API rate limits and number of findings and objects.
- AWS Lambda functions handle the logic for exporting and storing findings in an efficient and cost-effective manner. You can customize Lambda functions to most use cases.
Storage of exported findings
- Amazon Simple Storage Service (Amazon S3) helps you share the exported findings and use them in a standardized format for multiple use cases across AWS services.
Job status tracking
- Amazon EventBridge tracks changes in the status of the Step Functions workflow. The solution can run for over 100 hours; by using EventBridge, you don’t have to manually check the status.
- Amazon Simple Notification Service (Amazon SNS) sends you notifications when the long-running jobs are complete or when they might have issues.
- AWS Systems Manager Parameter Store provides a quick way to track overall status by maintaining a numeric count of successfully exported findings that you can compare with the number of findings shown in the Security Hub dashboard.

Figure 1 shows the architecture for the solution, deployed in the Security Hub delegated administrator account in the aggregation Region. The figure shows multiple Security Hub member accounts to illustrate how you can export findings for an entire AWS Organizations organization from a single delegated administrator account.

Figure 1: High-level overview of process and resources deployed in the Security Hub account

As shown in Figure 1, the workflow after deployment is as follows:

The Step Functions workflow for the Security Hub export is invoked.
The Step Functions workflow invokes a single Lambda function that does the following:
1. Retrieves Security Hub findings that have an Active status and puts them in a temporary file.
2. Pushes the file as an object to Amazon S3.
3. Adds the global count of exported findings from the Step Functions workflow to a Systems Manager parameter for validation and tracking purposes.
4. Repeats the preceding steps for about 10 minutes to get the most findings while preventing the Lambda function from timing out.
5. If a nextToken is present, pushes the nextToken to the output of the Step Functions.
  
  Note: If the number of items in the output is smaller than the number of items returned by the API call, then the return output includes a nextToken, which can be passed to a subsequent command to retrieve the next set of items.
The Step Functions workflow goes through a Choice state as follows:
- If a Security Hub nextToken is present, Step Functions invokes the Lambda function again.
- If a Security Hub nextToken isn’t present, Step Functions ends the workflow successfully.
An EventBridge rule tracks changes in the status of the Step Functions workflow and sends events to an SNS topic. Subscribers to the SNS topic receive a notification when the status of the Step Functions workflow changes.
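The paging behavior of the Lambda function described in this workflow can be sketched as follows. This is a simplified illustration, not the code in the repository: the function and parameter names are hypothetical, the client is passed in (for example, boto3.client("securityhub")), and writing each page to Amazon S3 and updating the Systems Manager parameter are left to the handle_page callback.

```python
import time


def export_batch(securityhub, finding_filter, handle_page, next_token=None,
                 budget_seconds=600, clock=time.monotonic):
    """Page through findings until the time budget is spent or no token remains."""
    deadline = clock() + budget_seconds
    exported = 0
    while True:
        kwargs = {"Filters": finding_filter, "MaxResults": 100}
        if next_token:
            kwargs["NextToken"] = next_token
        response = securityhub.get_findings(**kwargs)
        # The caller decides what to do with each page (for example, write it to S3).
        handle_page(response["Findings"])
        exported += len(response["Findings"])
        next_token = response.get("NextToken")
        if not next_token or clock() >= deadline:
            # A non-empty nextToken tells the workflow to invoke the function again.
            return {"exported": exported, "nextToken": next_token}
```

Returning the nextToken in the function output is what lets the Choice state decide whether to invoke the function again or end the workflow.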

Deploy the solution

You can deploy the solution through either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

To deploy the solution (console)

In your delegated administrator Security Hub account, launch the AWS CloudFormation template by choosing the following Launch Stack button. It will take about 10 minutes for the CloudFormation stack to complete.

Note: The stack will launch in the US East (N. Virginia) Region (us-east-1). If you are using cross-Region aggregation, deploy the solution into the Region where Security Hub findings are consolidated. You can download the CloudFormation template for the solution, modify it, and deploy it to your selected Region.

To deploy the solution (AWS CDK)

Download the code from our aws-security-hub-findings-historical-export GitHub repository, where you can also contribute to the sample code. The CDK initializes your environment and uploads the Lambda assets to Amazon S3. Then, you deploy the solution to your account.
While you are authenticated in the security tooling account, run the following commands in your terminal. Make sure to replace <AWS_ACCOUNT> with the account number, and replace <REGION> with the AWS Region where you want to deploy the solution.
cdk bootstrap aws://<AWS_ACCOUNT>/<REGION>
cdk deploy SechubHistoricalPullStack

Solution walkthrough and validation

Now that you’ve successfully deployed the solution, you can see each aspect of the automation workflow in action.

Before you start the workflow, you need to subscribe to the SNS topic so that you’re notified of status changes within the Step Functions workflow. For this example, you will use email notification.

To subscribe to the SNS topic

Open the Amazon SNS console.
Go to Topics and choose the Security_Hub_Export_Status topic.
Choose Create subscription.
For Protocol, choose Email.
For Endpoint, enter the email address where you want to receive notifications.
Choose Create subscription.
After you create the subscription, go to your email and confirm the subscription.

You’re now subscribed to the SNS topic, so any time that the Step Functions status changes, you will receive a notification. Let’s walk through how to run the export solution.

To run the export solution

Open the AWS Step Functions console.
In the left navigation pane, choose State machines.
Choose the new state machine named sec_hub_finding_export.
Choose Start execution.
On the Start execution page, for Name – optional and Input – optional, leave the default values and then choose Start execution.

Figure 2: Example input values for execution of the Step Functions workflow
This will start the Step Functions workflow and redirect you to the Graph view. If successful, you will see that the overall Execution status and each step have a status of Succeeded.
For long-running jobs, you can view the CloudWatch log group associated with the Lambda function to view the logs.
To track the number of Security Hub findings that have been exported, open the Systems Manager console, choose Parameter Store, and then select the /sechubexport/findingcount parameter. Under Value, you will see the total number of Security Hub findings that have been exported, as shown in Figure 3.

Figure 3: Systems Manager Parameter Store value for the number of Security Hub findings exported

Depending on the number of Security Hub findings, this process can take some time. This is primarily due to the GetFindings quota of 3 requests per second. Each GetFindings request can return a maximum of 100 findings, so this means that you can get up to 300 findings per second. On average, the solution can export about 1 million findings per hour. If you have a large number of findings, you can start the finding export process and wait for the SNS topic to notify you when the process is complete.
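The arithmetic behind these numbers is easy to check. The following short Python sketch uses only the figures quoted above; the function names are illustrative.

```python
REQUESTS_PER_SECOND = 3      # GetFindings quota quoted above
FINDINGS_PER_REQUEST = 100   # maximum findings returned per request


def max_findings_per_hour() -> int:
    """Upper bound on findings retrieved per hour at the quoted quota."""
    return REQUESTS_PER_SECOND * FINDINGS_PER_REQUEST * 3600


def estimated_hours(total_findings: int, findings_per_hour: int = 1_000_000) -> float:
    """Rough run time, using the average rate of about 1 million findings per hour."""
    return total_findings / findings_per_hour
```

The upper bound works out to 1,080,000 findings per hour, which is consistent with the observed average of about 1 million per hour. At that average rate, 5 million findings would take roughly 5 hours to export.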

How to customize the solution

The solution provides a general framework to help you export your historical Security Hub findings. There are many ways that you can customize this solution based on your needs. The following are some enhancements that you can consider.

Change the Security Hub finding filter

The solution currently pulls all findings with RecordState: ACTIVE, which returns the active Security Hub findings in the AWS account. You can update the Lambda function code, specifically the finding_filter JSON value within the create_filter function, to pull findings for your use case. For example, to get only the active Security Hub findings that are still in the NEW workflow state, update the Lambda function code as follows.

{
    "WorkflowState": [
        {
            "Value": "NEW",
            "Comparison": "EQUALS"
        }
    ],
    "RecordState": [
        {
            "Value": "ACTIVE",
            "Comparison": "EQUALS"
        }
    ]
}

Export more than 100 million Security Hub findings

The example solution can export about 100 million Security Hub findings. This number is primarily determined by the speed at which findings can be exported, due to the following factors:

If you want to export more than 100 million Security Hub findings, do one of the following:

Use nested Step Functions workflows. For instructions, see Start Workflow Executions from a Task State.
Implement a pattern by using a Lambda function to start a new execution of your state machine to split ongoing work across multiple workflow executions. For more information, see the tutorial Continuing Ongoing Work as a New Execution.

Note: If you implement either of these solutions, make sure that the nextToken also gets passed to the new Step Functions execution by updating the Lambda function code to parse and pass the nextToken received in the last request.
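As a minimal sketch of the token handoff described in the note, the following function fetches a bounded number of pages and returns the token that the next execution should resume from. The function name, the max_pages budget, and the returned dictionary shape are assumptions for illustration and are not taken from the solution's Lambda code; client is expected to behave like boto3.client("securityhub").

```python
from typing import Any, Dict, List, Optional


def export_pages(client: Any, finding_filter: Dict[str, Any],
                 next_token: Optional[str], max_pages: int) -> Dict[str, Any]:
    """Fetch up to max_pages pages of findings, starting from next_token.

    Returns the findings plus the token the next execution should resume
    from (None when there are no more findings).
    """
    findings: List[Dict[str, Any]] = []
    for _ in range(max_pages):
        kwargs: Dict[str, Any] = {"Filters": finding_filter, "MaxResults": 100}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_findings(**kwargs)
        findings.extend(response.get("Findings", []))
        next_token = response.get("NextToken")
        if not next_token:
            break
    return {"findings": findings, "next_token": next_token}
```

The returned next_token would then be included in the input of the new Step Functions execution, so that the export continues where the previous execution stopped instead of starting over.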

Speed up the export

One way to increase the export bandwidth, and reduce the overall execution time, is to run the export job in parallel across the individual Security Hub member accounts rather than from the single delegated administrator account.

You could use CloudFormation StackSets to deploy this solution in each Security Hub member account and send the findings to a centralized S3 bucket. You would need to modify the solution to allow an S3 bucket to be provided as an input, and all the AWS Identity and Access Management (IAM) roles for the Lambda functions would need cross-account access to the S3 bucket and the corresponding AWS Key Management Service (AWS KMS) key. You would also need to make updates in each member account to iterate through the various Regions in which the Security Hub findings exist.

Next steps

The solution in this post is designed to assist in the retrieval and export of all existing findings currently in Security Hub. After you successfully run this solution to export historical findings, you can continuously export new Security Hub findings by using the sample solution in the aws-security-hub-findings-export GitHub repository.

Now that you’ve exported the Security Hub findings, you can set up and run custom complex reporting or queries against the S3 bucket by using Amazon Athena and AWS Glue. Additionally, you can run machine learning and analytics capabilities by using services like Amazon SageMaker or Amazon Lookout for Metrics.

Conclusion

In this post, you deployed a solution to export the existing Security Hub findings in your account to a central S3 bucket, so that you can apply complex analytics and machine learning to those findings. We walked you through how to use the solution and apply it to some example use cases after you successfully exported existing findings across your AWS environment. Now your security team can use the data in the S3 bucket for predictive analytics and determine if there are Security Hub findings and specific resources that might need to be prioritized for review due to a deviation from normal behavior. Additionally, you can use this solution to enable more complex analytics on multiple fields by querying large and complex datasets with Amazon Athena.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a thread on AWS Security Hub re:Post.

Want more AWS Security news? Follow us on Twitter.

Retain more for less with tiered storage for Amazon MSK

2022-10-28 Masudur Rahaman Sayem

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/retain-more-for-less-with-tiered-storage-for-amazon-msk/

Organizations are adopting Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to capture and analyze data in real-time. Amazon MSK allows you to build and run production applications on Apache Kafka without needing Kafka infrastructure management expertise or having to deal with the complex overheads associated with running Apache Kafka on your own. With increasing maturity, customers seek to build sophisticated use cases that combine aspects of real time and batch processing. For instance, you may want to train machine learning (ML) models based on historic data and then use these models to do real time inferencing. Or you may want to be able to recompute previous results when the application logic changed, e.g., when a new KPI is added to a streaming analytics application or when a bug was fixed that caused incorrect output. These use cases often require storing data for several weeks, months, or even years.

Apache Kafka is well positioned to support these kinds of use cases. Data is retained in the Kafka cluster as long as required by configuring the retention policy. In this way, the most recent data can be processed in real time for low-latency use cases while historic data remains accessible in the cluster and can be processed in a batch fashion.

However, retaining data in a Kafka cluster can become expensive because storage and compute are tightly coupled in a cluster. To scale storage, you need to add more brokers. But adding more brokers with the sole purpose of increasing the storage squanders the rest of the compute resources like CPU and memory. Also, a large cluster with more nodes adds operational complexity with a longer time to recover and rebalance when a broker fails. To avoid that operational complexity and higher cost, you can move your data to Amazon Simple Storage Service (Amazon S3) for long-term access, and use the cost-effective storage classes in Amazon S3 to optimize your overall storage cost. This solves the cost challenge, but now you have to build and maintain the part of the architecture that moves data to a different data store. You also need to build different data processing logic using different APIs for consuming data (Kafka API for streaming, Amazon S3 API for historic reads).

Today, we’re announcing Amazon MSK tiered storage, which brings a virtually unlimited and low-cost storage tier for Amazon MSK, making it simpler and more cost-effective for developers to build streaming data applications. Since the launch of Amazon MSK in 2019, we have enabled capabilities such as vertical scaling and automatic scaling of broker storage so you can operate your Kafka workloads in a cost-effective way. Earlier this year, we launched provisioned throughput, which enables you to scale I/O seamlessly without having to provision additional brokers. Tiered storage makes it even more cost-effective for you to run Kafka workloads. You can now store data in Apache Kafka without worrying about limits. You can effectively balance your performance and costs by using the performance-optimized primary storage for real-time data and the new low-cost tier for the historical data. With a few clicks, you can move streaming data into a lower-cost tier to store data and only pay for what you use.

Tiered storage frees you from making hard trade-offs between supporting the data retention needs of your application teams and the operational complexity that comes with it. This enables you to use the same code to process both real-time and historical data to minimize redundant workflows and simplify architectures. With Amazon MSK tiered storage, you can implement a Kappa architecture – a streaming-first software architecture deployment pattern – to use the same data processing pipeline for correctness and completeness of data over a much longer time horizon for business analysis.

How Amazon MSK tiered storage works

Let’s look at how tiered storage works for Amazon MSK. Apache Kafka stores data in files called log segments. As each segment completes, based on the segment size configured at cluster or topic level, it’s copied to the low-cost storage tier. Data is held in performance-optimized storage for a specified retention time, or up to a specified size, and then deleted. There is a separate time and size limit setting for the low-cost storage, which must be longer than the performance-optimized storage tier. If clients request data from segments stored in the low-cost tier, the broker reads the data from it and serves the data in the same way as if it were being served from the performance-optimized storage. The APIs and existing clients work with minimal changes. When your application starts reading data from the low-cost tier, you can expect an increase in read latency for the first few bytes. As you start reading the remaining data sequentially from the low-cost tier, you can expect latencies that are similar to the primary storage tier. With tiered storage, you pay for the amount of data you store and the amount of data you retrieve.
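The lifecycle described above can be sketched as a small function that reports where a closed log segment of a given age can be read from. This is a deliberately simplified model for illustration: it considers only time-based retention, ignores the size-based limits and the active (not yet closed) segment, and the function name and return strings are hypothetical.

```python
def segment_location(age_ms: int, local_retention_ms: int, retention_ms: int) -> str:
    """Where a closed log segment of the given age can be read from (simplified)."""
    if local_retention_ms >= retention_ms:
        raise ValueError("overall retention must be longer than local retention")
    if age_ms <= local_retention_ms:
        # Already copied to the low-cost tier, not yet deleted from primary storage.
        return "primary and low-cost tier"
    if age_ms <= retention_ms:
        return "low-cost tier only"
    return "deleted"


HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS
print(segment_location(1 * HOUR_MS, 4 * HOUR_MS, 7 * DAY_MS))
print(segment_location(2 * DAY_MS, 4 * HOUR_MS, 7 * DAY_MS))
print(segment_location(8 * DAY_MS, 4 * HOUR_MS, 7 * DAY_MS))
```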

For a pricing example, let’s consider a workload where your ingestion rate is 15 MB/s, with a replication factor of 3, and you want to retain data in your Kafka cluster for 7 days. Such a workload requires six m5.large brokers with 32.4 TB of EBS storage, which costs $4,755. If you use tiered storage for the same workload, with local retention of 4 hours and overall data retention of 7 days, it requires three m5.large brokers with 0.8 TB of EBS storage and 9 TB of tiered storage, which costs $1,584. Reading all the historic data once costs an additional $13 ($0.0015 per GB retrieval cost). In this example, tiered storage saves around 66% of your overall cost.
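The numbers in this example can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes 1 TB = 1,000 GB and 1 TB = 1,000,000 MB. The raw replicated volume it computes is lower than the 32.4 TB of EBS storage quoted, presumably because the quoted figure includes provisioning headroom that the example does not break down.

```python
# Figures quoted in the pricing example (US dollars).
COST_WITHOUT_TIERING = 4755.0
COST_WITH_TIERING = 1584.0
TIERED_DATA_TB = 9
RETRIEVAL_COST_PER_GB = 0.0015

# Raw data volume implied by the workload, before any headroom.
INGEST_MB_PER_S = 15
REPLICATION_FACTOR = 3
seconds_in_7_days = 7 * 24 * 3600
raw_replicated_tb = INGEST_MB_PER_S * seconds_in_7_days * REPLICATION_FACTOR / 1_000_000

# Cost of reading all tiered data once.
full_read_cost = TIERED_DATA_TB * 1000 * RETRIEVAL_COST_PER_GB

# Savings when the one-time full read is included in the tiered cost.
savings = 1 - (COST_WITH_TIERING + full_read_cost) / COST_WITHOUT_TIERING

print(f"{raw_replicated_tb:.1f} TB of replicated data over 7 days")
print(f"${full_read_cost:.2f} to read all tiered data once")
print(f"{savings:.0%} saved")
```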

Get started using Amazon MSK tiered storage

To enable tiered storage on your existing cluster, upgrade your MSK cluster to Kafka version 2.8.2.tiered and then choose Tiered storage and EBS storage as your cluster storage mode on the Amazon MSK console.

After tiered storage is enabled at the cluster level, run the following command to enable tiered storage on an existing topic. In this example, you enable tiered storage on a topic called msk-ts-topic, set 7 days of retention (local.retention.ms=604800000) in the local high-performance storage tier, set 180 days of retention (retention.ms=15552000000) in the low-cost storage tier, and update the log segment size to 48 MB (segment.bytes=50331648):

bin/kafka-configs.sh --bootstrap-server $bsrv --alter --entity-type topics --entity-name msk-ts-topic --add-config 'remote.storage.enable=true, local.retention.ms=604800000, retention.ms=15552000000, segment.bytes=50331648'
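The millisecond and byte values are easy to get wrong by hand, so it can help to derive them. The following sketch computes the values for 7 days, 180 days, and 48 MB and joins them into an --add-config string; the helper names are illustrative. Note that exactly 180 days is 15,552,000,000 ms.

```python
def days_to_ms(days: int) -> int:
    """Convert a retention period in days to milliseconds."""
    return days * 24 * 60 * 60 * 1000


def mb_to_bytes(mb: int) -> int:
    """Convert a segment size in MB (1 MB = 1024 * 1024 bytes) to bytes."""
    return mb * 1024 * 1024


topic_config = {
    "remote.storage.enable": "true",
    "local.retention.ms": days_to_ms(7),
    "retention.ms": days_to_ms(180),
    "segment.bytes": mb_to_bytes(48),
}

# Value to pass to kafka-configs.sh --add-config
add_config = ",".join(f"{key}={value}" for key, value in topic_config.items())
print(add_config)
```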

Availability and pricing

Amazon MSK tiered storage is available in all AWS Regions where Amazon MSK is available, excluding the AWS China and AWS GovCloud Regions. This low-cost storage tier scales to virtually unlimited storage and requires no upfront provisioning. You pay only for the volume of data retained and retrieved in the low-cost tier.

For more information about this feature and its pricing, see the Amazon MSK developer guide and Amazon MSK pricing page. For finding the right sizing for your cluster, see the best practices page.

Summary

With Amazon MSK tiered storage, you don’t need to provision storage for the low-cost tier or manage the infrastructure. Tiered storage enables you to scale to virtually unlimited storage. You can access data in the low-cost tier using the same clients you currently use to read data from the high-performance primary storage tier. Apache Kafka’s consumer API, streams API, and connectors consume data from both tiers without changes. You can modify the retention limits on the low-cost storage tier in the same way that you modify the retention limits on the high-performance storage.

Enable tiered storage on your MSK clusters today to retain data longer at a lower cost.

About the Author

Masudur Rahaman Sayem is a Streaming Architect at AWS. He works with AWS customers globally to design and build data streaming architecture to solve real-world business problems. He is passionate about distributed systems. He also likes to read, especially classic comic books.

Measure the adoption of your Amazon QuickSight dashboards and view your BI portfolio in a single pane of glass

2022-10-28 Maitri Brahmbhatt

Post Syndicated from Maitri Brahmbhatt original https://aws.amazon.com/blogs/big-data/measure-the-adoption-of-your-amazon-quicksight-dashboards-and-view-your-bi-portfolio-in-a-single-pane-of-glass/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service. If you plan to deploy enterprise-grade QuickSight dashboards, measuring user adoption and usage patterns is an important ingredient for the success of your BI investment. For example, knowing usage patterns such as geo location, department, and job role can help you tailor your dashboards to the right audience. Furthermore, dashboard usage data helps you improve the return on your BI investment: for example, you can reduce license costs by identifying inactive QuickSight authors.

In this post, we introduce the latest Admin Console, an AWS packaged solution that you can easily deploy and use to create a usage and inventory dashboard for your QuickSight assets. The Admin Console helps identify usage patterns of individual users and dashboards. It can also help you track which dashboards and groups you have or need access to, and what you can do with that access, by providing more details on QuickSight group and user permissions and activities and QuickSight asset (dashboards, analyses, and datasets) permissions. With timely access to interactive usage metrics, the Admin Console can help BI leaders and administrators make a cost-efficient plan for dashboard improvements. Another common use case of this dashboard is to provide a centralized repository of the QuickSight assets. QuickSight artifacts consist of multiple types of assets (dashboards, analyses, datasets, and more) with dependencies between them. Having a single repository to view all assets and their dependencies can be an important element in your enterprise data dictionary.

This post demonstrates how to build the Admin Console using a serverless data pipeline. With basic AWS knowledge, you can create this solution in your own environment within an hour. Alternatively, you can dive deep into the source code to meet your specific needs.

Admin Console dashboard

The following animation displays the contents of our demo dashboard.

The Admin Console dashboard includes six sheets:

Landing Page – Provides drill-down into each detailed tab.
User Analysis – Provides detailed analysis of the user behavior and identifies active and inactive users and authors.
Dashboard Analysis – Shows the most commonly viewed dashboards.
Assets Access Permissions – Provides information on permissions applied to each asset, such as dashboard, analysis, datasets, data source, and themes.
Data Dictionary – Provides information on the relationships between each of your assets, such as which analysis was used to build each dashboard, and which datasets and data sources are being used in each analysis. It also provides details on each dataset, including schema name, table name, columns, and more.
Overview – Provides instructions on how to use the dashboard.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

Let’s look at Forwood Safety, an innovative, values-driven company with a laser focus on fatality prevention. An early adopter of QuickSight, they collaborated with AWS to deploy this solution to collect BI application usage insights.

“Our engineers love this admin console solution,” says Faye Crompton, Leader of Analytics and Benchmarking at Forwood. “It helps us to understand how users analyze critical control learnings by helping us to quickly identify the most frequently visited dashboards in Forwood’s self-service analytics and reporting tool, FAST.”

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

The AWS Lambda function Data_Prepare is scheduled to run hourly. This function calls QuickSight APIs to get the QuickSight namespace, group, user, and asset access permissions information.
The Lambda function Dataset_Info is scheduled to run hourly. This function calls QuickSight APIs to get dashboard, analysis, dataset, and data source information.
Both the functions save the results to an Amazon Simple Storage Service (Amazon S3) bucket.
AWS CloudTrail logs are stored in an S3 bucket.
Based on the files in Amazon S3 that contain user-group information, dataset information, and QuickSight asset access permissions information, as well as the dashboard views and user login events from the CloudTrail logs, five Amazon Athena tables are created. Optionally, the BI engineer can combine these tables with employee information tables to display human resource information of the users.
Four QuickSight datasets fetch the data from the Athena tables created in Step 5 and import them into SPICE. Then, based on these datasets, a QuickSight dashboard is created.
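To make step 1 of the workflow more concrete, the following is a minimal sketch of how group membership information could be collected through the QuickSight API and turned into CSV text. It is not the code of the Data_Prepare function: the function names and the namespace,group,user column layout are assumptions, and client is expected to behave like boto3.client("quicksight").

```python
import csv
import io
from typing import Any, Callable, Dict, Iterator, List


def paginate(call: Callable[..., Dict[str, Any]], result_key: str,
             **kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Yield items from a QuickSight list_* call, following NextToken."""
    while True:
        response = call(**kwargs)
        yield from response.get(result_key, [])
        token = response.get("NextToken")
        if not token:
            return
        kwargs["NextToken"] = token


def group_membership_csv(client: Any, account_id: str,
                         namespace: str = "default") -> str:
    """Build CSV text with one namespace,group,user row per group member."""
    rows: List[List[str]] = []
    for group in paginate(client.list_groups, "GroupList",
                          AwsAccountId=account_id, Namespace=namespace):
        for member in paginate(client.list_group_memberships, "GroupMemberList",
                               GroupName=group["GroupName"],
                               AwsAccountId=account_id, Namespace=namespace):
            rows.append([namespace, group["GroupName"], member["MemberName"]])
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

A Lambda function built along these lines would then upload the returned text to the S3 bucket, where Athena can query it.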

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
Access to the following AWS services:
- Amazon Athena
- AWS CloudFormation
- AWS Lambda
- Amazon QuickSight
- Amazon Simple Storage Service (Amazon S3)
Basic knowledge of Python
Basic knowledge of SQL

Create solution resources

We can create all the resources needed for this dashboard using three CloudFormation templates: one for Lambda functions, one for Athena tables, and one for QuickSight objects.

CloudFormation template for Lambda functions

This template creates the Lambda functions data_prepare and dataset_info.

Choose Launch Stack and follow the steps to create these resources.

After the stack creation is successful, you have two Lambda functions, data_prepare and dataset_info, and one S3 bucket named admin-console[AWS-account-ID]. You can verify if the Lambda function can run successfully and if the group_membership, object_access, datasets_info, and data_dictionary folders are created in the S3 bucket under admin-console[AWS-account-ID]/monitoring/quicksight/, as shown in the following screenshots.

The Data_Prepare Lambda function is scheduled to run hourly with the CloudWatch Events rule admin-console-every-hour. This function calls the QuickSight Assets APIs to get QuickSight users, assets, and the access permissions information. Finally, this function creates two files, group_membership.csv and object_access.csv, and saves these files to an S3 bucket.

The Dataset_Info Lambda function is scheduled to run hourly and calls the QuickSight Assets APIs to get datasets, schemas, tables, and fields (columns) information. Then this function creates two files, datasets_info.csv and data_dictionary.csv, and saves these files to an S3 bucket.

Create a CloudTrail log if you don’t already have one and note down the S3 bucket name of the log files for future use.
Note down all the resources created from the previous steps. If the S3 bucket name for the CloudTrail log from step 2 is different from the one in step 1’s output, use the S3 bucket from step 2.

The following table summarizes the keys and values you use when creating the Athena tables with the next CloudFormation stack.

Key	Value	Description
cloudtraillog	`s3://cloudtrail-awslogs-[aws-account-id]-do-not-delete/AWSLogs/[aws-account-id]/CloudTrail`	The Amazon S3 location of the CloudTrail log
cloudtraillogtablename	`cloudtrail_logs`	The table name of CloudTrail log
groupmembership	`s3://admin-console[aws-account-id]/monitoring/quicksight/group_membership`	The Amazon S3 location of `group_membership.csv`
objectaccess	`s3://admin-console[aws-account-id]/monitoring/quicksight/object_access`	The Amazon S3 location of `object_access.csv`
dataset info	`s3://admin-console[aws-account-id]/monitoring/quicksight/datasets_info`	The Amazon S3 location of `datasets_info.csv`
datadict	`s3://admin-console[aws-account-id]/monitoring/quicksight/data_dictionary`	The Amazon S3 location of `data_dictionary.csv`

CloudFormation template for Athena tables

To create your Athena tables, complete the following steps:

Download the following JSON file.
Edit the file and replace the corresponding fields with the keys and values you noted in the previous section.

For example, search for the groupmembership keyword.

Then replace the location value with the Amazon S3 location for the groupmembership folder.

Create Athena tables by deploying this edited file as a CloudFormation template. For instructions, refer to Get started.

After a successful deployment, you have a database called admin-console created in AwsDataCatalog in Athena and five tables in the database: cloudtrail_logs, group_membership, object_access, datasets_info, and data_dict.

Confirm the tables via the Athena console.

The following screenshot shows sample data of the group_membership table.

The following screenshot shows sample data of the object_access table.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

After all five tables are created in Athena, go to the security permissions on the QuickSight console to enable bucket access for s3://admin-console[AWS-account-ID] and s3://cloudtrail-awslogs-[aws-account-id]-do-not-delete.
Enable Athena access under Security & Permissions.

Now QuickSight can access all five tables through Athena.

CloudFormation template for QuickSight objects

To create the QuickSight objects, complete the following steps:

Get the QuickSight admin user’s ARN by running the following command in the AWS Command Line Interface (AWS CLI):
```
aws quicksight describe-user --aws-account-id [aws-account-id] --namespace default --user-name [admin-user-name]
```
For example: arn:aws:quicksight:us-east-1:12345678910:user/default/admin/xyz.

Choose Launch Stack to create the QuickSight datasets and dashboard:

Provide the ARN you noted earlier.

After a successful deployment, four datasets named Admin-Console-Group-Membership, Admin-Console-dataset-info, Admin-Console-Object-Access, and Admin-Console-CFN-Main are created and you have the dashboard named admin-console-dashboard. If modifying the dashboard is preferred, use the dashboard save-as option, then recreate the analysis, make modifications, and publish a new dashboard.

Set your preferred SPICE refresh schedule for the four SPICE datasets, and share the dashboard in your organization as needed.

Dashboard demo

The following screenshot shows the Admin Console Landing page.

The following screenshot shows the User Analysis sheet.

The following screenshot shows the Dashboards Analysis sheet.

The following screenshot shows the Access Permissions sheet.

The following screenshot shows the Data Dictionary sheet.

The following screenshot shows the Overview sheet.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

You can reference the public template of the preceding dashboard in create-template, create-analysis, and create-dashboard API calls to create this dashboard and analysis in your account. The public template of this dashboard with the template ARN is 'TemplateArn': 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console'.
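As a hedged illustration of the create-dashboard call, the following sketch builds the keyword arguments that the QuickSight create_dashboard API expects when a dashboard is created from a template. The function name and default dashboard ID are hypothetical, and the dataset placeholder names are not given in this post, so you would need to look them up (for example, with describe_template) before using it.

```python
from typing import Any, Dict

TEMPLATE_ARN = "arn:aws:quicksight:us-east-1:889399602426:template/admin-console"


def dashboard_request(account_id: str, dataset_arns: Dict[str, str],
                      dashboard_id: str = "admin-console-dashboard") -> Dict[str, Any]:
    """Build keyword arguments for the QuickSight create_dashboard call.

    dataset_arns maps each dataset placeholder defined in the template to the
    ARN of a dataset in your own account (placeholder names are assumptions).
    """
    return {
        "AwsAccountId": account_id,
        "DashboardId": dashboard_id,
        "Name": dashboard_id,
        "SourceEntity": {
            "SourceTemplate": {
                "Arn": TEMPLATE_ARN,
                "DataSetReferences": [
                    {"DataSetPlaceholder": placeholder, "DataSetArn": arn}
                    for placeholder, arn in dataset_arns.items()
                ],
            }
        },
    }
```

With boto3, the result could be passed as boto3.client("quicksight").create_dashboard(**dashboard_request(...)), typically together with a Permissions list for the users who should see the dashboard.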

Tips and tricks

Here are some advanced tips and tricks to build the dashboard as the Admin Console to analyze usage metrics. The following steps are based on the dataset admin_console. You can apply the same logic to create the calculated fields to analyze user login activities.

Create parameters – For example, we can create a parameter called InActivityMonths, as in the following screenshot. Similarly, we can create other parameters such as InActivityDays, Start Date, and End Date.

Create controls based on the parameters – In the following screenshot, we create controls based on the start and end date.

Create calculated fields – For instance, we can create a calculated field to detect the active or inactive status of QuickSight authors. If the time span between the latest view dashboard activity and now is greater than or equal to the number defined in the Inactivity Months control, the author status is Inactive. The following screenshot shows the relevant code. According to the end-user’s requirements, we can define several calculated fields to perform the analysis.

Create visuals – For example, we create an insight to display the top three dashboard views by reader and a visual to display the authors of these dashboards.

Add URL actions – You can add a URL action to define some extra features to email inactive authors or check details of users.
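The author-status rule from the calculated-fields tip above can be expressed outside QuickSight as well. The following sketch shows only the logic, not the QuickSight expression itself; the function names are hypothetical, and counting whole calendar months while ignoring the day of the month is an assumption that may differ from how QuickSight's date functions count.

```python
from datetime import date


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from earlier to later (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def author_status(last_view: date, today: date, inactivity_months: int) -> str:
    """Inactive when the last dashboard view is at least inactivity_months old."""
    if months_between(last_view, today) >= inactivity_months:
        return "Inactive"
    return "Active"


print(author_status(date(2022, 7, 15), date(2022, 10, 28), 3))  # Inactive
print(author_status(date(2022, 9, 1), date(2022, 10, 28), 3))   # Active
```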

The following sample code defines the action to email inactive authors:

mailto:<<email>>?subject=Alert to inactive author! &body=Hi, <<username>>, any author without activity for more than a month will be deleted. Please log in to your QuickSight account to continue accessing and building analyses and dashboards!
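In the URL action, QuickSight substitutes the <<email>> and <<username>> placeholders for you. If you build a similar link outside QuickSight, spaces and punctuation in the subject and body should be percent-encoded. The following is a small sketch of that, with a hypothetical function name and example values.

```python
from urllib.parse import quote


def inactive_author_mailto(email: str, username: str) -> str:
    """Build a mailto link with a percent-encoded subject and body."""
    subject = "Alert to inactive author!"
    body = (f"Hi, {username}, any author without activity for more than a month "
            "will be deleted. Please log in to your QuickSight account to continue "
            "accessing and building analyses and dashboards!")
    return f"mailto:{email}?subject={quote(subject)}&body={quote(body)}"


print(inactive_author_mailto("jane@example.com", "jane"))
```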

Clean up

To avoid incurring future charges, delete all the resources you created with the CloudFormation templates.

Conclusion

This post discussed how BI administrators can use QuickSight, CloudTrail, and other AWS services to create a centralized view to analyze QuickSight usage metrics. We also presented a serverless data pipeline to support the Admin Console dashboard.

If you would like to have a demo, please email us.

Appendix

We can perform some additional sophisticated analysis to collect advanced usage metrics. For example, Forwood Safety raised a unique request to analyze the readers who log in but don’t perform any dashboard view actions (see the following code). This helps their clients identify and prevent wasted reader session fees. Leadership teams value the ability to minimize uneconomical user activity.

CREATE OR REPLACE VIEW "loginwithoutviewdashboard" AS
-- Login events (SAML federation) per user and Region
WITH login AS
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventname = 'AssumeRoleWithSAML'
GROUP BY 1,2,3),
-- Dashboard view events per user and Region
dashboard AS
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventsource = 'quicksight.amazonaws.com'
AND
eventname = 'GetDashboard'
GROUP BY 1,2,3),
-- Classify each user as Author or Reader based on group membership
users AS
(SELECT namespace,
"group",
"user" AS user_name,
(CASE
WHEN "group" IN ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin')
THEN 'Author'
ELSE 'Reader'
END)
AS author_status
FROM "group_membership")
-- Reader logins with no dashboard view in the 30 minutes after the login
SELECT l.*
FROM login AS l
JOIN users AS u
ON l.user_name = u.user_name
LEFT JOIN dashboard AS d
ON l.user_name = d.user_name
AND l.awsregion = d.awsregion
AND d.event_time >= l.event_time
AND d.event_time <= (l.event_time + interval '30' minute)
WHERE d.user_name IS NULL
AND u.author_status = 'Reader'

About the Authors

Ying Wang is a Manager of Software Development Engineer. She has 12 years of expertise in data analytics and science. She assisted customers with enterprise data architecture solutions to scale their data analytics in the cloud during her time as a data architect. Currently, she helps customers unlock the power of data with QuickSight by delivering new features from the engineering side.

Ian Liao is a Senior Data Visualization Architect at AWS Professional Services. Before AWS, Ian spent years building startups in data and analytics. Now he enjoys helping customers scale their data applications on the cloud.

Maitri Brahmbhatt is a Business Intelligence Engineer at AWS. She helps customers and partners leverage their data to gain insights into their business and make data driven decisions by developing QuickSight dashboards.

Deploy DataHub using AWS managed services and ingest metadata from AWS Glue and Amazon Redshift – Part 2

2022-10-25 Corvus Lee

Post Syndicated from Corvus Lee original https://aws.amazon.com/blogs/big-data/part-2-deploy-datahub-using-aws-managed-services-and-ingest-metadata-from-aws-glue-and-amazon-redshift/

In the first post of this series, we discussed the need for a metadata management solution for organizations. We used DataHub as an open-source metadata platform for metadata management and deployed it using AWS managed services with the AWS Cloud Development Kit (AWS CDK).

In this post, we focus on how to populate technical metadata from the AWS Glue Data Catalog and Amazon Redshift into DataHub, and how to augment data with a business glossary and visualize data lineage of AWS Glue jobs.

Overview of solution

The following diagram illustrates the solution architecture and its key components:

DataHub runs on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, using Amazon OpenSearch Service, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon RDS for MySQL as the storage layer for the underlying data model and indexes.
The solution pulls technical metadata from AWS Glue and Amazon Redshift to DataHub.
We enrich the technical metadata with a business glossary.
Finally, we run an AWS Glue job to transform the data and observe the data lineage in DataHub.

In the following sections, we demonstrate how to ingest the metadata using various methods, enrich the dataset, and capture the data lineage.

Pull technical metadata from AWS Glue and Amazon Redshift

In this step, we look at three different approaches to ingest metadata into DataHub for search and discovery.

DataHub supports both push-based and pull-based metadata ingestion. Push-based integrations (for example, Spark) allow you to emit metadata directly from your data systems when metadata changes, whereas pull-based integrations allow you to extract metadata from the data systems in a batch or incremental-batch manner. In this section, you pull technical metadata from the AWS Glue Data Catalog and Amazon Redshift using the DataHub web interface, Python, and the DataHub CLI.

Ingest data using the DataHub web interface

In this section, you use the DataHub web interface to ingest technical metadata. This method supports both the AWS Glue Data Catalog and Amazon Redshift, but we focus on Amazon Redshift here as a demonstration.

As a prerequisite, you need an Amazon Redshift cluster with sample data, accessible from the EKS cluster hosting DataHub (default TCP port 5439).

Create an access token

Complete the following steps to create an access token:

Go to the DataHub web interface and choose Settings.
Choose Generate new token.
Enter a name (GMS_TOKEN), optional description, and expiry date and time.
Copy the value of the token to a safe place.

Create an ingestion source

Next, we configure Amazon Redshift as our ingestion source.

On the DataHub web interface, choose Ingestion.
Choose Generate new source.
Choose Amazon Redshift.
In the Configure Recipe step, enter the values of host_port and database of your Amazon Redshift cluster and keep the rest unchanged:

# Coordinates
host_port: example.something.<region>.redshift.amazonaws.com:5439
database: dev

The values for ${REDSHIFT_USERNAME}, ${REDSHIFT_PASSWORD}, and ${GMS_TOKEN} reference secrets that you set up in the next step.

Choose Next.
For the run schedule, enter your desired cron syntax or choose Skip.
Enter a name for the data source (for example, Amazon Redshift demo) and choose Done.
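
The ${...} placeholders in the recipe are resolved from DataHub secrets when the ingestion runs. The following Python sketch only illustrates the substitution idea using the standard library. It is not DataHub's implementation, and the resolve_placeholders helper and the sample values are hypothetical.

```python
from string import Template


def resolve_placeholders(recipe_text: str, secrets: dict) -> str:
    """Replace ${NAME} placeholders with values from a secrets mapping.

    Raises KeyError when the recipe references a secret that is not defined,
    which is why every referenced secret must exist before the source runs.
    """
    return Template(recipe_text).substitute(secrets)


# Hypothetical recipe fragment and secret values, for illustration only.
recipe = "username: ${REDSHIFT_USERNAME}\npassword: ${REDSHIFT_PASSWORD}\n"
secrets = {
    "REDSHIFT_USERNAME": "awsuser",
    "REDSHIFT_PASSWORD": "example-password",
}

resolved = resolve_placeholders(recipe, secrets)
print(resolved)
```

If a secret such as REDSHIFT_PASSWORD is missing from the mapping, the sketch fails fast instead of sending an unresolved placeholder to the database.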

Create secrets for the data source recipe

To create your secrets, complete the following steps:

On the DataHub Manage Ingestion page, choose Secrets.
Choose Create new secret.
For Name, enter REDSHIFT_USERNAME.
For Value, enter awsuser (default admin user).
For Description, enter an optional description.
Repeat these steps for REDSHIFT_PASSWORD and GMS_TOKEN.

Run metadata ingestion

To ingest the metadata, complete the following steps:

On the DataHub Manage Ingestion page, choose Sources.
Choose Execute next to the Amazon Redshift source you just created.
Choose Execute again to confirm.
Expand the source and wait for the ingestion to complete, or check the error details (if any).

Tables in the Amazon Redshift cluster are now populated in DataHub. You can view these by navigating to Datasets > prod > redshift > dev > public > users.

You’ll further work on enriching this table metadata using the DataHub CLI in a later step.

Ingest data using Python code

In this section, you use Python code to ingest technical metadata into DataHub, using the AWS Glue Data Catalog as an example data source.

As a prerequisite, you need a sample database and table in the Data Catalog. You also need an AWS Identity and Access Management (IAM) user with the required IAM permissions:

{
    "Effect": "Allow",
    "Action": [
        "glue:GetDatabases",
        "glue:GetTables"
    ],
    "Resource": [
        "arn:aws:glue:$region-id:$account-id:catalog",
        "arn:aws:glue:$region-id:$account-id:database/*",
        "arn:aws:glue:$region-id:$account-id:table/*"
    ]
}

Note the GMS_ENDPOINT value for DataHub by running kubectl get svc, and locate the load balancer URL and port number (8080) for the service datahub-datahub-gms.

Install the DataHub client

To install the DataHub client with AWS Cloud9, complete the following steps:

Open the AWS Cloud9 IDE and start the terminal.
Create a new virtual environment and install the DataHub client:

# Install the virtualenv
python3 -m venv datahub
# Activate the virtualenv
source datahub/bin/activate
# Install/upgrade datahub client
pip3 install --upgrade acryl-datahub

Check the installation:

datahub version

If DataHub is successfully installed, you see the following output:

DataHub CLI version: 0.8.44.4
Python version: 3.X.XX (default,XXXXX)

Install the DataHub plugin for AWS Glue:

pip3 install --upgrade 'acryl-datahub[glue]'

Prepare and run the ingestion Python script

Complete the following steps to ingest the data:

Download glue_ingestion.py from the GitHub repository.
Edit the values of both the source and sink objects:

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {
                "aws_access_key_id": "<aws_access_key>",
                "aws_secret_access_key": "<aws_secret_key>",
                "aws_region": "<aws_region>",
                "emit_s3_lineage" : False,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://<your_gms_endpoint.region.elb.amazonaws.com:8080>",
                 "token": "<your_gms_token_string>"
                },
        },
    }
)

# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()

For production purposes, use an IAM role instead of access keys, and store other parameters and credentials in AWS Systems Manager Parameter Store or AWS Secrets Manager.

To view all configuration options, refer to Config Details.

Run the script within the DataHub virtual environment:

python3 glue_ingestion.py

If you navigate back to the DataHub web interface, the databases and tables in your AWS Glue Data Catalog should appear under Datasets > prod > glue.
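
As a minimal sketch of the production recommendation, the recipe can be built without hard-coded credentials. This assumes the AWS Glue source falls back to the default AWS credential chain (for example, an attached IAM role) when no access keys are supplied. The environment variable names AWS_REGION, GMS_ENDPOINT, and GMS_TOKEN and the build_recipe helper are hypothetical choices for this illustration.

```python
import os


def build_recipe(env=os.environ) -> dict:
    """Build the ingestion recipe from environment-style settings.

    AWS access keys are omitted on purpose, on the assumption that the
    AWS SDK then resolves credentials from its default chain (IAM role).
    """
    return {
        "source": {
            "type": "glue",
            "config": {
                "aws_region": env["AWS_REGION"],
                "emit_s3_lineage": False,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": env["GMS_ENDPOINT"],
                "token": env["GMS_TOKEN"],
            },
        },
    }


# Example with an explicit mapping instead of the real environment.
example = build_recipe(
    {
        "AWS_REGION": "us-east-1",
        "GMS_ENDPOINT": "http://example.elb.amazonaws.com:8080",
        "GMS_TOKEN": "example-token",
    }
)
print(sorted(example["source"]["config"]))
```

The resulting dictionary has the same shape as the one passed to Pipeline.create in glue_ingestion.py, so it could replace the inline literal there. In a real deployment, the values would be loaded from Parameter Store or Secrets Manager before the script runs.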

Ingest data using the DataHub CLI

In this section, you use the DataHub CLI to ingest a sample business glossary about data classification, personal information, and more.

As a prerequisite, you must have the DataHub CLI installed in the AWS Cloud9 IDE. If not, go through the steps in the previous section.

Prepare and ingest the business glossary

Complete the following steps:

Open the AWS Cloud9 IDE.
Download business_glossary.yml from the GitHub repository.
Optionally, you can explore the file and add custom definitions (refer to Business Glossary for more information).
Download business_glossary_to_datahub.yml from the GitHub repository.
Edit the full path to the business glossary definition file, GMS endpoint, and GMS token:

source:
  type: datahub-business-glossary
  config:
    file: /home/ec2-user/environment/business_glossary.yml    

sink:
  type: datahub-rest 
  config:
    server: 'http://<your_gms_endpoint.region.elb.amazonaws.com:8080>'
    token:  '<your_gms_token_string>'

Run the following code:

datahub ingest -c business_glossary_to_datahub.yml

Navigate back to the DataHub interface, and choose Govern, then Glossary.

You should now see the new business glossary to use in the next section.

Enrich the dataset with more metadata

In this section, we enrich a dataset with additional context, including description, tags, and a business glossary, to help data discovery.

As a prerequisite, follow the earlier steps to ingest the metadata of the sample database from Amazon Redshift, and ingest the business glossary from a YAML file.

In the DataHub web interface, browse to Datasets > prod > redshift > dev > public > users.
Starting at the table level, we add related documentation and a link to the About section.

This allows analysts to understand the table relationships at a glance, as shown in the following screenshot.

To further enhance the context, add the following:
- Column description.
- Tags for the table and columns to aid search and discovery.
- Business glossary terms to organize data assets using a shared vocabulary. For example, we define userid in the USERS table as an account in business terms.
- Owners.
- A domain to group data assets into logical collections. This is useful when designing a data mesh on AWS.

Now we can search using the additional context. For example, searching for the term email with the tag tickit correctly returns the USERS table.

We can also search using tags, such as tags:"PII" OR fieldTags:"PII" OR editedFieldTags:"PII".
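
The tag query above repeats the same value across three search fields. A small helper (hypothetical, not part of DataHub) can assemble it for any tag:

```python
def tag_query(tag: str) -> str:
    """Build a search string that matches a tag on the dataset itself,
    on ingested fields, or on fields edited in the DataHub UI."""
    fields = ("tags", "fieldTags", "editedFieldTags")
    return " OR ".join(f'{field}:"{tag}"' for field in fields)


print(tag_query("PII"))
# tags:"PII" OR fieldTags:"PII" OR editedFieldTags:"PII"
```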

In the following example, we search using the field description fieldDescriptions:The user's home state, such as GA.

Feel free to explore the search features in DataHub to enhance the data discovery experience.

Capture data lineage

In this section, we create an AWS Glue job to capture the data lineage. This requires use of a datahub-spark-lineage JAR file as an additional dependency.

Download the NYC yellow taxi trip records for January 2022 (in Parquet file format) and save the file under s3://<<Your S3 Bucket>>/tripdata/.
Create an AWS Glue crawler pointing to s3://<<Your S3 Bucket>>/tripdata/ and create a landing table called landing_nyx_taxi inside the database nyx_taxi.
Download the datahub-spark-lineage JAR file (v0.8.41-3-rc3) and store it in s3://<<Your S3 Bucket>>/externalJar/.
Download the log4j.properties file and store it in s3://<<Your S3 Bucket>>/externalJar/.
Create a target table using the following SQL script.

The AWS Glue job reads the data in Parquet file format using the landing table, performs some basic data transformation, and writes it to the target table in Parquet format.

Create an AWS Glue Job using the following script and modify your GMS_ENDPOINT, GMS_TOKEN, and source and target database table name.
On the Job details tab, provide the IAM role and disable job bookmarks.

Add the path of datahub-spark-lineage (s3://<<Your S3 Bucket>>/externalJar/datahub-spark-lineage-0.8.41-3-rc3.jar) for Dependent JAR path.
Enter the path of log4j.properties for Referenced files path.

The job reads the data from the landing table as a Spark DataFrame and then inserts the data into the target table. The JAR is a lightweight Java agent that listens for Spark application job events and pushes metadata out to DataHub in real time. The lineage of datasets that are read and written is captured. Events such as application start and end, and SQLExecution start and end are captured. This information can be seen under pipelines (DataJob) and tasks (DataFlow) in DataHub.

Run the AWS Glue job.

When the job is complete, you can see the lineage information is being populated in the DataHub UI.

The preceding lineage shows the data is being read from a table backed by an Amazon Simple Storage Service (Amazon S3) location and written to an AWS Glue Data Catalog table. The Spark run details like query run ID are captured, which can be mapped back to the Spark UI using the Spark application name and Spark application ID.
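
For orientation, the lineage agent is typically enabled through Spark configuration. The following sketch assembles the settings as a plain dictionary. The property names follow the DataHub Spark lineage documentation as we understand it, so verify them against the JAR version you use. The lineage_spark_conf helper is hypothetical and is not the job script referenced earlier.

```python
def lineage_spark_conf(gms_endpoint: str, gms_token: str) -> dict:
    """Return Spark properties that register the DataHub lineage listener.

    Assumed property names (check the datahub-spark-lineage docs):
    - spark.extraListeners registers the listener class shipped in the JAR
    - spark.datahub.rest.server / .token point the listener at the GMS
    """
    return {
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.rest.server": gms_endpoint,
        "spark.datahub.rest.token": gms_token,
    }


conf = lineage_spark_conf(
    "http://example.elb.amazonaws.com:8080",  # placeholder GMS endpoint
    "example-token",                          # placeholder GMS token
)
for key, value in conf.items():
    print(f"{key}={value}")
```

In a Spark job, each key-value pair would be applied to the session builder before the session is created, because listeners are registered at startup.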

Clean up

To avoid incurring future charges, complete the following steps to delete the resources:

Run helm uninstall datahub and helm uninstall prerequisites.
Run cdk destroy --all.
Delete the AWS Cloud9 environment.

Conclusion

In this post, we demonstrated how to search and discover data assets stored in your data lake (via the AWS Glue Data Catalog) and data warehouse in Amazon Redshift. You can augment data assets with a business glossary, and visualize the data lineage of AWS Glue jobs.

About the Authors

Debadatta Mohapatra is an AWS Data Lab Architect. He has extensive experience across big data, data science, and IoT, across consulting and industrials. He is an advocate of cloud-native data platforms and the value they can drive for customers across industries.

Corvus Lee is a Solutions Architect for AWS Data Lab. He enjoys all kinds of data-related discussions, and helps customers build MVPs using AWS databases, analytics, and machine learning services.

Suraj Bang is a Sr Solutions Architect at AWS. In this role, Suraj helps AWS customers with their analytics, database, and machine learning use cases, architects solutions to solve their business problems, and helps them build scalable prototypes.

Deploy DataHub using AWS managed services and ingest metadata from AWS Glue and Amazon Redshift – Part 1

2022-10-25 Debadatta Mohapatra

Post Syndicated from Debadatta Mohapatra original https://aws.amazon.com/blogs/big-data/part-1-deploy-datahub-using-aws-managed-services-and-ingest-metadata-from-aws-glue-and-amazon-redshift/

Many organizations are establishing enterprise data warehouses, data lakes, or a modern data architecture on AWS to build data-driven products. As the organization grows, the number of publishers and subscribers to data and the volume of data keeps increasing. Additionally, different varieties of datasets are introduced (structured, semistructured, and unstructured). This can lead to metadata management issues, and the following questions:

“Can I trust this data?”
“Where does this data (lineage) come from?”
“How accurate is this data?”
“What does this column mean in my business terminology?”
“Who is the owner of this data?”
“When was the data last refreshed?”
“How can I classify the data (PII, non-PII, and so on) and build data governance?”

Metadata conveys both technical and business context to help you understand your data better and use it appropriately. It provides two primary types of information about data assets:

Technical metadata – Information about the structure of the data, such as schema and how the data is populated
Business metadata – Information in business terms, such as table and column description, owner, and data profile

Metadata management becomes a key element to allow users (data analysts, data scientists, data engineers, and data owners) to discover and locate the right data assets to address business requirements and perform data governance. Some common features of metadata management are:

Search and discovery – Data schemas, fields, tags, usage information
Access control – Access control, groups, users, policies
Data lineage – Pipeline runs, queries, transformation logic
Compliance – Taxonomy of data privacy, compliance annotation types
Classification – Classify different datasets and data elements
Data quality – Data quality rule definitions, run results, data profiles

These features can help organizations build standard metadata management processes, which can help remove redundancy and inconsistency in data assets, and allow users to collaborate and build richer data products quickly.

In this two-part series, we discuss how to deploy DataHub on AWS using managed services with the AWS Cloud Development Kit (AWS CDK), populate technical metadata from the AWS Glue Data Catalog and Amazon Redshift into DataHub, and augment data with a business glossary and visualize data lineage of AWS Glue jobs.

In this post, we focus on the first step: deploying DataHub on AWS using managed services with the AWS CDK. This will allow organizations to launch DataHub using AWS managed services and begin the journey of metadata management.

Why DataHub?

DataHub is one of the most popular open-source metadata management platforms. It enables end-to-end discovery, data observability, and data governance. It has a rich set of features, including metadata ingestion (automated or programmatic), search and discovery, data lineage, data governance, and many more. It provides an extensible framework and supports federated data governance.

DataHub offers out-of-the-box support to ingest metadata from different sources like Amazon Redshift, the AWS Glue Data Catalog, Snowflake, and many more.

Overview of solution

The following diagram illustrates the solution architecture and its components:

DataHub runs on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, using Amazon OpenSearch Service, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon RDS for MySQL as the storage layer for the underlying data model and indexes.
The solution pulls technical metadata from AWS Glue and Amazon Redshift to DataHub.
We enrich the technical metadata with a business glossary.
Finally, we run an AWS Glue job to transform the data and observe the data lineage in DataHub.

In the following sections, we demonstrate how to deploy DataHub and provision different AWS managed services.

Prerequisites

We need kubectl, Helm, and the AWS Command Line Interface (AWS CLI) to set up DataHub in an AWS environment. We can complete all the steps either from a local desktop or using AWS Cloud9. If you’re using AWS Cloud9, follow the instructions in the next section to spin up an AWS Cloud9 environment; otherwise, skip to the next step.

Set up AWS Cloud9

To get started, you need an AWS account, preferably free from any production workloads. AWS Cloud9 is a cloud-based IDE that lets you write, run, and debug your code with just a browser. AWS Cloud9 comes preconfigured with many of the dependencies we require for this post, such as git, npm, and the AWS CDK.

Create an AWS Cloud9 environment from the AWS Management Console with an instance type of t3.small or larger. Provide the required name, and leave the remaining default values. After your environment is created, you should have access to a terminal window.

You must increase the size of the Amazon Elastic Block Store (Amazon EBS) volume attached to your AWS Cloud9 instance to at least 50 GB, because the default size (10 GB) is not enough. For instructions, refer to Resize an Amazon EBS volume used by an environment.

Set up kubectl, Helm, and the AWS CLI

This post requires the following CLI tools to be installed:

kubectl to manage the Kubernetes resources deployed to the EKS cluster
Helm to deploy the resources based on Helm charts (note that we only support Helm 3)
The AWS CLI to manage AWS resources

Complete the following steps:

Download kubectl (version 1.21.x) and make the file executable:

sudo curl --silent --location -o /usr/local/bin/kubectl https://s3.us-west-2.amazonaws.com/amazon-eks/1.21.5/2022-01-21/bin/linux/amd64/kubectl

sudo chmod +x /usr/local/bin/kubectl

To install kubectl in AWS Cloud9, use the following instructions. AWS Cloud9 normally manages AWS Identity and Access Management (IAM) credentials dynamically. This isn’t currently compatible with Amazon EKS IAM authentication, so we disable it and rely on the IAM role instead.

Download Helm (version 3.9.3):

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3

chmod 700 get_helm.sh

DESIRED_VERSION=v3.9.3 ./get_helm.sh

Install the AWS CLI (version 2.x.x) or migrate AWS CLI version 1 to version 2.

After installation, make sure aws --version is pointing to version 2, or close the terminal and create a new terminal session.

Create a service-linked role

OpenSearch Service uses IAM service-linked roles. A service-linked role is a unique type of IAM role that is linked directly to OpenSearch Service. Service-linked roles are predefined by OpenSearch Service and include all the permissions that the service requires to call other AWS services on your behalf. To create a service-linked role for OpenSearch Service, issue the following command:

aws iam create-service-linked-role --aws-service-name es.amazonaws.com

Install the AWS CDK Toolkit v2

Install AWS CDK v2 with the following code:

npm install -g aws-cdk@latest

In case of any error, use the following code:

npm install -g aws-cdk@latest --force

Provision different AWS managed services

In this section, we walk through the steps to provision different AWS managed services.

Clone the GitHub repository

Clone the GitHub repo with the following code:

git clone https://github.com/aws-samples/deploy-datahub-using-aws-managed-services-ingest-metadata.git

cd deploy-datahub-using-aws-managed-services-ingest-metadata

Initialize the AWS CDK stack

To initialize the AWS CDK stack, change the ACCOUNT_ID and REGION values in the cdk.json file.

Then run the following code, providing your account ID and Region:

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
# Execute the below command once per account, if you have never executed this before
cdk bootstrap aws://<account_id>/<aws_region>
# Synthesize CloudFormation
cdk synth

Deploy the AWS CDK stack

Deploy the AWS CDK stack with the following code:

# To keep confirmation prompts, remove --require-approval never 
cdk deploy --all --require-approval never

Now that the deployment is complete, we need to assemble all the credentials and hostnames for different components.

Check AWS CloudFormation output

We created different AWS CloudFormation stacks when we ran the AWS CDK stack. We need the values from the stack outputs to use in the next steps.

On the AWS CloudFormation console, navigate to the EKS stack.
Copy the following command from the Outputs tab (key: eksclusterConfigCommandXXX), and then run it:

aws eks update-kubeconfig --region <region-code> --name <cluster-name> --role-arn <role_arn>

Similarly, navigate to the ElasticSearch stack and get the following keys:

MasterPW <pwd>
MasterUser opensearch

The AWS CDK stack also created an AWS Secrets Manager secret.

On the Secrets Manager console, navigate to the secret with the name MySqlInstanceDataHubSecret****.
In the Secret value section, choose Retrieve secret value to get the following:

password <pwd>
dbname db1
engine mysql
port 3306
dbInstanceIdentifier <identifier-name>
host <host>
username admin

On the OpenSearch Service console, get the domain endpoint for the cluster opensearch-domain-datahub, which is in the following format:

vpc-opensearch-domain-datahub-<id>.<region>.es.amazonaws.com

On the Amazon MSK console, navigate to your cluster (MSK-DataHub).
Choose View client information and copy both the plaintext Kafka bootstrap server and Apache ZooKeeper connection strings, which are in the following format:

#MSK Bootstrap servers(Plaintext)
b-1.mskdatahub.<msk>.c5.kafka.<region>.amazonaws.com:9092,b-2.mskdatahub.<msk>.c5.kafka.<region>.amazonaws.com:9092
#Apache ZooKeeper connection(Plaintext)
z-1.mskdatahub.<zk>.c5.kafka.<region>.amazonaws.com:2181,z-2.mskdatahub.<zk>.c5.kafka.<region>.amazonaws.com:2181,z-3.mskdatahub.<zk>.c5.kafka.<region>.amazonaws.com:2181

Install DataHub containers to the provisioned EKS cluster

To install the DataHub containers, complete the following steps:

Create Kubernetes secrets using the following kubectl commands, using the MySQL and OpenSearch Service passwords that we collected earlier:

kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=<mysql-pwd-copied-from-previous-step>

kubectl create secret generic elasticsearch-secrets --from-literal=elasticsearch-password=<opensearch-pwd-copied-from-previous-step>
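
For reference, kubectl create secret generic with --from-literal stores each value base64-encoded under the data field of a Secret object. The following sketch builds an equivalent manifest in Python so you can see what ends up in the cluster. The secret_manifest helper and the sample password are hypothetical.

```python
import base64
import json


def secret_manifest(name: str, literals: dict) -> dict:
    """Build a Kubernetes Secret manifest with base64-encoded values."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in literals.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


manifest = secret_manifest(
    "mysql-secrets",
    {"mysql-root-password": "example-password"},  # placeholder value
)
print(json.dumps(manifest, indent=2))
```

Base64 is an encoding, not encryption, so anyone who can read the Secret can recover the password.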

Add the DataHub Helm repo by running the following Helm command:

helm repo add datahub https://helm.datahubproject.io/

Modify the following config files and replace the value of the MSK broker, MySQL hostname, and OpenSearch Service domain:
1. Edit the values for values.yaml (in the charts/datahub folder on GitHub):

kafka->bootstrap->server with kafka bootstrap server
kafka->zookeeper->server with zookeeper details
elasticsearch->host with ES domain name
sql->datasource->host with MySQL host name
sql->datasource->hostforMySqlClient with MySQL host name
sql->datasource->url with MySQL host name

2. Edit the values for values.yaml (in the charts/prerequisites folder on GitHub):

kafka->bootstrap->server with kafka bootstrap server

Now you can deploy the following two Helm charts to spin up the DataHub front end and backend components to the EKS cluster:

helm install prerequisites datahub/datahub-prerequisites --values ./charts/prerequisites/values.yaml --version 0.0.10

helm install datahub datahub/datahub --values ./charts/datahub/values.yaml --version 0.2.108

If you want to use a newer Helm chart, replace the following chart values from your existing values.yaml:

elasticsearchSetupJob
global: graph_service_impl
global: elasticsearch
global: kafka
global: sql

If the installation fails, debug with the following commands to check the status of the different pods:

#Confirm kubectl points to the EKS cluster:
kubectl config current-context

#Get Status of Pods
kubectl get pods

# If any service has error from above command, then execute below command for the error service.
kubectl logs -f <error-pod-name>

After you identify the issue from the log and fix it manually, set up DataHub with the following Helm upgrade command:

helm upgrade --install datahub datahub/datahub --values ./charts/datahub/values.yaml --version 0.2.108

After the DataHub setup is successful, run the following command to get DataHub’s front end URL that uses port 9002:

kubectl get svc

Access the DataHub URL http://<id>.<region>.elb.amazonaws.com:9002/ in a browser over HTTP, and log in using datahub as both the default user name and password.

Note that this isn’t recommended for production deployment. We strongly recommend changing the default user name and password or configuring single sign-on (SSO) via OpenID Connect. For more information, refer to Adding Users to DataHub. Additionally, expose the endpoint by setting up an ingress controller with a custom domain name. Follow the instructions in AWS setup guide to meet your networking requirements.

Clean up

The clean-up instructions are provided in Part 2 of this series.

Conclusion

In this post, we demonstrated how to deploy DataHub using AWS managed services. Part 2 of this series will focus on search and discovery of data assets stored in your data lake (via the AWS Glue Data Catalog) and data warehouse in Amazon Redshift.

How a blockchain startup built a prototype solution to solve the need of analytics for decentralized applications with AWS Data Lab

2022-10-24 Dr. Quan Hoang Nguyen

Post Syndicated from Dr. Quan Hoang Nguyen original https://aws.amazon.com/blogs/big-data/how-a-blockchain-startup-built-a-prototype-solution-to-solve-the-need-of-analytics-for-decentralized-applications-with-aws-data-lab/

This post is co-written with Dr. Quan Hoang Nguyen, CTO at Fantom Foundation.

Here at Fantom Foundation (Fantom), we have developed a high performance, highly scalable, and secure smart contract platform. It’s designed to overcome limitations of the previous generation of blockchain platforms. The Fantom platform is permissionless, decentralized, and open source. The majority of decentralized applications (dApps) hosted on the Fantom platform lack an analytics page that provides information to the users. Therefore, we would like to build a data platform that supports a web interface that will be made public. This will allow users to search for a smart contract address. The application then displays key metrics for that smart contract. Such an analytics platform can give insights and trends for applications deployed on the platform to the users, while the developers can continue to focus on improving their dApps.

AWS Data Lab offers accelerated, joint-engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives. Data Lab has three offerings: the Build Lab, the Design Lab, and a Resident Architect. The Build Lab is a 2–5 day intensive build with a technical customer team. The Design Lab is a half-day to 2-day engagement for customers who need a real-world architecture recommendation based on AWS expertise, but aren’t yet ready to build. Both engagements are hosted either online or at an in-person AWS Data Lab hub. The Resident Architect provides AWS customers with technical and strategic guidance in refining, implementing, and accelerating their data strategy and solutions over a 6-month engagement.

In this post, we share the experience of our engagement with AWS Data Lab to accelerate the initiative of developing a data pipeline from an idea to a solution. Over 4 weeks, we conducted technical design sessions, reviewed architecture options, and built the proof of concept data pipeline.

Use case review

The process started with us engaging with our AWS Account team to submit a nomination for the Data Lab. This was followed by a call with the AWS Data Lab team to assess the suitability of requirements against the program. After the Build Lab was scheduled, an AWS Data Lab Architect engaged with us to conduct a series of pre-lab calls to finalize the scope, architecture, goals, and success criteria for the lab. The scope was to design a data pipeline that would ingest and store historical and real-time on-chain transactions data, and build a data pipeline to generate key metrics. Once ingested, data should be transformed, stored, and exposed via REST-based APIs and consumed by a web UI to display key metrics. For this Build Lab, we chose to ingest data for Spooky, which is a decentralized exchange (DEX) deployed on the Fantom platform and had the largest Total Value Locked (TVL) at that time. Key metrics such as the number of wallets that have interacted with the dApp over time, the number of tokens and their value exchanged for the dApp over time, and the number of transactions for the dApp over time were selected to visualize through a web-based UI.

We explored several architecture options and picked one for the lab that aligned closely with our end goal. The total historical data for the selected smart contract was approximately 1 GB since deployment of the dApp on the Fantom platform. We used FTMScan, which allows us to explore and search on the Fantom platform for transactions, to estimate the rate of transfer transactions to be approximately three to four per minute. This allowed us to design an architecture for the lab that can handle this data ingestion rate. We agreed to use an existing application known as the data producer that was developed internally by the Fantom team to ingest on-chain transactions in real time. On checking transactions’ payload size, it was found to not exceed 100 kb for each transaction, which gave us a measure of the number of files that would be created once ingested through the data producer application. A decision was made to ingest the past 45 days of historic transactions to populate the platform with enough data to visualize key metrics. Because the feature of backdating exists within the data producer application, we agreed to use that. The Data Lab Architect also advised us to consider using AWS Database Migration Service (AWS DMS) to ingest historic transactions data post lab. As a last step, we decided to build a React-based webpage with Material-UI that allows users to enter a smart contract address and choose the time interval, and the app fetches the necessary data to show the metrics value.
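
As a rough sizing check, the estimated rate and the 45-day backfill window translate into a transaction count as follows. This sketch assumes the rate of three to four transfers per minute stays constant over the whole window, which is a simplification for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def transactions_in_window(rate_per_minute: float, days: int) -> int:
    """Estimate the number of transactions at a constant per-minute rate."""
    return round(rate_per_minute * MINUTES_PER_DAY * days)


low = transactions_in_window(3, 45)
high = transactions_in_window(4, 45)
print(low, high)
# 194400 259200
```

At that volume, the backfill is on the order of a few hundred thousand records.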

Solution overview

We collectively agreed to incorporate the following design principles for the data lab architecture:

Simplified data pipelines
Decentralized data architecture
Minimize latency as much as possible

The following diagram illustrates the architecture that we built in the lab.

We collectively defined the following success criteria for the Build Lab:

End-to-end data streaming pipeline to ingest on-chain transactions
Historical data ingestion of the selected smart contract
Data storage and processing of on-chain transactions
REST-based APIs to provide time-based metrics for the three defined use cases
A sample web UI to display aggregated metrics for the smart contract

Prior to the Build Lab

As a prerequisite for the lab, we configured the data producer application to use the AWS Software Development Kit (AWS SDK) and the PutRecords API operation to send transactions data to an Amazon Simple Storage Service (Amazon S3) bucket. For the Build Lab, we built additional logic within the application to ingest historic transactions data together with real-time transactions data. As a last step, we verified that transactions data was captured and ingested into a test S3 bucket.

AWS services used in the lab

We used the following AWS services as part of the lab:

AWS Identity and Access Management (IAM) – We created multiple IAM roles with appropriate trust relationships and necessary permissions that can be used by multiple services to read and write on-chain transactions data and generated logs.
Amazon S3 – We created an S3 bucket to store the incoming transactions data as JSON-based files. We created a separate S3 bucket to store incoming transaction data that failed to be transformed and will be reprocessed later.
Amazon Kinesis Data Streams – We created a new Kinesis data stream in on-demand mode, which automatically scales based on data ingestion patterns and provides hands-free capacity management. This stream was used by the data producer application to ingest historical and real-time on-chain transactions. We discussed having the ability to manage and predict cost, and therefore were advised to use the provisioned mode when reliable estimates were available for throughput requirements. We were also advised to continue to use on-demand mode while the data traffic patterns remained unpredictable.
Amazon Kinesis Data Firehose – We created a Firehose delivery stream to transform the incoming data and write it to the S3 bucket. To minimize latency, we set the delivery stream buffer size to 1 MiB and buffer interval to 60 seconds. This ensures that a file is written to the S3 bucket as soon as either of the two conditions is satisfied, whichever occurs first. Transactions data written to the S3 bucket was in JSON Lines format.
Amazon Simple Queue Service (Amazon SQS) – We set up an SQS queue of the type Standard and an access policy for that SQS queue to allow incoming messages generated from S3 bucket event notifications.
Amazon DynamoDB – In order to pick a data store for on-chain transactions, we needed a service that can store transactions payload of unstructured data with varying schemas, provides the ability to cache query results, and is a managed service. We picked DynamoDB for those reasons. We created a single DynamoDB table that holds the incoming transactions data. After analyzing the access query patterns, we decided to use the address field of the smart contract as the partition key and the timestamp field as the sort key. The table was created with auto scaling of read and write capacity modes because the actual usage requirements would be hard to predict at that time.
AWS Lambda – We created the following functions:
- A Python-based Lambda function to perform transformations on the incoming data from the data producer application to flatten the JSON structure, convert the Unix-based epoch timestamp to a date/time value, and convert hex-based string values to a decimal value representing the number of tokens.
- A second Lambda function to parse incoming SQS queue messages. Each message contained values for bucket_name and object_key, which hold the reference to a newly created object within the S3 bucket. The Lambda function logic parsed these values to obtain the reference to the S3 object, got the contents of the object, read them into a Pandas data frame object using the AWS SDK for pandas (awswrangler) library, and used the put_df API call to write the data frame as items into a DynamoDB table. We chose to use Pandas due to familiarity with the library and the functions required to perform data transform operations.
- Three separate Lambda functions that contain the logic to query the DynamoDB table and retrieve items to aggregate and calculate metrics values. Each calculated metrics value was formatted as an HTTP response within the Lambda function so that it could be exposed through REST-based APIs.
Amazon API Gateway – We created a REST-based API endpoint that uses Lambda proxy integration to pass a smart contract address and a time-based interval in minutes as query string parameters to the backend Lambda function. The response from the Lambda function was a metrics value. We also enabled cross-origin resource sharing (CORS) support within API Gateway so that the web UI, which resides in a different domain, could query the API successfully.
Amazon CloudWatch – We used the built-in Lambda mechanism to send function metrics to CloudWatch. Lambda functions come with a CloudWatch Logs log group and a log stream for each instance of your function. The Lambda runtime environment sends details of each invocation to the log stream, and relays logs and other output from your function’s code.
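
The three transformations described for the first Lambda function (flattening the JSON structure, converting the epoch timestamp, and converting hex strings to decimal values) can be sketched as follows. This is a minimal illustration and not the code built in the lab: the field names (address, timestamp, data, value) and the sample payload are assumptions made up for the example.

```python
from datetime import datetime, timezone


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


def transform(record):
    """Flatten a transaction, then normalise its timestamp and token amount."""
    flat = flatten(record)
    # Unix epoch seconds -> ISO 8601 date/time string in UTC.
    flat["timestamp"] = datetime.fromtimestamp(
        int(flat["timestamp"]), tz=timezone.utc
    ).isoformat()
    # Hex string such as "0xff" -> decimal integer (hypothetical field name).
    flat["data_value"] = int(flat["data_value"], 16)
    return flat


# Hypothetical on-chain transaction payload, for illustration only.
sample = {
    "address": "0xabc",
    "timestamp": 1660000000,
    "data": {"value": "0xff", "from": "0x01"},
}

result = transform(sample)
print(result)
```

Keeping the flattening step separate from the field conversions makes it easier to adapt the sketch when the payload schema varies between contracts.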

Iterative development approach

Across the 4 days of the Build Lab, we undertook iterative development. We started by developing the foundational layer and iteratively added extra features through testing and data validation. This allowed us to develop confidence in the solution being built as we tested the output of the metrics through a web-based UI and verified it against the actual data. As errors were discovered, we deleted the entire dataset and reran all the jobs to verify results and resolve those errors.

Lab outcomes

In 4 days, we built an end-to-end streaming pipeline ingesting 45 days of historical data and real-time on-chain transactions data for the selected Spooky smart contract. We also developed three REST-based APIs for the selected metrics and a sample web UI that allows users to insert a smart contract address, choose a time frequency, and visualize the metrics values. In a follow-up call, our AWS Data Lab Architect shared post-lab guidance around the next steps required to productionize the solution:

Scaling of the proof of concept to handle larger data volumes
Security best practices to protect the data while at rest and in transit
Best practices for data modeling and storage
Building an automated resilience technique to handle failed processing of the transactions data
Incorporating high availability and disaster recovery solutions to handle incoming data requests, including adding of the caching layer

Conclusion

Through a short engagement and a small team, we accelerated this project from an idea to a solution. This experience gave us the opportunity to explore AWS services and their analytical capabilities in depth. As a next step, we will continue to work with AWS teams to enhance the solution built during this lab and make it ready for production deployment.

Learn more about how the AWS Data Lab can help your data and analytics on the cloud journey.

About the Authors

Dr. Quan Hoang Nguyen is currently a CTO at Fantom Foundation. His interests include DLT, blockchain technologies, visual analytics, compiler optimization, and transactional memory. He has experience in R&D at the University of Sydney, IBM, Capital Markets CRC, Smarts – NASDAQ, and National ICT Australia (NICTA).

Ankit Patira is a Data Lab Architect at AWS based in Melbourne, Australia.

New AWS whitepaper: Using AWS in the Context of Canada’s Controlled Goods Program (CGP)

2022-10-18 Michael Davie

Post Syndicated from Michael Davie original https://aws.amazon.com/blogs/security/new-aws-whitepaper-using-aws-in-the-context-of-canadas-controlled-goods-program-cgp/

Amazon Web Services (AWS) has released a new whitepaper to help Canadian defense and security customers accelerate their use of the AWS Cloud.

The new guide, Using AWS in the Context of Canada’s Controlled Goods Program (CGP), continues our efforts to help AWS customers navigate the regulatory expectations of the Government of Canada’s Controlled Goods Program in a shared responsibility environment.

This whitepaper is intended for customers that are looking to store and process controlled goods information in the AWS Cloud, and is particularly useful for leadership, security, risk, and compliance teams that need to understand CGP requirements and guidance.

The whitepaper summarizes CGP requirements and guidance related to the protection of controlled goods information, and gives CGP-regulated customers information they can use to commence their due diligence and assess how to implement the appropriate programs for their use of AWS Cloud services.

This document is our first that is specific to Canadian regulatory requirements and joins other guides related to specific regulatory regimes around the world. As the regulatory environment continues to evolve, we’ll provide further updates on the AWS Security Blog and the AWS Compliance page. You can find more information on cloud-related regulatory compliance at the AWS Compliance Center. You can also reach out to your AWS account manager for help finding the resources you need.

If you have feedback about this blog post, submit comments in the Comments section below. You can also start a new thread on re:Post to get answers from the community.

Want more AWS Security news? Follow us on Twitter.

Build incremental crawls of data lakes with existing Glue catalog tables

2022-10-14 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/build-incremental-crawls-of-data-lakes-with-existing-glue-catalog-tables/

AWS Glue includes crawlers, a capability that makes discovering datasets simpler by scanning data in Amazon Simple Storage Service (Amazon S3) and relational databases, extracting their schema, and automatically populating the AWS Glue Data Catalog, which keeps the metadata current. This reduces the time to insight by making newly ingested data quickly available for analysis with your preferred analytics and machine learning (ML) tools.

Previously, you could reduce crawler cost by using Amazon S3 Event Notifications to incrementally crawl changes on Data Catalog tables created by a crawler. Today, we’re extending this support to crawling and updating Data Catalog tables that are created by non-crawler methods, such as data pipelines. This crawler feature can be useful for several use cases, such as the following:

You currently have a data pipeline to create AWS Glue Data Catalog tables and want to offload detection of partition information from the data pipeline to a scheduled crawler
You have an S3 bucket with event notifications enabled and want to continuously catalog new changes and prevent creation of new tables in case of ill-formatted files that break the partition detection
You have manually created Data Catalog tables and want to run incremental crawls on new file additions instead of running full crawls due to long crawl times

To accomplish incremental crawling, you can configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (Amazon SQS) queue. You can then use the SQS queue as a source to identify changes and can schedule or run an AWS Glue crawler with Data Catalog tables as a target. With each run of the crawler, the SQS queue is inspected for new events. If no new events are found, the crawler stops. If events are found in the queue, the crawler inspects their respective folders, processes through built-in classifiers (for CSV, JSON, AVRO, XML, and so on), and determines the changes. The crawler then updates the Data Catalog with new information, such as newly added or deleted partitions or columns. This feature reduces the cost and time to crawl large and frequently changing Amazon S3 data.
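
To make the event-driven flow more concrete, the following sketch shows how the folders that need recrawling could be derived from the messages in the queue. It is an illustration only and not the crawler's internal implementation: the function name changed_prefixes and the bucket name are hypothetical, and the code assumes the standard S3 event notification JSON, optionally wrapped in an SNS envelope.

```python
import json
from urllib.parse import unquote_plus


def changed_prefixes(sqs_bodies):
    """Return the sorted s3://bucket/folder/ prefixes named in the events."""
    prefixes = set()
    for body in sqs_bodies:
        payload = json.loads(body)
        # When S3 publishes through SNS, the S3 event is a JSON string
        # stored under the "Message" key of the SNS envelope.
        if "Message" in payload:
            payload = json.loads(payload["Message"])
        for record in payload.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys in S3 event notifications are URL-encoded.
            key = unquote_plus(record["s3"]["object"]["key"])
            folder = key.rsplit("/", 1)[0] + "/" if "/" in key else ""
            prefixes.add(f"s3://{bucket}/{folder}")
    return sorted(prefixes)


# Hypothetical event for a new file in the year=2020 partition folder.
s3_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "year%3D2020/Parking_Tags_Data_2020.000.csv"},
            },
        }
    ]
}
sns_wrapped = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})

folders = changed_prefixes([sns_wrapped, json.dumps(s3_event)])
print(folders)
```

Because the prefixes are collected in a set, several events for the same folder result in that folder being listed only once.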

This post shows how to create an AWS Glue crawler that supports Amazon S3 event notification on existing Data Catalog tables using the new crawler UI and an AWS CloudFormation template.

Overview of solution

To demonstrate how the new AWS Glue crawler performs incremental updates, we use the Toronto parking tickets dataset, specifically data about parking tickets issued in the city of Toronto in 2019 and 2020. The goal is to create a manual dataset as well as its associated metadata tables in AWS Glue, followed by an event-based crawler that detects and implements changes to the manually created datasets and catalogs.

As mentioned before, instead of crawling all the subfolders on Amazon S3, we use an Amazon S3 event-based approach. This improves the crawl time by using Amazon S3 events to identify the changes between two crawls: the crawler lists all the files from the subfolder that triggered the event instead of listing the full Amazon S3 target. To accomplish this, we create an S3 bucket, an event-based crawler, an Amazon Simple Notification Service (Amazon SNS) topic, and an SQS queue. The following diagram illustrates our solution architecture.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An AWS Identity and Access Management (IAM) user with access to the following services:
- AWS CloudFormation
- AWS Glue
- Amazon SNS
- Amazon SQS
- Amazon S3

If the AWS account you use to follow this post uses Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

Launch your CloudFormation stack in us-east-1:
For Stack name, enter a name for your stack.
For paramBucketName, enter a name for your S3 bucket (with your account number).
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

Wait for the stack formation to finish provisioning the requisite resources. When you see the CREATE_COMPLETE status, you can proceed to the next steps.

Additionally, note down the ARN of the SQS queue to use at a later point.

Query your Data Catalog

Next, we use Amazon Athena to confirm that the manual tables have been created in the Data Catalog, as part of the CloudFormation template.

On the Athena console, choose Launch query editor.
For Data source, choose AwsDataCatalog.
For Database, choose torontoparking.

The tickets table should appear in the Tables section.

Now you can query the table to see its contents.
You can write your own query, or choose Preview Table on the options menu.

This writes a simple SQL query to show us the first 10 rows.
Choose Run to run the query.

As we can see in the query results, the database and table for 2019 parking ticket data have been created and partitioned.

Create the Amazon S3 event crawler

The next step is to create the crawler that detects and crawls only the incrementally updated tables.

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler.
For Name, enter a name.
Choose Next.

Now we need to select the data source for the crawler.
Select Yes to indicate that our data is already mapped to our AWS Glue Data Catalog.
Choose Add tables.
For Database, choose torontoparking and for Tables, choose tickets.
Select Crawl based on events.
For Include SQS ARN, enter the ARN you saved from the CloudFormation stack outputs.
Choose Confirm.

You should now see the table populated under Glue tables, with the parameter set as Recrawl by event.
Choose Next.
For Existing IAM role, choose the IAM role created by the CloudFormation template (GlueCrawlerTableRole).
Choose Next.
For Frequency, choose On demand.

You also have the option of choosing a schedule on which the crawler will run regularly.
Choose Next.
Review the configurations and choose Create crawler.

Now that the crawler has been created, we add the 2020 ticketing data to our S3 bucket so that we can test our new crawler. For this step, we use the AWS Command Line Interface (AWS CLI).

To add this data, use the following command:

aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui2/source/year=2020/Parking_Tags_Data_2020.000.csv s3://glue-table-crawler-blog-<YOURACCOUNTNUMBER>/year=2020/Parking_Tags_Data_2020.000.csv

After successful completion of this command, your S3 bucket should contain the 2020 ticketing data and your crawler is ready to run. The terminal should return the following:

copy: s3://aws-bigdata-blog/artifacts/gluenewcrawlerui2/source/year=2020/Parking_Tags_Data_2020.000.csv to s3://glue-table-crawler-blog-<YOURACCOUNTNUMBER>/year=2020/Parking_Tags_Data_2020.000.csv

Run the crawler and verify the updates

Now that the new folder has been created, we run the crawler to detect the changes in the table and partitions.

Navigate to your crawler on the AWS Glue console and choose Run crawler.

After running the crawler, you should see that it added the 2020 data to the tickets table.
On the Athena console, we can ensure that the Data Catalog has been updated by adding a where year = 2020 filter to the query.

AWS CLI option

You can also create the crawler using the AWS CLI. For more information, refer to create-crawler.
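
If you prefer an SDK over the AWS CLI, the request for an event-based crawler on an existing Data Catalog table might look like the following sketch. The helper name event_crawler_request, the crawler name, and the ARNs are hypothetical, and the field names reflect our reading of the AWS Glue CreateCrawler API, so check them against the API reference before relying on them. The code only builds the request; passing it to a client call such as boto3's create_crawler is left as an assumption.

```python
def event_crawler_request(name, role_arn, database, tables, queue_arn):
    """Build a CreateCrawler request for an event-based catalog crawl."""
    return {
        "Name": name,
        "Role": role_arn,
        "Targets": {
            "CatalogTargets": [
                {
                    "DatabaseName": database,
                    "Tables": list(tables),
                    # SQS queue that receives the S3 event notifications.
                    "EventQueueArn": queue_arn,
                }
            ]
        },
        # Crawl only the changes identified by the queued events.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
        # Assumption: catalog targets expect deletions to be logged only.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


# Placeholder account ID and queue name, for illustration only.
request = event_crawler_request(
    name="tickets-event-crawler",
    role_arn="arn:aws:iam::111122223333:role/GlueCrawlerTableRole",
    database="torontoparking",
    tables=["tickets"],
    queue_arn="arn:aws:sqs:us-east-1:111122223333:example-queue",
)
print(request)
```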

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, AWS Glue crawler, AWS Glue database, and AWS Glue table.

Conclusion

You can use AWS Glue crawlers to discover datasets, extract schema information, and populate the AWS Glue Data Catalog. In this post, we provided a CloudFormation template to set up AWS Glue crawlers to use Amazon S3 event notifications on existing Data Catalog tables, which reduces the time and cost needed to incrementally process table data updates in the Data Catalog.

With this feature, incremental crawling can now be offloaded from data pipelines to the scheduled AWS Glue crawler, reducing cost. This alleviates the need for full crawls, thereby reducing crawl times and Data Processing Units (DPUs) required to run the crawler. This is especially useful for customers that have S3 buckets with event notifications enabled and want to continuously catalog new changes.

To learn more about this feature, refer to Accelerating crawls using Amazon S3 event notifications.

Special thanks to everyone who contributed to this crawler feature launch: Theo Xu, Jessica Cheng, Arvin Mohanty, and Joseph Barlan.

About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs.

Aayzed Tanweer is a Solutions Architect working with startup customers in the FinTech space, with a special focus on analytics services. Originally hailing from Toronto, he recently moved to New York City, where he enjoys eating his way through the city and exploring its many peculiar nooks and crannies.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Fine-tuning Operations at Slice using AWS DevOps Guru

2022-10-12 Adnan Bilwani

Post Syndicated from Adnan Bilwani original https://aws.amazon.com/blogs/devops/fine-tuning-operations-at-slice-using-aws-devops-guru/

This guest post was authored by Sapan Jain, DevOps Engineer at Slice, and edited by Sobhan Archakam and Adnan Bilwani, at AWS.

Slice empowers over 18,000 independent pizzerias with the modern tools that have grown the major restaurant chains. By uniting these small businesses with specialized technology, marketing, data insights, and shared services, Slice enables them to serve their digitally-minded customers and move away from third-party apps. Using Amazon DevOps Guru, Slice is able to fine-tune their operations to better support these customers.

Serial tech entrepreneur Ilir Sela started Slice to modernize and support his family’s New York City pizzerias. Today, the company partners with restaurants in 3,000 cities and all 50 states, forming the nation’s largest pizza network. For more information, visit slicelife.com.

Slice’s challenge

At Slice, we manage a wide variety of systems, services, and platforms, all with varying levels of complexity. Observability, monitoring, and log aggregation are things we excel at, and they’re always critical for our platform engineering team. However, deriving insights from this data still requires some manual investigation, particularly when dealing with operational anomalies and/or misconfigurations.

To gain automated insights into our services and resources, Slice conducted a proof-of-concept utilizing Amazon DevOps Guru to analyze a small selection of AWS resources. Amazon DevOps Guru identified potential issues in our environment, resulting in actionable insights (ultimately leading to remediation). As a result of this analysis, we enabled Amazon DevOps Guru account-wide, thereby leading to numerous insights into our production environment.

Insights with Amazon DevOps Guru

After we configured Amazon DevOps Guru to begin its account-wide analysis, we left the tool alone to begin the process of collecting and analyzing data. We immediately began seeing some actionable insights for various production AWS resources, some of which are highlighted in the following section:

Amazon DynamoDB Point-in-time recovery

Amazon DynamoDB offers a point-in-time recovery (PITR) feature that provides continuous backups of your DynamoDB data for 35 days to help you protect against accidental writes or deletes. If enabled, this lets you restore your respective table to a previous state. Amazon DevOps Guru identified several tables in our environment that had PITR disabled and provided a corresponding recommendation.

Figure 1. The graphic shows proactive insights for the last 1 month. The one insight shown is ‘Dynamo Table Point in Time Recovery not enabled’ with a status of OnGoing and a severity of low.
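
As a rough illustration of the same check, the sketch below inspects a response shaped like the output of the DynamoDB DescribeContinuousBackups operation and reports whether PITR is enabled. The helper name pitr_enabled and the sample responses are hypothetical. This is not how DevOps Guru performs its analysis.

```python
def pitr_enabled(describe_response):
    """Return True if the response reports point-in-time recovery as enabled."""
    description = describe_response.get("ContinuousBackupsDescription", {})
    pitr = description.get("PointInTimeRecoveryDescription", {})
    return pitr.get("PointInTimeRecoveryStatus") == "ENABLED"


# Hypothetical responses for two tables, for illustration only.
responses = {
    "orders": {
        "ContinuousBackupsDescription": {
            "ContinuousBackupsStatus": "ENABLED",
            "PointInTimeRecoveryDescription": {
                "PointInTimeRecoveryStatus": "DISABLED"
            },
        }
    },
    "menus": {
        "ContinuousBackupsDescription": {
            "ContinuousBackupsStatus": "ENABLED",
            "PointInTimeRecoveryDescription": {
                "PointInTimeRecoveryStatus": "ENABLED"
            },
        }
    },
}

tables_without_pitr = sorted(
    name for name, response in responses.items() if not pitr_enabled(response)
)
print(tables_without_pitr)
```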

ElastiCache anomalous evictions

Amazon ElastiCache for Redis is used by a handful of our services to cache relevant application data. Amazon DevOps Guru identified that one of our instances was exhibiting anomalous behavior in its cache eviction rate. Essentially, because of memory pressure on the instance, the eviction rate of cache entries began to increase. DevOps Guru recommended revisiting the sizing of this instance and scaling it vertically or horizontally, where appropriate.

Figure 2. The graph shows the metric: count of ElastiCache evictions plotted for the time period Jul 3, 20:35 to Jul 3, 21:35 UTC. A highlighted section shows that the evictions increased to a peak of 2500 between 21:00 and 21:08. Outside of this interval the evictions are below 500

AWS Lambda anomalous errors

We manage a few AWS Lambda functions that all serve different purposes. At the beginning of a normal workday, we began to see increased error rates for a particular function, resulting in an exception being thrown. DevOps Guru was able to detect the increase in error rates and flag them as anomalous. Although retries in this case wouldn’t have solved the problem, the detection did increase our visibility into the issue (which was also corroborated by our APM platform).

Figure 3. The graph shows the metric: count of AWS/Lambda errors plotted between 11:00 and 13:30 on Jul 6. The sections between the times 11:23 and 12:15 and at 12:37 and 13:13 UTC are highlighted to show the anomalies

Conclusion

Amazon DevOps Guru integrated into our environment quickly, with no additional configuration or setup aside from a few button clicks to enable the service. After reviewing several of the proactive insights that DevOps Guru provided, we could formulate plans of action for remediation. One specific example is where DevOps Guru flagged several of our Lambda functions for not containing enough subnets. After triaging the finding, we discovered that we were lacking multi-AZ redundancy for several of those functions. As a result, we could implement a change that maximized the availability of those resources.

With the continuous analysis that DevOps Guru performs, we continue to gain new insights into the resources that we utilize and deploy in our environment. This lets us improve operationally while simultaneously maintaining production stability.

About the author:

Improve federated queries with predicate pushdown in Amazon Athena

2022-10-11 Rohit Bansal

Post Syndicated from Rohit Bansal original https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/

In modern data architectures, it’s common to store data in multiple data sources. However, organizations embracing this approach still need insights from their data and require technologies that help them break down data silos. Amazon Athena is an interactive query service that makes it easy to analyze structured, unstructured, and semi-structured data stored in Amazon Simple Storage Service (Amazon S3) in addition to relational, non-relational, object, and custom data sources through its query federation capabilities. Athena is serverless, so there’s no infrastructure to manage, and you only pay for the queries that you run.

Organizations building a modern data architecture want to query data in-place from purpose-built data stores without building complex extract, transform, and load (ETL) pipelines. Athena’s federated query feature allows organizations to achieve this and makes it easy to:

Create reports and dashboards from data stored in relational, non-relational, object, and custom data sources
Run on-demand analysis on data spread across multiple systems of record using a single tool and single SQL dialect
Join multiple data sources together to produce new input features for machine learning model training workflows

However, when querying and joining huge amounts of data from different data stores, it’s important for queries to run quickly, at low cost, and without impacting source systems. Predicate pushdown is supported by many query engines and is a technique that can drastically reduce query processing time by filtering data at the source early in the processing workflow. In this post, you’ll learn how predicate pushdown improves query performance and how you can validate when Athena applies predicate pushdown to federated queries.

Benefits of predicate pushdown

The key benefits of predicate pushdown are as follows:

Improved query runtime
Reduced network traffic between Athena and the data source
Reduced load on the remote data source
Reduced cost resulting from reduced data scans

Let’s explore a real-world scenario to understand when predicate pushdown is applied to federated queries in Athena.

Solution overview

Imagine a hypothetical ecommerce company with data stored in the following services:

Amazon Redshift – The company’s data warehouse, used for current and historical analytics
Amazon Aurora MySQL-Compatible Edition – A relational database, used for pricing-related transactions made by the company’s customers
Amazon DynamoDB – A NoSQL database, used for real-time inventory tracking and the company’s latest supplier data

Record counts for these tables are as follows.

Data Store	Table Name	Number of Records	Description
Amazon Redshift	`Catalog_Sales`	4.3 billion	Current and historical Sales data fact Table
Amazon Redshift	`Date_dim`	73,000	Date Dimension table
DynamoDB	`Part`	20,000	Realtime Parts and Inventory data
DynamoDB	`Partsupp`	80,000	Realtime Parts and supplier data
Aurora MySQL	`Supplier`	1,000	Latest Supplier transactions
Aurora MySQL	`Customer`	15,000	Latest Customer transactions

Our requirement is to query these sources individually and join the data to track pricing and supplier information and compare recent data with historical data using SQL queries with various filters applied. We’ll use Athena federated queries to query and join data from these sources to meet this requirement.

The following diagram depicts how Athena federated queries use data source connectors run as Lambda functions to query data stored in sources other than Amazon S3.

When a federated query is submitted against a data source, Athena invokes the data source connector to determine how to read the requested table and identify filter predicates in the WHERE clause of the query that can be pushed down to the source. Applicable filters are automatically pushed down by Athena and have the effect of omitting unnecessary rows early in the query processing workflow and improving overall query execution time.
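
The effect of pushing a filter down can be illustrated with a small in-memory simulation. This is a conceptual sketch only: scan_source is a hypothetical stand-in for a data source connector, the 40 generated rows are made up, and nothing here uses the Athena connector SDK.

```python
def scan_source(rows, predicate=None):
    """Stand-in for a connector: return the rows that leave the source."""
    if predicate is None:
        return list(rows)
    return [row for row in rows if predicate(row)]


def in_first_week(row):
    """Filter taken from the WHERE clause of the example queries."""
    return 2450815 <= row["cs_sold_date_sk"] <= 2450822


# Made-up source table with one row per date key.
catalog_sales = [
    {"cs_sold_date_sk": key, "cs_order_number": number}
    for number, key in enumerate(range(2450800, 2450840))
]

# Without pushdown: every row is transferred, then filtered by the engine.
transferred_all = scan_source(catalog_sales)
result_without = [row for row in transferred_all if in_first_week(row)]

# With pushdown: the source applies the filter before transferring rows.
transferred_filtered = scan_source(catalog_sales, in_first_week)
result_with = transferred_filtered

print(len(transferred_all), len(transferred_filtered))
```

Both approaches return the same result set. The difference is the number of rows that have to leave the source, which is what drives the savings in runtime, network traffic, and cost.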

Let’s explore three use cases to demonstrate predicate pushdown for our ecommerce company using each of these services.

Prerequisites

As a prerequisite, review Using Amazon Athena Federated Query to learn more about Athena federated queries and how to deploy these data source connectors.

Use case 1: Amazon Redshift

In our first scenario, we run an Athena federated query on Amazon Redshift by joining its Catalog_sales and Date_dim tables. We do this to show the number of sales orders grouped by order date. The following query gets the information required and takes approximately 14 seconds scanning approximately 43 MB of data:

SELECT "d_date" AS Order_date,
     count(1) AS Total_Orders
 FROM "lambda:redshift"."order_schema"."catalog_sales" l,
     "lambda:redshift"."order_schema"."date_dim" d
 WHERE l.cs_sold_date_sk = d_date_sk
     and cs_sold_date_sk between 2450815 and 2450822 --Date keys for first week of Jan 1998
 GROUP BY "d_date"
 order by "d_date"

Athena pushes the following filters to the source for processing:

cs_sold_date_sk between 2450815 and 2450822 for the Catalog_Sales table in Amazon Redshift.
d_date_sk between 2450815 and 2450822; because of the join l.cs_sold_date_sk=d_date_sk in the query, the Date_dim table is also filtered at the source, and only filtered data is moved from Amazon Redshift to Athena.

Let’s analyze the query plan by using the recently released visual explain tool to confirm the filter predicates are pushed to the data source:

As shown above (only displaying the relevant part of the visual explain plan), because of the predicate pushdown, the Catalog_sales and Date_dim tables have filters applied at the source. Athena processes only the resulting filtered data.

Using the Athena console, we can see query processing details using the recently released query stats to interactively explore processing details with predicate pushdown at the query stage:

Displaying only the relevant query processing stages, the Catalog_sales table has approximately 4.3 billion records and Date_dim has approximately 73,000 records in Amazon Redshift. Only 11 million records from Catalog_sales (Stage 4) and 8 records from Date_dim (Stage 5) are passed from the source to Athena, because the predicate pushdown pushes query filter conditions to the data sources. This filters out unneeded records at the source, and only brings the required rows to Athena.

Using predicate pushdown resulted in scanning 99.75% less data from Catalog_sales and 99.99% less data from Date_dim. This results in a faster query runtime and lower cost.
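
The reduction figures follow from simple arithmetic on the record counts. The sketch below recomputes them from the rounded counts quoted in this post, so the results are close to, but not exactly the same as, the percentages above. The helper name reduction_pct is hypothetical.

```python
def reduction_pct(total_rows, transferred_rows):
    """Percentage of rows that did not have to leave the source."""
    return round(100 * (1 - transferred_rows / total_rows), 2)


# Rounded record counts quoted in this post.
catalog_sales_reduction = reduction_pct(4_300_000_000, 11_000_000)
date_dim_reduction = reduction_pct(73_000, 8)

print(catalog_sales_reduction, date_dim_reduction)
```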

Use case 2: Amazon Redshift and Aurora MySQL

In our second use case, we run an Athena federated query on Aurora MySQL and Amazon Redshift data stores. This query joins the Catalog_sales and Date_dim tables in Amazon Redshift with the Customer table in the Aurora MySQL database to get the total number of orders with the total amount spent by each customer for the first week in January 1998 for the market segment of AUTOMOBILE. The following query gets the information required and takes approximately 35 seconds scanning approximately 337 MB of data:

SELECT  cs_bill_customer_sk Customer_id ,"d_date" Order_Date 
 ,count("cs_order_number") Total_Orders ,sum(l.cs_net_paid_inc_ship_tax) AS Total_Amount
 FROM "lambda:mysql".sales.customer c,"lambda:redshift"."order_schema"."catalog_sales" l
 ,"lambda:redshift"."order_schema"."date_dim" d
 WHERE c_mktsegment = 'AUTOMOBILE'
 AND c_custkey = cs_bill_customer_sk
 AND l.cs_sold_date_sk=d_date_sk 
 AND cs_sold_date_sk between 2450815 and 2450822 --Date keys for first week of Jan 1998
 GROUP BY cs_bill_customer_sk,"d_date"  
 ORDER BY cs_bill_customer_sk,"d_date"

Athena pushes the following filters to the data sources for processing:

cs_sold_date_sk between 2450815 and 2450822 for the Catalog_Sales table in Amazon Redshift.
d_date_sk between 2450815 and 2450822; because of the join l.cs_sold_date_sk=d_date_sk in the query, the Date_dim table is also filtered at the source (Amazon Redshift) and only filtered data is moved from Amazon Redshift to Athena.
c_mktsegment = 'AUTOMOBILE' for the Customer table in the Aurora MySQL database.

Now let’s consult the visual explain plan for this query to show the predicate pushdown to the source for processing:

As shown above (only displaying the relevant part of the visual explain plan), because of the predicate pushdown, Catalog_sales and Date_dim have the query filter applied at the source (Amazon Redshift), and the customer table has the market segment AUTOMOBILE filter applied at the source (Aurora MySQL). This brings only the filtered data to Athena.

As before, we can use the recently released query stats to interactively explore query processing details with predicate pushdown at the query stage:

Displaying only the relevant query processing stages, Catalog_sales has 4.3 billion records, Date_Dim has 73,000 records in Amazon Redshift, and Customer has 15,000 records in Aurora MySQL. Only 11 million records from Catalog_sales (Stage 6), 8 records from Date_dim (Stage 7), and 3,000 records from Customer (Stage 5) are passed from the respective sources to Athena because the predicate pushdown pushes query filter conditions to the data sources. This filters out unneeded records at the source and only brings the required rows to Athena.

Here, predicate pushdown resulted in scanning 99.75% less data from Catalog_sales, 99.99% less data from Date_dim, and 79.91% less data from Customer. Furthermore, this results in a faster query runtime and reduced cost.
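The reduction figures can be approximated from the record counts quoted earlier. This is a rough sketch using those rounded counts, so the results differ slightly from the percentages in the post, which presumably come from exact counts; the function name is hypothetical.

```python
def reduction_pct(rows_at_source: int, rows_to_athena: int) -> float:
    """Percentage of rows that predicate pushdown kept out of Athena."""
    return (1 - rows_to_athena / rows_at_source) * 100


# (rows at the source, rows passed to Athena), rounded as quoted in the post
tables = {
    "Catalog_sales": (4_300_000_000, 11_000_000),
    "Date_dim": (73_000, 8),
    "Customer": (15_000, 3_000),
}

reductions = {name: reduction_pct(*counts) for name, counts in tables.items()}

for name, pct in reductions.items():
    print(f"{name}: about {pct:.2f}% fewer rows moved to Athena")
```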

Use case 3: Amazon Redshift, Aurora MySQL, and DynamoDB

For our third use case, we run an Athena federated query on Aurora MySQL, Amazon Redshift, and DynamoDB data stores. This query joins the Part and Partsupp tables in DynamoDB, the Catalog_sales and Date_dim tables in Amazon Redshift, and the Supplier and Customer tables in Aurora MySQL to get the quantities available at each supplier for orders with the highest revenue during the first week of January 1998 for the market segment of AUTOMOBILE and parts manufactured by Manufacturer#1.

The following query gets the information required and takes approximately 33 seconds scanning approximately 428 MB of data in Athena:

SELECT "d_date" Order_Date 
     ,c_mktsegment
     ,"cs_order_number"
     ,l.cs_item_sk Part_Key
     ,p.p_name Part_Name
     ,s.s_name Supplier_Name
     ,ps.ps_availqty Supplier_Avail_Qty
     ,l.cs_quantity Order_Qty
     ,l.cs_net_paid_inc_ship_tax Order_Total
 FROM "lambda:dynamo".default.part p, 
     "lambda:mysql".sales.supplier s, 
     "lambda:redshift"."order_schema"."catalog_sales" l, 
     "lambda:dynamo".default.partsupp ps, 
     "lambda:mysql".sales.customer c,
     "lambda:redshift"."order_schema"."date_dim" d
 WHERE 
     c_custkey = cs_bill_customer_sk
     AND l.cs_sold_date_sk=d_date_sk 
     AND c.c_mktsegment = 'AUTOMOBILE'
     AND cs_sold_date_sk between 2450815 and 2450822 --Date keys for first week of Jan 1998
     AND p.p_partkey=ps.ps_partkey
     AND s.s_suppkey=ps.ps_suppkey
     AND p.p_partkey=l.cs_item_sk
     AND p.p_mfgr='Manufacturer#1'

Athena pushes the following filters to the data sources for processing:

cs_sold_date_sk between 2450815 and 2450822 for the Catalog_Sales table in Amazon Redshift.
d_date_sk between 2450815 and 2450822; because of the join l.cs_sold_date_sk=d_date_sk in the query, the Date_dim table is also filtered at the source and only filtered data is moved from Amazon Redshift to Athena.
c_mktsegment = 'AUTOMOBILE' for the Customer table in the Aurora MySQL database.
p.p_mfgr='Manufacturer#1' for the Part table in DynamoDB.

Now let’s run the explain plan for this query to confirm predicates are pushed down to the source for processing:

As shown above (displaying only the relevant part of the plan), because of the predicate pushdown, Catalog_sales and Date_dim have the query filter applied at the source (Amazon Redshift), the Customer table has the market segment AUTOMOBILE filter applied at the source (Aurora MySQL), and the Part table has the part manufactured by Manufacturer#1 filter applied at the source (DynamoDB).

We can use the recently released query stats to interactively explore query processing details with predicate pushdown at the query stage:

Displaying only the relevant processing stages, Catalog_sales has 4.3 billion records, Date_Dim has 73,000 records in Amazon Redshift, Customer has 15,000 records in Aurora MySQL, and Part has 20,000 records in DynamoDB. Only 11 million records from Catalog_sales (Stage 5), 8 records from Date_dim (Stage 9), 3,000 records from Customer (Stage 8), and 4,000 records from Part (Stage 4) are passed from their respective sources to Athena, because the predicate pushdown pushes query filter conditions to the data sources. This filters out unneeded records at the source, and only brings the required rows from the sources to Athena.

Considerations for predicate pushdown

When using Athena to query your data sources, consider the following:

Depending on the data source, data source connector, and query complexity, Athena can push filter predicates to the source for processing. The following are some of the sources Athena supports predicate pushdown with:
- HBase
- Amazon DocumentDB
- DynamoDB
- JDBC (MySQL, Postgres, and Amazon Redshift)
- Amazon CloudWatch Logs
- Amazon CloudWatch Metrics
Athena also performs predicate pushdown on data stored in an S3 data lake. And, with predicate pushdown for supported sources, you can join all your data sources in one query and achieve fast query performance.
You can use the recently released query stats as well as EXPLAIN and EXPLAIN ANALYZE on your queries to confirm predicates are pushed down to the source.
Queries may not have predicates pushed to the source if the query’s WHERE clause uses Athena-specific functions (for example, WHERE log2(col)<10).
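To check a query's plan, you prefix it with EXPLAIN or EXPLAIN ANALYZE before submitting it to Athena. The helper below is a small sketch that only builds the statement text; the function name is hypothetical, the sample query is a simplified variant of the ones in this post, and submitting the statement (through the console or an API) is left out.

```python
def explain_statement(sql: str, analyze: bool = False) -> str:
    """Wrap a query in EXPLAIN, or EXPLAIN ANALYZE when analyze is True."""
    prefix = "EXPLAIN ANALYZE" if analyze else "EXPLAIN"
    # Drop surrounding whitespace and a trailing semicolon before prefixing.
    return f"{prefix} {sql.strip().rstrip(';')}"


query = """
SELECT c_custkey
FROM "lambda:mysql".sales.customer
WHERE c_mktsegment = 'AUTOMOBILE';
"""

plan_only = explain_statement(query)
plan_with_stats = explain_statement(query, analyze=True)
print(plan_only)
```

Keep in mind that EXPLAIN ANALYZE runs the query in order to report runtime statistics, whereas EXPLAIN only shows the plan.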

Conclusion

In this post, we demonstrated three federated query scenarios on Aurora MySQL, Amazon Redshift, and DynamoDB to show how predicate pushdown improves federated query performance and reduces cost and how you can validate when predicate pushdown occurs. If the federated data source supports parallel scans, then predicate pushdown makes it possible to achieve performance that is close to the performance of Athena queries on data stored in Amazon S3. You can utilize the patterns and recommendations outlined in this post when querying supported data sources to improve overall query performance and minimize data scanned.

About the authors

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He has nearly two decades of experience helping customers modernize their data platforms. He is passionate about helping customers build scalable, cost-effective data and analytics solutions in the cloud. In his spare time, he enjoys spending time with his family, travel, and road cycling.

Ruchir Tripathi is a Senior Analytics Solutions Architect aligned to Global Financial Services at AWS. He is passionate about helping enterprises build scalable, performant, and cost-effective solutions in the cloud. Prior to joining AWS, Ruchir worked with major financial institutions. He is based out of the New York office.

Land data from databases to a data lake at scale using AWS Glue blueprints

2022-10-11 Moustafa Mahmoud

Post Syndicated from Moustafa Mahmoud original https://aws.amazon.com/blogs/big-data/land-data-from-databases-to-a-data-lake-at-scale-using-aws-glue-blueprints/

To build a data lake on AWS, a common data ingestion pattern is to use AWS Glue jobs to extract, transform, and load (ETL) data from relational databases to Amazon Simple Storage Service (Amazon S3). A project often involves extracting hundreds of tables from source databases to the data lake raw layer. For each source table, it’s recommended to have a separate AWS Glue job to simplify operations, state management, and error handling. This approach works well with a small number of tables. However, with hundreds of tables, this results in hundreds of ETL jobs, and managing AWS Glue jobs at this scale may pose an operational challenge if you’re not yet ready to deploy using a CI/CD pipeline. Instead, we tackle this issue by decoupling the following:

ETL job logic – We use an AWS Glue blueprint, which allows you to reuse one blueprint for all jobs with the same logic
Job definition – We use a JSON file, so you can define jobs programmatically without learning a new language
Job deployment – With AWS Step Functions, you can copy workflows to manage different data processing use cases on AWS Glue

In this post, you will learn how to handle data lake landing jobs deployment in a standardized way—by maintaining a JSON file with table names and a few parameters (for example, a workflow catalog). AWS Glue workflows are created and updated after manually running the resources deployment flow in Step Functions. You can further customize the AWS Glue blueprints to make your own multi-step data pipelines to move data to downstream layers and purpose-built analytics services (example use cases include partitioning or importing to an Amazon DynamoDB table).

Overview of solution

The following diagram illustrates the solution architecture, which contains two major areas:

Resource deployment (components 1–2) – An AWS Step Functions workflow is run manually on demand to update or deploy the required AWS Glue resources. These AWS Glue resources will be used for landing data into the data lake
ETL job runs (components 3–6) – The AWS Glue workflows (one per source table) run on the defined schedule, and extract and land data to the data lake raw layer

The solution workflow contains the following steps:

An S3 bucket stores an AWS Glue blueprint (ZIP) and the workflow catalog (JSON file).
A Step Functions workflow orchestrates the AWS Glue resources creation.
We use Amazon Aurora as the data source with our sample data, but any PostgreSQL database works with the provided script, or other JDBC sources with customization.
AWS Secrets Manager stores the secrets of the source databases.
On the predefined schedule, AWS Glue triggers relevant AWS Glue jobs to perform ETL.
Extracted data is loaded into an S3 bucket that serves as the data lake raw layer.

Prerequisites

To follow along with this post, complete the following prerequisite steps.

If you want to use a new database with sample data, you need two private subnets, with a Secrets Manager VPC endpoint associated to the subnets and security groups, and an Amazon S3 VPC endpoint associated to the corresponding route tables.

If you want to use your existing database either in AWS or on premises as a data source, you need network connectivity (a subnet and security group) for the AWS Glue jobs that can access the source database, Amazon S3, and Secrets Manager.

Provision resources with AWS CloudFormation

In this step, we provision our solution resources with AWS CloudFormation.

Database with sample data (optional)

This CloudFormation stack works only in AWS Regions where Amazon Aurora Serverless v1 is supported. Complete the following steps to create a database with sample data:

Choose Launch Stack.
On the Create stack page, choose Next.
For Stack name, enter demo-database.
For DBSecurityGroup, select the security group for the database (for example, default).
For DBSubnet, choose two or more private subnets to host the database.
For ETLAZ, choose the Availability Zone for ETL jobs. It must match ETLSubnet.
For ETLSubnet, choose the subnet for the jobs. It must match ETLAZ.

To find the subnet and corresponding Availability Zone, go to the Amazon Virtual Private Cloud (Amazon VPC) console and look at the columns Subnet ID and Availability Zone.

Choose Next.
On the Configure stack options page, skip the inputs and choose Next.
On the Review page, choose Create stack.
When the stack is complete, go to the Outputs tab and note the value for SecretsARN.

This CloudFormation stack creates the following resources:

An Amazon Aurora PostgreSQL-Compatible Edition (Serverless v1, engine version 11.13) database
A Secrets Manager secret (DemoDBSecret) storing the connection details to the source database
An AWS Glue network connection (rds_vpc) that can communicate with the source database and Amazon S3

Now you can populate the database with sample data. The data is generated based on the sample HR schema.

Open the Amazon RDS Query Editor.
In the Connect to database section, provide the following information:
- For Database instance, enter demo-<123456789012>.
- For Database username, connect with a Secrets Manager ARN.
- For Secrets Manager ARN, enter the ARN from the outputs of the CloudFormation stack.
- For Database name, enter hr.
Choose Connect to database.
Enter the contents of the SQL file into the editor, then choose Run.

Main stack (required)

This CloudFormation stack works in all AWS Regions.

Choose Launch Stack.
On the Create stack page, choose Next.
For Stack name, enter data-lake-landing.
For BlueprintName, enter a name for your blueprint (default: data-lake-landing).
For S3BucketNamePrefix, enter a prefix (default: data-lake-raw-layer).
Choose Next.
On the Configure stack options page, skip the inputs and choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.
When the stack is complete, go to the Outputs tab and note the names of the S3 bucket (for example, data-lake-raw-layer-123456789012-region) and Step Functions workflow (for example, data-lake-landing).

The CloudFormation stack creates the following resources:

An S3 bucket as the data lake raw layer
A Step Functions workflow (see the definition on the GitHub repo)
AWS Identity and Access Management (IAM) roles and policies for the Step Functions workflow to provision AWS Glue resources and AWS Glue job executions.

The GlueExecutionRole is limited to the DemoDBSecret in Secrets Manager. If you need to connect to other databases that have a different endpoint, address, or credentials, don’t forget to create new secrets and grant additional permissions to the IAM role or secrets so your AWS Glue jobs can authenticate with the source databases.

Prepare database connections

If you want to use this solution to perform ETL against your existing databases, follow this section. Otherwise, if you have deployed the CloudFormation stack for the database with sample data, jump to the section “Edit the workflow catalog”.

You need to have a running PostgreSQL database ready. To connect to other database engines, you need to customize this solution, particularly the jdbcUrl in the supplied PySpark script.

Create the database secret

To create your Secrets Manager secret, complete the following steps:

On the Secrets Manager console, choose Store a new secret.
For Secret type, choose Credentials for Amazon RDS database or Credentials for other database.
For Credentials, enter the user name and password for your database.
For Encryption key, keep the default AWS Key Management Service (AWS KMS) managed key aws/secretsmanager.
For Database, choose the database instance, or manually input the engine, server address, database name, and port.
Choose Next.
For Secret name, enter a name for your secret (for example, rds-secrets).
Choose Next.
Skip the Configure rotation – optional page and choose Next.
Review the summary and choose Store.

Create the AWS Glue connection

To create your AWS Glue connection, complete the following steps:

On the AWS Glue Studio console, choose Connectors.
Under Connections, choose Create connection.
For Name, enter a name (for example, rds-vpc).
For Connection type, choose Network.
For VPC, Subnet, and Security groups, enter the values prepared in the prerequisite steps. The ETL jobs run in this network location and must be able to connect to the source database, Amazon S3, and Secrets Manager.
Choose Create connection.

You’re now ready to configure the rest of the solution.

Edit the workflow catalog

To prepare the workflow catalog, complete the following steps:

Download and edit the sample file.
If you are using the provided sample database, you must change the values of GlueExecutionRole and DestinationBucketName. If you are using your own databases, you must change all values except WorkflowName, JobScheduleType, and ScheduleCronPattern.

{
  "WorkflowList": [
    {
      "WorkflowName": "rds_to_s3_public_regions",
      "GlueExecutionRole": "arn:aws:iam::123456789012:role/AWSGlueServiceRole-data-lake-landing",
      "NetworkConnectionName": "rds-vpc",
      "SecretName": "DemoDBSecret",
      "SourceTable": "public.regions",
      "DestinationBucketName": "data-lake-raw-layer-123456789012-eu-west-1",
      "TransactionTimestampColumnName": "updated_at",
      "JobScheduleType": "Cron",
      "ScheduleCronPattern": "10 21 * * ? *"
    }
  ]
}

Rename the file to your_blueprint_name.json and upload it to your S3 bucket (for example, s3://data-lake-raw-layer-123456789012-eu-west-1/data-lake-landing.json).

The example workflow has the JobScheduleType set to Cron. See Time-based schedules for jobs and crawlers for examples of setting cron patterns. Alternatively, set JobScheduleType to OnDemand.
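To make the sample ScheduleCronPattern easier to read, the following sketch labels each of its six fields. It assumes the AWS cron field order (minutes, hours, day of month, month, day of week, year); the function name is hypothetical, and the sketch does not validate field values.

```python
# Field order used by AWS cron expressions (six fields, evaluated in UTC).
CRON_FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_cron_pattern(pattern: str) -> dict:
    """Split a six-field cron pattern into named fields (no validation of values)."""
    parts = pattern.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


schedule = parse_cron_pattern("10 21 * * ? *")
for field, value in schedule.items():
    print(f"{field}: {value}")
```

Read this way, the sample pattern runs at minute 10 of hour 21 every day, with no specific day of the week.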

See blueprint.cfg for the full list of parameters.

The provided workflow catalog JSON file contains job definitions of seven tables: public.regions, public.countries, public.locations, public.departments, public.jobs, public.employees, and public.job_history.
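If you maintain many tables, you could generate the catalog entries instead of editing them by hand. The sketch below is illustrative only: it assumes that every workflow follows the naming pattern of the sample entry (rds_to_s3_ plus the table name) and that all tables share the same role, connection, secret, timestamp column, and schedule, which may not hold for your databases. The function name is hypothetical.

```python
import json

# The seven sample tables listed in the post.
TABLES = [
    "public.regions", "public.countries", "public.locations",
    "public.departments", "public.jobs", "public.employees",
    "public.job_history",
]


def workflow_entry(source_table: str) -> dict:
    """Build one catalog entry; shared values are copied from the sample entry."""
    return {
        # Assumed naming pattern, inferred from the single sample entry.
        "WorkflowName": "rds_to_s3_" + source_table.replace(".", "_"),
        "GlueExecutionRole": "arn:aws:iam::123456789012:role/AWSGlueServiceRole-data-lake-landing",
        "NetworkConnectionName": "rds-vpc",
        "SecretName": "DemoDBSecret",
        "SourceTable": source_table,
        "DestinationBucketName": "data-lake-raw-layer-123456789012-eu-west-1",
        "TransactionTimestampColumnName": "updated_at",
        "JobScheduleType": "Cron",
        "ScheduleCronPattern": "10 21 * * ? *",
    }


catalog = {"WorkflowList": [workflow_entry(table) for table in TABLES]}
catalog_json = json.dumps(catalog, indent=2)
print(catalog_json)
```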

Review the PySpark script (optional)

The sample script performs the following:

Read the updated records from the source table:

jdbc_df = (spark.read.format("jdbc")
           .option("url", jdbcUrl)
           .option("user", secret["username"])
           .option("password", secret["password"])
           .option("query", sql_query)
           .load()
)

Add the date and timestamp columns:

df_withdate = jdbc_df.withColumn("ingestion_timestamp", lit(current_timestamp()))

Write the DataFrame to Amazon S3 as Parquet files.

Prepare the AWS Glue blueprint

Prepare your AWS Glue blueprint with the following steps:

Download the sample file and unzip it on your local computer.
Make any necessary changes to the PySpark script to include your own logic, and compress the three files (blueprint.cfg, jdbc_to_s3.py, layout.py; exclude any folders) as your_blueprint_name.zip (for example, data-lake-landing.zip):

zip data-lake-landing.zip blueprint.cfg jdbc_to_s3.py layout.py

Upload to the S3 bucket (for example, s3://data-lake-raw-layer-123456789012-region/data-lake-landing.zip).

Now you should have two files uploaded to your S3 bucket.
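If the zip command isn't available on your machine, the same archive can be built with Python's standard library. This is a minimal sketch; the function name and the directory arguments are hypothetical, and the three file names are the ones listed above.

```python
import zipfile
from pathlib import Path

# The three blueprint files named in the post.
BLUEPRINT_FILES = ("blueprint.cfg", "jdbc_to_s3.py", "layout.py")


def build_blueprint_zip(source_dir, zip_path) -> Path:
    """Compress the three blueprint files into a flat ZIP archive."""
    source_dir = Path(source_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in BLUEPRINT_FILES:
            # arcname keeps each file at the archive root, with no folder prefix.
            archive.write(source_dir / name, arcname=name)
    return zip_path
```

For example, build_blueprint_zip(".", "data-lake-landing.zip") would package the files from the current directory.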

Run the Step Functions workflow to deploy AWS Glue resources

To run the Step Functions workflow, complete the following steps:

On the Step Functions console, select your state machine (data-lake-landing) and choose View details.
Choose Start execution.
Keep the default values in the pop-up.
Choose Start execution.
Wait until the Success step at the bottom turns green.

It’s normal to have some intermediate steps with the status “Caught error.”

When the workflow catalog contains a large number of ETL job entries, you can expect some delays. In our test environment, creating 100 jobs from a clean state can take around 22 minutes; the second run (deleting existing AWS Glue resources and creating 100 jobs) can take around 27 minutes.

Verify the workflow in AWS Glue

To check the workflow, complete the following steps:

On the AWS Glue console, choose Workflows.
Verify that all AWS Glue workflows defined in workflow_config.json are listed.
Select one of the workflows, and on the Action menu, choose Run.
Wait for about 3 minutes (or longer if not using the provided database with sample data), and verify on the Amazon S3 console that new Parquet files are created in your data lake (for example, s3://data-lake-raw-layer-123456789012-region/database/table/ingestion_date=yyyy-mm-dd/).
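The landing location in the last step follows a date-partitioned layout. As a rough sketch, the prefix to look for could be composed as follows; the function name and the database and table values are hypothetical, and the actual folder names depend on the PySpark script.

```python
from datetime import date


def landing_prefix(bucket: str, database: str, table: str, ingestion_date: date) -> str:
    """Compose the S3 prefix for one table's files on a given ingestion date."""
    return (
        f"s3://{bucket}/{database}/{table}/"
        f"ingestion_date={ingestion_date.isoformat()}/"
    )


# Hypothetical values for illustration only.
prefix = landing_prefix(
    "data-lake-raw-layer-123456789012-region", "hr", "regions", date(2022, 10, 11)
)
print(prefix)
```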

Step Functions workflow overview

This section describes the major steps in the Step Functions workflow.

Register the AWS Glue blueprint

A blueprint allows you to parameterize a workflow (defining jobs and crawlers), and subsequently generate multiple AWS Glue workflows reusing the same code logic to handle similar data ETL activities. The following diagram illustrates the AWS Glue blueprint registration part of the Step Functions workflow.

The step Glue: CreateBlueprint takes the ZIP archive in Amazon S3 (sample) and registers it for later use.

To understand how to develop a blueprint, see Developing blueprints in AWS Glue.

Parse the workflow catalog and clean up resources

The step S3: ParseGlueWorkflowsConfig triggers the following Map state, and runs a set of steps for each element of an input array.

We set the maximum concurrency to five parallel iterations to lower the chance of exceeding the maximum allowed API request rate (per account per Region). For each ETL job definition, the Step Functions workflow cleans up relevant AWS Glue resources (if they exist), including the workflow, job, and trigger.
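The effect of a concurrency limit can be illustrated outside Step Functions with a bounded thread pool. This is an analogy only, not how the Map state is implemented; the worker function is a hypothetical stand-in for the per-item cleanup and creation steps.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 5  # mirrors the limit described above


class PeakCounter:
    """Tracks how many workers are active at once (for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1
        return False


counter = PeakCounter()


def process_job_definition(workflow_name: str) -> str:
    """Hypothetical stand-in for the steps run for each job definition."""
    with counter:
        time.sleep(0.01)  # simulate a rate-limited API call
    return f"{workflow_name}: done"


def run_map(items, max_concurrency=MAX_CONCURRENCY):
    # At most max_concurrency items are processed at the same time;
    # results come back in input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(process_job_definition, items))


results = run_map([f"workflow_{i}" for i in range(20)])
print(len(results), "items processed; peak concurrency:", counter.peak)
```

Capping the number of in-flight items keeps the request rate bounded at the cost of a longer total run, which matches the deployment times reported earlier for 100 jobs.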

For more information on the Map state, refer to Map.

Run the AWS Glue blueprint

Within the Map state, the step Glue: CreateWorkflowFromBlueprint starts an asynchronous process to create the AWS Glue workflow (for each job definition), and the jobs and triggers that the workflow encapsulates.

In this solution, all AWS Glue workflows share the same logic, beginning with a trigger to handle the schedule, followed by a job to run the ETL logic.

As indicated by the step CreateWorkflowFailed, any AWS Glue blueprint creation failure stops the whole Step Functions workflow and marks it with a failed status. Note that no rollback will happen. Fix the errors and rerun the Step Functions workflow. Rerunning does not create duplicate AWS Glue resources, because existing ones are cleaned up in the process.

Limitations

Note the following limitations of this solution:

Each run of the Step Functions workflow deletes all relevant AWS Glue jobs defined in the workflow catalog, and creates new jobs with a different (random) suffix. As a result, you will lose the job run history in AWS Glue. The underlying metrics and logs are retained in Amazon CloudWatch.

Clean up

To avoid incurring future charges, perform the following steps:

Disable the schedules of the deployed AWS Glue jobs:
1. Open the workflow catalog file in your S3 bucket (s3://data-lake-raw-layer-123456789012-eu-west-1/data-lake-landing.json) and change the value of JobScheduleType to OnDemand for all workflow definitions.
2. Run the Step Functions workflow (data-lake-landing).
3. Observe that all AWS Glue triggers ending with _starting_trigger have the trigger type On-demand instead of Schedule.
Empty the S3 bucket and delete the CloudFormation stack.
Delete the deployed AWS Glue resources:
1. All AWS Glue triggers ending with _starting_trigger.
2. All AWS Glue jobs starting with the WorkflowName defined in the workflow catalog.
3. All AWS Glue workflows with the WorkflowName defined in the workflow catalog.
4. AWS Glue blueprints.
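The naming rules in the list above can be expressed as simple filters. The sketch below only selects names from lists you supply; listing and deleting the resources through the AWS Glue console or API is not shown, and the function name and sample resource names (including the job suffix) are hypothetical.

```python
def select_glue_resources_to_delete(workflow_names, triggers, jobs, workflows):
    """Pick the triggers, jobs, and workflows that match the cleanup rules."""
    names = set(workflow_names)
    return {
        # All triggers ending with _starting_trigger
        "triggers": [t for t in triggers if t.endswith("_starting_trigger")],
        # All jobs whose name starts with a WorkflowName from the catalog
        "jobs": [j for j in jobs if any(j.startswith(n) for n in names)],
        # All workflows named exactly as in the catalog
        "workflows": [w for w in workflows if w in names],
    }


# Hypothetical resource names for illustration only.
to_delete = select_glue_resources_to_delete(
    workflow_names=["rds_to_s3_public_regions"],
    triggers=["rds_to_s3_public_regions_starting_trigger", "nightly_report_trigger"],
    jobs=["rds_to_s3_public_regions_job_ab12", "unrelated_job"],
    workflows=["rds_to_s3_public_regions", "other_workflow"],
)
print(to_delete)
```

Review the selected names before deleting anything, because the trigger rule matches by suffix only and could include triggers from other solutions that use the same suffix.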

Conclusion

AWS Glue blueprints allow data engineers to build and maintain AWS Glue jobs landing data from RDBMS to your data lake at scale. By adopting this standardized and reusable approach, instead of maintaining hundreds of AWS Glue jobs, you now keep track of the workflow catalog. When you have new tables to land to your data lake, simply add the entries to your workflow catalog and rerun the Step Functions workflow to deploy resources.

We highly encourage you to customize the blueprints for your multi-step data pipeline (for example, detect and mask sensitive data) and make them available to your organization and the AWS Glue community. To get started, see Performing complex ETL activities using blueprints and workflows in AWS Glue and the sample blueprints on GitHub. If you have any questions, please leave a comment.

About the Authors

Moustafa Mahmoud is a Solutions Architect of AWS Data Lab with a passion for data integration, data analysis, machine learning, and BI. Moustafa helps customers convert their ideas to a production-ready data product on AWS. He has over 10 years of experience as a data engineer, machine learning practitioner, and software developer. In his spare time, Moustafa loves exploring nature, reading, and spending time with friends and family.

Corvus Lee is a Solutions Architect of AWS Data Lab. He enjoys all kinds of data-related discussions, and helps customers build MVPs using AWS Databases, Analytics, and Machine Learning services.

IAM Access Analyzer makes it simpler to author and validate role trust policies

2022-10-04 Mathangi Ramesh

Post Syndicated from Mathangi Ramesh original https://aws.amazon.com/blogs/security/iam-access-analyzer-makes-it-simpler-to-author-and-validate-role-trust-policies/

AWS Identity and Access Management (IAM) Access Analyzer provides many tools to help you set, verify, and refine permissions. One part of IAM Access Analyzer—policy validation—helps you author secure and functional policies that grant the intended permissions. Now, I’m excited to announce that AWS has updated the IAM console experience for role trust policies to make it simpler for you to author and validate the policy that controls who can assume a role. In this post, I’ll describe the new capabilities and show you how to use them as you author a role trust policy in the IAM console.

Overview of changes

A role trust policy is a JSON policy document in which you define the principals that you trust to assume the role. The principals that you can specify in the trust policy include users, roles, accounts, and services. The new IAM console experience provides the following features to help you set the right permissions in the trust policy:

An interactive policy editor prompts you to add the right policy elements, such as the principal and the allowed actions, and offers context-specific documentation.
As you author the policy, IAM Access Analyzer runs over 100 checks against your policy and highlights issues to fix. This includes new policy checks specific to role trust policies, such as a check to make sure that you’ve formatted your identity provider correctly. These new checks are also available through the IAM Access Analyzer policy validation API.
Before saving the policy, you can preview findings for the external access granted by your trust policy. This helps you review external access, such as access granted to a federated identity provider, and confirm that you grant only the intended access when you create the policy. This functionality was previously available through the APIs, but now it’s also available in the IAM console.

In the following sections, I’ll walk you through how to use these new features.

Example scenario

For the walkthrough, consider the following example, which is illustrated in Figure 1. You are a developer for Example Corp., and you are working on a web application. You want to grant the application hosted in one account (the ApplicationHost account) access to data in another account (the BusinessData account). To do this, you can use an IAM role in the BusinessData account to grant temporary access to the application through a role trust policy. You will allow a role in the ApplicationHost account (the PaymentApplication role) to access the BusinessData account through a role in that account (the ApplicationAccess role). In this example, you create the ApplicationAccess role and grant cross-account permissions through the trust policy by using the new IAM console experience that helps you set the right permissions.

Figure 1: Visual explanation of the scenario

Create the role and grant permissions through a role trust policy with the policy editor

In this section, I will show you how to create a role trust policy for the ApplicationAccess role to grant the application access to the data in your account through the policy editor in the IAM console.

To create a role and grant access

In the BusinessData account, open the IAM console, and in the left navigation pane, choose Roles.
Choose Create role, and then select Custom trust policy, as shown in Figure 2.

Figure 2: Select “Custom trust policy” when creating a role
In the Custom trust policy section, for 1. Add actions for STS, select the actions that you need for your policy. For example, to add the action sts:AssumeRole, choose AssumeRole.

Figure 3: JSON role trust policy
For 2. Add a principal, choose Add to add a principal.
In the Add principal box, for Principal type, select IAM roles. This populates the ARN field with the format of the role ARN that you need to add to the policy, as shown in Figure 4.

Figure 4: Add a principal to your role trust policy
Update the role ARN template with the actual account and role information, and then choose Add principal. In our example, the account is ApplicationHost with an AWS account number of 111122223333, and the role is PaymentApplication. Therefore, the role ARN is arn:aws:iam::111122223333:role/PaymentApplication. Figure 5 shows the role trust policy with the action and principal added.

Figure 5: Sample role trust policy
(Optional) To add a condition, for 3. Add a condition, choose Add, and then complete the Add condition box according to your needs.
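The policy that results from these steps can also be assembled programmatically. The following is a minimal sketch that builds a trust policy document with the action and principal from this example; the function name is hypothetical, and the output may differ in formatting from what the console shows in Figure 5.

```python
import json


def build_trust_policy(account_id: str, role_name: str, action: str = "sts:AssumeRole") -> dict:
    """Build a role trust policy that lets one IAM role assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/{role_name}"},
                "Action": action,
            }
        ],
    }


# Account number and role name are the example values from this post.
trust_policy = build_trust_policy("111122223333", "PaymentApplication")
print(json.dumps(trust_policy, indent=2))
```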

Author secure policies by reviewing policy validation findings

As you author the policy, you can see errors or warnings related to your policy in the policy validation window, which is located below the policy editor in the console. With this launch, policy validation in IAM Access Analyzer includes 13 new checks focused on the trust relationship for the role. The following are a few examples of these checks and how to address them:

Role trust policy unsupported wildcard in principal – you can’t use a * in your role trust policy.
Invalid federated principal syntax in role trust policy – you need to fix the format of the identity provider.
Missing action for condition key – you need to add the right action for a given condition, such as the sts:TagSession when there are session tag conditions.

For a complete list of checks, see Access Analyzer policy check reference.

To review and fix policy validation findings

In the policy validation window, do the following:
- Choose the Security tab to check if your policy is overly permissive.
- Choose the Errors tab to review any errors associated with the policy.
- Choose the Warnings tab to review if aspects of the policy don’t align with AWS best practices.
- Choose the Suggestions tab to get recommendations on how to improve the quality of your policy.
Figure 6: Policy validation window in IAM Access Analyzer with a finding for your policy
For each finding, choose Learn more to review the documentation associated with the finding and take steps to fix it. For example, Figure 6 shows the error Mismatched Action For Principal. To fix the error, remove the action sts:AssumeRoleWithWebIdentity.

Preview external access by reviewing cross-account access findings

IAM Access Analyzer also generates findings to help you assess if a policy grants access to external entities. You can review the findings before you create the policy to make sure that the policy grants only intended access. To preview the findings, you create an analyzer and then review the findings.

To preview findings for external access

Below the policy editor, in the Preview external access section, choose Go to Access Analyzer, as shown in Figure 7.

Note: IAM Access Analyzer is a regional service, and you can create a new analyzer in each AWS Region where you operate. In this situation, IAM Access Analyzer looks for an analyzer in the Region where you landed on the IAM console. If IAM Access Analyzer doesn’t find an analyzer there, it asks you to create an analyzer.

Figure 7: Preview external access widget without an analyzer
On the Create analyzer page, do the following to create an analyzer:
- For Name, enter a name for your analyzer.
- For Zone of trust, select the correct account.
- Choose Create analyzer.
Figure 8: Create an analyzer to preview findings
After you create the analyzer, navigate back to the role trust policy for your role to review the external access granted by this policy. The following figure shows that external access is granted to PaymentApplication.

Figure 9: Preview finding
If the access is intended, you don’t need to take any action. In this example, I want the PaymentApplication role in the ApplicationHost account to assume the role that I’m creating.
If the access is unintended, resolve the finding by updating the role ARN information.
Select Next and grant the required IAM permissions for the role.
Name the role ApplicationAccess, and then choose Save to save the role.

Now the application can use this role to access the BusinessData account.

Conclusion

By using the new IAM console experience for role trust policies, you can confidently author policies that grant the intended access. IAM Access Analyzer helps you in your least-privilege journey by evaluating the policy for potential issues to make it simpler for you to author secure policies. IAM Access Analyzer also helps you preview external access granted through the trust policy to help ensure that the granted access is intended. To learn more about how to preview IAM Access Analyzer cross-account findings, see Preview access in the documentation. To learn more about IAM Access Analyzer policy validation checks, see Access Analyzer policy validation. These features are also available through APIs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread at AWS IAM re:Post or contact AWS Support.

Automate ETL jobs between Amazon RDS for SQL Server and Azure Managed SQL using AWS Glue Studio

2022-09-26 Daniel Maldonado

Post Syndicated from Daniel Maldonado original https://aws.amazon.com/blogs/big-data/automate-etl-jobs-between-amazon-rds-for-sql-server-and-azure-managed-sql-using-aws-glue-studio/

Nowadays many customers are following a multi-cloud strategy. They might choose to use various cloud-managed services, such as Amazon Relational Database Service (Amazon RDS) for SQL Server and Azure SQL Managed Instances, to perform data analytics tasks, but still use traditional extract, transform, and load (ETL) tools to integrate and process the data. However, traditional ETL tools may require you to develop custom scripts, which makes ETL automation difficult.

In this post, I show you how to automate ETL jobs between Amazon RDS for SQL Server and Azure SQL Managed Instances using AWS Glue Studio, which is part of AWS Glue, a fully managed serverless integration service. AWS Glue Studio has a graphical interface that makes it easy to create, run, and monitor ETL jobs, and you can create a schedule to run your jobs at specific times.

Solution overview

To move data from one database to another, there are different services available either on premises or in the cloud, which vary by bandwidth limits, support for ongoing changes (CDC), schema and table modifications, and other features. Beyond that, we need to apply advanced data transformations, and monitor and automate the ETL jobs. This is where AWS Glue Studio can help us facilitate these activities.

As shown in the following diagram, we use AWS Glue Studio as the middleware to pull data from the source database (in this case an Azure SQL Managed Instance), then create and automate the ETL job using one of the pre-built transformations in AWS Glue Studio. Finally, we load the data to the target database (in this case an RDS for SQL Server instance).

The solution workflow consists of the following steps:

Create connections for the source and target databases.
Create and run AWS Glue crawlers.
Create and run an ETL job that transforms the data and loads it from source to target.
Schedule the ETL job to run automatically.
Monitor the ETL job.

Prerequisites

Complete the following prerequisite steps:

Install SQL Server Management Studio (SSMS) or an equivalent client tool.
Set up a VPN connection between Amazon Virtual Private Cloud (Amazon VPC) and the Azure private subnet.
Create a security group for AWS Glue ENI in your VPC.
Create an AWS Identity and Access Management (IAM) role for AWS Glue. For instructions, refer to Setting up IAM permissions for AWS Glue.
Open the appropriate firewall ports in the Azure private subnet.
Create a source database table (Azure SQL Managed Instance). You can deploy the Azure database instance using the following QuickStart. For testing purposes, I import the public AdventureWorks sample database and use the dbo.Employee table. See the following code:
```
-- Query table
SELECT * FROM [AdventureWorksLT2019].[dbo].[Employee]
```
Create the target database table (Amazon RDS for SQL Server). To deploy the RDS instance, refer to Create and Connect to a Microsoft SQL Server Database with Amazon RDS. You can create an empty database and table with the following statements. This is the table where the data coming from Azure will be stored.

-- Create database
CREATE DATABASE AdventureWorksonRDS;
GO
-- Switch to the new database so the table is created there
USE AdventureWorksonRDS;
GO
-- Create table
CREATE TABLE Employee
(EmpID INT NOT NULL,
EmpName VARCHAR(50) NOT NULL,
Designation VARCHAR(50) NULL,
Department VARCHAR(50) NULL,
JoiningDate DATETIME NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED (EmpID)
);

-- Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee]

Create connections

The first step is to populate our AWS Glue Data Catalog with the schema information coming from our source and target data sources.

To do that, we first create connections. A connection is a Data Catalog object that stores connection information for a particular data store. Connections store login credentials, URI strings, VPC information, and more. Creating connections in the Data Catalog saves the effort of having to specify the connection details every time you create a crawler or job.

Create a connection for Azure SQL Managed Instance

To create the connection to our source database, complete the following steps:

On the AWS Glue console, choose AWS Glue Studio.
In the navigation pane of the AWS Glue Studio console, choose Connectors.
Choose Create connection.
For Name, enter AzureSQLManaged.
For Connection type, choose JDBC.
For JDBC URL, use the SQL Server syntax jdbc:protocol://host:port;database=db_name.

You can find the host and database name on the Azure SQL Managed Instance service console, on the Overview page. For this specific example, we use the following information for our Azure SQL Instance:

- Protocol – sqlserver
- Host – adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net
- Port – 3342
- Database name – AdventureWorksLT2019

Enter your user name and password.
Choose Create connection.
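
The JDBC URL pattern from the steps above can also be assembled in code, which helps avoid typos when you manage several connections. The following is a minimal Python sketch using the example values from this section; the helper name build_sqlserver_jdbc_url is hypothetical and is not part of AWS Glue.

```python
def build_sqlserver_jdbc_url(host: str, port: int, database: str,
                             protocol: str = "sqlserver") -> str:
    """Return a URL of the form jdbc:protocol://host:port;database=db_name."""
    if not host or not database:
        raise ValueError("host and database are required")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:{protocol}://{host}:{port};database={database}"


# Example values taken from the connection settings listed in this post.
example_url = build_sqlserver_jdbc_url(
    host="adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net",
    port=3342,
    database="AdventureWorksLT2019",
)
print(example_url)
```

You would paste the printed value into the JDBC URL field when you create the connection.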

Create a connection for Amazon RDS for SQL Server

To create a connection for our target data source, complete the following steps:

On the AWS Glue Studio console, choose Connectors in the navigation pane.
Choose Create connection.
For Name, enter AWSRDSforSQL.
For Connection type, choose Amazon RDS.
For Database engine, choose Microsoft SQL Server.
For Database instances, choose your RDS DB instance.
For Database name, enter AdventureWorksonRDS.
Enter your user name and password.
Choose Create connection.

You can now see the two connections created in the Connections section.

Create and run AWS Glue crawlers

You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the most common method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, it updates the Data Catalog with the tables it found. The ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

Create a crawler for Azure SQL Managed Instance

To create a crawler for our source database, complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler.
If the data hasn’t been mapped into an AWS Glue table, select Not yet and choose Add a data source.
For Data source, choose JDBC.
For Connection, choose AzureSQLManaged.
For Include path, specify the path of the database including the schema: AdventureWorksLT2019/dbo/%.
Choose Add a JDBC data source.
Choose Next.
Choose the IAM role created as part of the prerequisites and choose Next.
Choose Add database to create the target database in the AWS Glue Data Catalog.
For Name, enter azuresqlmanaged_db.
Choose Create database.
For Target database, choose azuresqlmanaged_db.
Choose Next.
Review if everything looks correct and choose Create crawler.

Create a crawler for Amazon RDS for SQL Server

Repeat the crawler creation steps to create the crawler for the target RDS for SQL Server database, using the following information:

Crawler name – AmazonRDSSQL_Crawler
Data source – JDBC
Connection – AWSRDSforSQL
Include path – AdventureWorksonRDS/dbo/%
IAM role – AWSGlueServiceRoleDefault
Database name – amazonrdssql_db
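
If you prefer to script crawler creation instead of using the console, the same settings map onto the AWS Glue CreateCrawler API. The following Python sketch only builds the request from the values listed above; the helper names are hypothetical. The create_crawler function assumes that boto3 is installed and that AWS credentials are configured, so treat it as an illustration rather than a tested deployment script.

```python
def build_jdbc_crawler_request(name, role, connection_name, include_path, database_name):
    """Build keyword arguments shaped for the AWS Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database_name,
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": include_path}
            ]
        },
    }


# Values for the target crawler, as listed in this section.
rds_crawler = build_jdbc_crawler_request(
    name="AmazonRDSSQL_Crawler",
    role="AWSGlueServiceRoleDefault",
    connection_name="AWSRDSforSQL",
    include_path="AdventureWorksonRDS/dbo/%",
    database_name="amazonrdssql_db",
)


def create_crawler(request):
    """Send the request to AWS Glue (assumes boto3 and valid credentials)."""
    import boto3  # imported lazily so the request can be built without AWS access

    boto3.client("glue").create_crawler(**request)
```

Calling create_crawler(rds_crawler) would create the crawler in the Region configured for your credentials.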

Run the crawlers

Now it’s time to run the crawlers.

On the AWS Glue console, choose Crawlers in the navigation pane.
Select the crawlers you created and choose Run.
When the crawler is complete, choose Databases in the navigation pane. Here you can find the databases discovered by the crawler.
Choose Tables in the navigation pane and explore the tables discovered by the crawler, which correctly identified the data type as SQL Server.
Choose the table adventureworkslt2019_dbo_employee and review the schema created for the data source.

Create and run an ETL job

Now that we have crawled our source and target databases, and we have the data in the AWS Glue Data Catalog, we can create an ETL job to load and transform this data.

On the AWS Glue Studio console, choose Jobs in the navigation pane.
Select Visual with a blank canvas to use a visual interface to create our ETL jobs.
Choose Create.
On the Source menu, choose AWS Glue Data Catalog.
On the Data source properties tab, specify the database and table (for this post, azuresqlmanaged_db and adventureworkslt2019_dbo_employee).
On the Transform menu, choose Apply mapping to map the source fields to the target database.
On the Transform tab, you can see the data fields to be loaded, and you even can drop some of them if needed.
On the Target menu, choose AWS Glue Data Catalog.
On the Data target properties tab, choose the database and table where you want to load the transformed data (for this post, amazonrdssql_db and adventureworksonrds_dbo_employee).
On the Job details tab, for Name, enter ETL_Azure_to_AWS.
For IAM Role, choose the appropriate role.
Choose Save.
Choose Run to run the job.

If the ETL job ran successfully, it should have loaded the data from the source database (Azure SQL Managed Instance) into the target database (Amazon RDS for SQL Server). To confirm it, you can connect to the target database using SQL Server Management Studio (SSMS), and query the previously empty table AdventureWorksonRDS.dbo.Employee. It should now contain the data coming from the Azure SQL Managed Instance.

-- Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee]

Schedule your ETL job

In AWS Glue Studio, you can create a schedule to have your jobs run at specific times. Each scheduled run reimports the full dataset; to do incremental loads instead, refer to the AWS Glue documentation on job bookmarks. You can schedule your ETL jobs on an hourly, daily, weekly, monthly, or custom basis, depending on your needs. To schedule a job, complete the following steps:

On the AWS Glue Studio console, navigate to the job you created.
On the Schedules tab, choose Create schedule.
For Name, enter a name (for example, dbo_employee_daily_load).
Choose your preferred frequency, start hour, and minute of the hour. For this post, we schedule it daily at 3:00 UTC.
For Description, enter an optional description.
Choose Create schedule.

Confirm on the Schedules tab that the schedule was created and activated successfully.

You have now automated your ETL job to run at your desired frequency.
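
Behind the console form, AWS Glue time-based schedules are expressed as six-field cron expressions in UTC (minutes, hours, day of month, month, day of week, year). As a minimal sketch, the following Python helper, whose name is hypothetical, builds the expression for a daily schedule such as the 3:00 UTC run used in this post.

```python
def daily_cron(hour: int, minute: int = 0) -> str:
    """Return an AWS-style cron expression for a job that runs once a day (UTC)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour} * * ? *)"


schedule = daily_cron(3)  # daily at 3:00 UTC, as in the example above
print(schedule)
```

The same expression can be supplied when you define a scheduled trigger through the AWS Glue API or infrastructure as code.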

Monitor your ETL job

The job monitoring dashboard provides an overall summary of the job runs, with totals for the jobs with a status of Running, Canceled, Success, or Failed.

The Runs tab shows the jobs for the specified date range and filters. You can filter the jobs on additional criteria, such as status, worker type, job type, and job name.

Conclusion

In this post, I went through the steps to automate ETL jobs using AWS Glue Studio, which is a user-friendly graphical interface to perform data integration tasks such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes. You can easily find and access this data using the AWS Glue Data Catalog. Data engineers and ETL developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio.

About the author

Daniel Maldonado is an AWS Solutions Architect, specializing in Microsoft workloads and big data technologies, and focused on helping customers migrate their applications and data to AWS. Daniel has over 13 years of experience working with information technologies and enjoys helping clients reap the benefits of running their workloads in the cloud.

Announcing an update to IAM role trust policy behavior

2022-09-22 Mark Ryland

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/announcing-an-update-to-iam-role-trust-policy-behavior/

AWS Identity and Access Management (IAM) is changing an aspect of how role trust policy evaluation behaves when a role assumes itself. Previously, roles implicitly trusted themselves from a role trust policy perspective if they had identity-based permissions to assume themselves. After receiving and considering feedback from customers on this topic, AWS is changing role assumption behavior to always require self-referential role trust policy grants. This change improves consistency and visibility with regard to role behavior and privileges. This change allows customers to create and understand role assumption permissions in a single place (the role trust policy) rather than two places (the role trust policy and the role identity policy). It increases the simplicity of role trust permission management: “What you see [in the trust policy] is what you get.”

Therefore, beginning today, for any role that has not used the identity-based behavior since June 30, 2022, a role trust policy must explicitly grant permission to all principals, including the role itself, that need to assume it under the specified conditions. Removal of the role’s implicit self-trust improves consistency and increases visibility into role assumption behavior.

Most AWS customers will not be impacted by the change at all. Only a tiny percentage (approximately 0.0001%) of all roles are involved. Customers whose roles have recently used the previous implicit trust behavior are being notified, beginning today, about those roles, and may continue to use this behavior with those roles until February 15, 2023, to allow time for making the necessary updates to code or configuration. Or, if these customers are confident that the change will not impact them, they can opt out immediately by substituting in new roles, as discussed later in this post.

The first part of this post briefly explains the change in behavior. The middle sections answer practical questions like: “why is this happening?,” “how might this change impact me?,” “which usage scenarios are likely to be impacted?,” and “what should I do next?” The usage scenario section is important because it shows that, based on our analysis, the self-assuming role behavior exhibited by code or human users is very likely to be unnecessary and counterproductive. Finally, for security professionals interested in better understanding the reasons for the old behavior, the rationale for the change, as well as its possible implications, the last section reviews a number of core IAM concepts and digs in to additional details.

What is changing?

Until today, an IAM role implicitly trusted itself. Consider the following role trust policy attached to the role named RoleA in AWS account 123456789012.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/RoleB"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

This role trust policy grants role assumption access to the role named RoleB in the same account. However, if the corresponding identity-based policy for RoleA grants the sts:AssumeRole action with respect to itself, then RoleA could also assume itself. Therefore, there were actually two roles that could assume RoleA: the explicitly permissioned RoleB, and RoleA, which implicitly trusted itself as a byproduct of the IAM ownership model (explained in detail in the final section). Note that the identity-based permission that RoleA must have to assume itself is not required in the case of RoleB, and indeed an identity-based policy associated with RoleB that references other roles is not sufficient to allow RoleB to assume them. The resource-based permission granted by RoleA’s trust policy is both necessary and sufficient to allow RoleB to assume RoleA.

Although earlier we summarized this behavior as “implicit self-trust,” the key point here is that the ability of RoleA to assume itself is not actually implicit behavior. The role’s self-referential permission had to be explicit in one place or the other (or both): either in the role’s identity-based policy (perhaps based on broad wildcard permissions), or its trust policy. But unlike the case with other principals and role trust, an IAM administrator would have to look in two different policies to determine whether a role could assume itself.

As of today, for any new role, or any role that has not recently assumed itself while relying on the old behavior, IAM administrators must modify the previously shown role trust policy as follows to allow RoleA to assume itself, regardless of the privileges granted by its identity-based policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:role/RoleB",
                    "arn:aws:iam::123456789012:role/RoleA"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

This change makes role trust behavior clearer and more consistent to understand and manage, whether directly by humans or as embodied in code.
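
To illustrate the difference between the two sample policies, the following Python sketch checks whether a trust policy document explicitly names a given role ARN as a principal allowed to call sts:AssumeRole. It is a simplified illustration with hypothetical helper names, not how IAM evaluates policies: it ignores Deny statements, conditions, wildcards, and account root principals.

```python
ROLE_A = "arn:aws:iam::123456789012:role/RoleA"
ROLE_B = "arn:aws:iam::123456789012:role/RoleB"


def as_list(value):
    """IAM policy elements may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trusted_role_arns(trust_policy: dict) -> set:
    """Collect AWS principals from Allow statements that grant sts:AssumeRole."""
    arns = set()
    for statement in as_list(trust_policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in as_list(statement.get("Action", [])):
            continue
        principal = statement.get("Principal", {})
        if isinstance(principal, dict):  # skips the bare "*" principal form
            arns.update(as_list(principal.get("AWS", [])))
    return arns


def explicitly_trusts_itself(role_arn: str, trust_policy: dict) -> bool:
    return role_arn in trusted_role_arns(trust_policy)


def make_policy(principals):
    """Build a trust policy shaped like the samples in this post."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Principal": {"AWS": principals}, "Action": "sts:AssumeRole"}
        ],
    }


first_policy = make_policy(ROLE_B)             # first sample: only RoleB is trusted
second_policy = make_policy([ROLE_B, ROLE_A])  # second sample: RoleA trusts itself
print(explicitly_trusts_itself(ROLE_A, first_policy))
print(explicitly_trusts_itself(ROLE_A, second_policy))
```

Under the new behavior, only the second policy allows RoleA to assume itself.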

How might this change impact me?

As previously noted, most customers will not be impacted by the change at all. For those customers who do use the prior implicit trust grant behavior, AWS will work with you to eliminate your usage prior to February 15, 2023. Here are more details for the two cases of customers who have not used the behavior, and those who have.

If you haven’t used the implicit trust behavior since June 30, 2022

Beginning today, if you have not used the old behavior for a given role at any time since June 30, 2022, you will now experience the new behavior. Those existing roles, as well as any new roles, will need an explicit reference in their own trust policy in order to assume themselves. If you have roles that are used only very occasionally, such as once per quarter for a seldom-run batch process, you should identify those roles and if necessary either remove the dependency on the old behavior or update their role trust policies to include the role itself prior to their next usage (see the second sample policy above for an example).

If you have used the implicit trust behavior since June 30, 2022

If you have a role that has used the implicit trust behavior since June 30, 2022, then you will continue to be able to do so with that role until February 15, 2023. AWS will provide you with notice referencing those roles beginning today through your AWS Health Dashboard and will also send an email with the relevant information to the account owner and security contact. We are allowing time for you to make any necessary changes to your existing processes, code, or configurations to prepare for removal of the implicit trust behavior. If you can’t change your processes or code, you can continue to use the behavior by making a configuration change—namely, by updating the relevant role trust policies to reference the role itself. On the other hand, you can opt out of the old behavior at any time by creating a new role with a different Amazon Resource Name (ARN) with the desired identity-based and trust-policy-based permissions and substituting it for any older role that was identified as using the implicit trust behavior. (The new role will not be allow-listed, because the allow list is based on role ARNs.) You can also modify an existing allow-listed role’s trust policy to explicitly deny access to itself. See the “What should I do next?” section for more information.

Notifications and retirement

As we previously noted, starting today, accounts with existing roles that use the implicit self-assume role assumption behavior will be notified of this change by email and through their AWS Health Dashboard. Those roles have been allow-listed, and so for now their behavior will continue as before. After February 15, 2023, the old behavior will be retired for all roles and all accounts. IAM Documentation has been updated to make clear the new behavior.

After the old behavior is retired from the allow-listed roles and accounts, role sessions that make self-referential role assumption calls will fail with an Access Denied error unless the role’s trust policy explicitly grants the permission directly through a role ARN. Another option is to grant permission indirectly through an ARN to the root principal in the trust policy that acts as a delegation of privilege management, after which permission grants in identity-based policies determine access, similar to the typical cross-account case.

Which usage scenarios are likely to be impacted?

Users often attach an IAM role to an Amazon Elastic Compute Cloud (Amazon EC2) instance, an Amazon Elastic Container Service (Amazon ECS) task, or AWS Lambda function. Attaching a role to one of these runtime environments enables workloads to use short-term session credentials based on that role. For example, when an EC2 instance is launched, AWS automatically creates a role session and assigns it to the instance. An AWS best practice is for the workload to use these credentials to issue AWS API calls without explicitly requesting short-term credentials through sts:AssumeRole calls.

However, examples and code snippets commonly available on internet forums and community knowledge sharing sites might incorrectly suggest that workloads need to call sts:AssumeRole to establish short-term session credentials for operation within those environments.

We analyzed AWS Security Token Service (AWS STS) service metadata about role self-assumption in order to understand the use cases and possible impact of the change. The data shows that in almost all cases this behavior occurs because code unnecessarily reassumes a role that the Amazon EC2, Amazon ECS, Amazon Elastic Kubernetes Service (Amazon EKS), or Lambda runtime environment has already provided. There are two exceptions, discussed at the end of this section under the headings “Self-assumption with a scoped-down policy” and “Assuming an expected target compute role during development.”

There are many variations on this theme, but overall, most role self-assumption occurs in scenarios where the person or code is unnecessarily reassuming the role that the code was already running as. Although this practice and code style can still work with a configuration change (by adding an explicit self-reference to the role trust policy), the better practice will almost always be to remove this unnecessary behavior or code from your AWS environment going forward. By removing this unnecessary behavior, you save CPU, memory, and network resources.
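
One way to spot this pattern in your own code is to compare the identity the code is already running as with the role it is about to assume. The following Python sketch is a minimal illustration with hypothetical helper names. It assumes the caller ARN has the assumed-role form returned by the AWS STS GetCallerIdentity API (arn:aws:sts::account-id:assumed-role/role-name/session-name), which does not include the role path.

```python
def account_id(arn: str) -> str:
    """Return the account ID field of an ARN."""
    return arn.split(":")[4]


def role_name(arn: str) -> str:
    """Return the role name from an IAM role ARN or an STS assumed-role ARN."""
    resource = arn.split(":", 5)[5]
    parts = resource.split("/")
    if parts[0] == "assumed-role":
        return parts[1]   # assumed-role/<role-name>/<session-name>
    if parts[0] == "role":
        return parts[-1]  # role/<optional/path/><role-name>
    raise ValueError(f"not a role ARN: {arn}")


def is_self_assumption(caller_arn: str, target_role_arn: str) -> bool:
    """True if the caller's current role is the same role it wants to assume."""
    return (account_id(caller_arn) == account_id(target_role_arn)
            and role_name(caller_arn) == role_name(target_role_arn))


caller = "arn:aws:sts::123456789012:assumed-role/RoleA/my-session"
print(is_self_assumption(caller, "arn:aws:iam::123456789012:role/RoleA"))
print(is_self_assumption(caller, "arn:aws:iam::123456789012:role/RoleB"))
```

When the check returns True, the code can usually skip the sts:AssumeRole call and use the credentials that the environment already provides.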

Common mistakes when using Amazon EKS

Some users of the Amazon EKS service (or possibly their shell scripts) use the command line interface (CLI) command aws eks get-token to obtain an authentication token for use in managing a Kubernetes cluster. The command takes as an optional parameter a role ARN. That parameter allows a user to assume a role other than the one they are currently using before they call get-token. However, the CLI cannot call that API without already having an IAM identity. Some users might believe that they need to specify the role ARN of the role they are already using. We have updated the Amazon EKS documentation to make clear that this is not necessary.

Common mistakes when using AWS Lambda

Another example is the use of an sts:AssumeRole API call from a Lambda function. The function is already running in a preassigned role provided by user configuration within the Lambda service, or else it couldn’t successfully call any authenticated API action, including sts:AssumeRole. However, some Lambda functions call sts:AssumeRole with the target role being the very same role that the Lambda function has already been provided as part of its configuration. This call is unnecessary.

AWS Software Development Kits (SDKs) all have support for running in AWS Lambda environments and automatically using the credentials provided in that environment. We have updated the Lambda documentation to make clear that such STS calls are unnecessary.

Common mistakes when using Amazon ECS

Customers can associate an IAM role with an Amazon ECS task to give the task AWS credentials to interact with other AWS resources.

We detected ECS tasks that call sts:AssumeRole on the same role that was provided to the ECS task. Amazon ECS makes the role’s credentials available inside the compute resources of the ECS task, whether on Amazon EC2 or AWS Fargate, and these credentials can be used to access AWS services or resources as the IAM role associated with the ECS task, without a call to sts:AssumeRole. AWS handles renewing the credentials available on ECS tasks before the credentials expire. AWS STS role assumption calls are unnecessary, because they simply create a new set of the same temporary role session credentials.

AWS SDKs all have support for running in Amazon ECS environments and automatically using the credentials provided in that ECS environment. We have updated the Amazon ECS documentation to make clear that calling sts:AssumeRole for an ECS task is unnecessary.

Common mistakes when using Amazon EC2

Users can configure an Amazon EC2 instance to contain an instance profile. This instance profile defines the IAM role that Amazon EC2 assigns the compute instance when it is launched and begins to run. The role attached to the EC2 instance enables your code to send signed requests to AWS services. Without this attached role, your code would not be able to access your AWS resources (nor would it be able to call sts:AssumeRole). The Amazon EC2 service handles renewing these temporary role session credentials that are assigned to the instance before they expire.

We have observed that workloads running on EC2 instances call sts:AssumeRole to assume the same role that is already associated with the EC2 instance and use the resulting role-session for communication with AWS services. These role assumption calls are unnecessary, because they simply create a new set of the same temporary role session credentials.

AWS SDKs all have support for running in Amazon EC2 environments and automatically using the credentials provided in that EC2 environment. We have updated the Amazon EC2 documentation to make clear that calling sts:AssumeRole for an EC2 instance with a role assigned is unnecessary.

For information on creating an IAM role, attaching that role to an EC2 instance, and launching an instance with an attached role, see “IAM roles for Amazon EC2” in the Amazon EC2 User Guide.

Other common mistakes

If your use case does not use any of these AWS execution environments, you might still experience an impact from this change. We recommend that you examine the roles in your account and identify scenarios where your code (or human use through the AWS CLI) results in a role assuming itself. We provide Amazon Athena and AWS CloudTrail Lake queries later in this post to help you locate instances where a role assumed itself. For each instance, you can evaluate whether a role assuming itself is the right operation for your needs.

Self-assumption with a scoped-down policy

The first pattern we have observed that is not a mistake is the use of self-assumption combined with a scoped-down policy. Some systems use this approach to provide different privileges for different use cases, all using the same underlying role. Customers who choose to continue with this approach can do so by adding the role to its own trust policy. While the use of scoped-down policies and the associated least-privilege approach to permissions is a good idea, we recommend that customers switch to using a second generic role and assume that role along with the scoped-down policy rather than using role self-assumption. This approach provides more clarity in CloudTrail about what is happening, and limits the possible iterations of role assumption to one round, since the second role should not be able to assume the first. Another possible approach in some cases is to limit subsequent assumptions by using an IAM condition in the role trust policy that is no longer satisfied after the first role assumption. For example, for Lambda functions, this would be done by a condition checking for the presence of the “lambda:SourceFunctionArn” property; for EC2, by checking for presence of “ec2:SourceInstanceARN.”

Assuming an expected target compute role during development

Another possible reason for role self-assumption may result from a development practice in which developers attempt to normalize the roles that their code is running in between scenarios in which role credentials are not automatically provided by the environment, and scenarios where they are. For example, imagine a developer is working on code that she expects to run as a Lambda function, but during development is using her laptop to do some initial testing of the code. In order to provide the same execution role as is expected later in production, the developer might configure the role trust policy to allow assumption by a principal readily available on the laptop (an IAM Identity Center role, for example), and then assume the expected Lambda function execution role when the code is initializing. The same approach could be used on a build and test server. Later, when the code is deployed to Lambda, the actual role is already available and in use, but the code need not be modified in order to provide the same post-role-assumption behavior that exists outside of Lambda: the unmodified code can automatically assume what is in this case the same role, and proceed. While this approach is not illogical, as with the scoped-down policy case we recommend that customers configure distinct roles for assumption both in development and test environments as well as later production environments. Again, this approach provides more clarity in CloudTrail about what is happening, and limits the possible iterations of role assumption to one round, since the second role should not be able to assume the first.

What should I do next?

If you receive an email or AWS Health Dashboard notification for an account, we recommend that you review your existing role trust policies and corresponding code. For those roles, you should remove the dependency on the old behavior, or if you can’t, update those role trust policies with an explicit self-referential permission grant. After the grace period expires on February 15, 2023, you will no longer be able to use the implicit self-referential permission grant behavior.

If you currently use the old behavior and need to continue to do so for a short period of time in the context of existing infrastructure as code or other automated processes that create new roles, you can do so by adding the role’s ARN to its own trust policy. We strongly encourage you to treat this as a temporary stop-gap measure, because in almost all cases it should not be necessary for a role to be able to assume itself, and the correct solution is to change the code that results in the unnecessary self-assumption. If for some reason that self-service solution is not sufficient, you can reach out to AWS Support to seek an accommodation of your use case for new roles or accounts.

If you make any necessary code or configuration changes and want to remove roles that are currently allow-listed, you can also ask AWS Support to remove those roles from the allow list so that their behavior follows the new model. Or, as previously noted, you can opt out of the old behavior at any time by creating a new role with a different ARN that has the desired identity-based and trust-policy–based permissions and substituting it for the allow-listed role. Another stop-gap type of option is to add an explicit deny that references the role to its own trust policy.

If you would like to better understand the history of your usage of role self-assumption in a given account or organization, you can follow these instructions on querying CloudTrail data with Athena and then use the following Athena query against your account or organization CloudTrail data, as stored in Amazon Simple Storage Service (Amazon S3). The results of the query can help you understand the scenarios, conditions, and code involved. Depending on the size of your CloudTrail logs, you may need to follow the partitioning instructions to query subsets of your CloudTrail logs sequentially. If this query yields no results, the role self-assumption scenario described in this blog post has never occurred within the analyzed CloudTrail dataset.

SELECT eventid, eventtime,
    userIdentity.sessioncontext.sessionissuer.arn as RoleARN,
    split_part(userIdentity.principalId, ':', 2) as RoleSessionName
from cloudtrail_logs t
CROSS JOIN UNNEST(t.resources) unnested (resources_entry)
where eventSource = 'sts.amazonaws.com'
    and eventName = 'AssumeRole'
    and userIdentity.type = 'AssumedRole'
    and errorcode IS NULL
    and substr(userIdentity.sessioncontext.sessionissuer.arn,12) = substr(unnested.resources_entry.ARN,12)

As another option, you can follow these instructions to set up CloudTrail Lake to perform a similar analysis. CloudTrail Lake allows richer, faster queries without the need to partition the data. As of September 20, 2022, CloudTrail Lake supports import of CloudTrail logs from Amazon S3. This allows you to perform a historical analysis even if you haven’t previously enabled CloudTrail Lake. In the following query, replace $EDS_ID with the ID of your event data store. If the query yields no results, the scenario described in this blog post has never occurred within the analyzed CloudTrail dataset.

SELECT eventid, eventtime,
    userIdentity.sessioncontext.sessionissuer.arn as RoleARN,
    userIdentity.principalId as RoleIdColonRoleSessionName
from $EDS_ID
where eventSource = 'sts.amazonaws.com'
    and eventName = 'AssumeRole'
    and userIdentity.type = 'AssumedRole'
    and errorcode IS NULL
    and userIdentity.sessioncontext.sessionissuer.arn = element_at(resources,1).arn
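If you process CloudTrail records in code rather than with SQL, the same filter can be applied to each record. The following Python sketch mirrors the conditions in the queries above. The function name is hypothetical, and the field names (userIdentity.sessionContext.sessionIssuer.arn, errorCode, and resources[].ARN) are assumed to follow the JSON layout of CloudTrail log files, so verify them against your own records.

```python
def is_self_assumption(record: dict) -> bool:
    """Return True if a CloudTrail record looks like a successful role self-assumption."""
    if record.get("eventSource") != "sts.amazonaws.com":
        return False
    if record.get("eventName") != "AssumeRole":
        return False
    # errorCode is assumed to be absent on successful calls.
    if record.get("errorCode") is not None:
        return False

    identity = record.get("userIdentity") or {}
    if identity.get("type") != "AssumedRole":
        return False

    # ARN of the role whose session made the call.
    issuer = (identity.get("sessionContext") or {}).get("sessionIssuer") or {}
    issuer_arn = issuer.get("arn")
    if not issuer_arn:
        return False

    # ARN(s) of the role being assumed.
    resources = record.get("resources") or []
    return any(entry.get("ARN") == issuer_arn for entry in resources)
```

A caller could apply this to every entry in the "Records" array of a CloudTrail log file and collect the matching role ARNs for review.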

Understanding the change: more details

To better understand the background of this change, we need to review the IAM basics of identity-based policies and resource-based policies, and then explain some subtleties and exceptions. You can find additional overview material in the IAM documentation.

The structure of each IAM policy follows the same basic model: one or more statements with an effect (allow or deny), along with principals, actions, resources, and conditions. Although the identity-based and resource-based policies share the same basic syntax and semantics, the former is associated with a principal, the latter with a resource. The main difference between the two is that identity-based policies do not specify the principal, because that information is supplied implicitly by associating the policy with a given principal. On the other hand, resource policies do not specify an arbitrary resource, because at least the primary identifier of the resource (for example, the bucket identifier of an S3 bucket) is supplied implicitly by associating the policy with that resource. Note that an IAM role is the only kind of AWS object that is both a principal and a resource.

In most cases, access to a resource within the same AWS account can be granted by either an identity-based policy or a resource-based policy. Consider an Amazon S3 example. An identity-based policy attached to an IAM principal that allows the s3:GetObject action does not require an equivalent grant in the S3 bucket resource policy. Conversely, an s3:GetObject permission grant in a bucket’s resource policy is all that is needed to allow a principal in the same account to call the API with respect to that bucket; an equivalent identity-based permission is not required. Either the identity-based policy or the resource-based policy can grant the necessary permission. For more information, see IAM policy types: How and when to use them.

However, in order to more tightly govern access to certain security-sensitive resources, such as AWS Key Management Service (AWS KMS) keys and IAM roles, those resource policies need to grant access to the IAM principal explicitly, even within the same AWS account. A role trust policy is the resource policy associated with a role that specifies which IAM principals can assume the role by using one of the sts:AssumeRole* API calls. For example, in order for RoleB to assume RoleA in the same account, whether or not RoleB’s identity-based policy explicitly allows it to assume RoleA, RoleA’s role trust policy must grant access to RoleB. Within the same account, an identity-based permission by itself is not sufficient to allow assumption of a role. On the other hand, a resource-based permission—a grant of access in the role trust policy—is sufficient. (Note that it’s possible to construct a kind of hybrid permission to a role by using both its resource policy and other identity-based policies. In that case, the role trust policy grants permission to the root principal ARN; after that, the identity-based policy of a principal in that account would need to explicitly grant permission to assume that role. This is analogous to the typical cross-account role trust scenario.)

Until now, there has been a nonintuitive exception to these rules for situations where a role assumes itself. Since a role is both a principal (potentially with an identity-based policy) and a resource (with a resource-based policy), it is in the unique position of being both a subject and an object within the IAM system, as well as being an object owned by itself rather than its containing account. Due to this ownership model, roles with identity-based permission to assume themselves implicitly trusted themselves as resources, and vice versa. That is to say, roles that had the privilege as principals to assume themselves implicitly trusted themselves as resources, without an explicit self-referential Allow in the role trust policy. Conversely, a grant of permission in the role trust policy was sufficient regardless of whether there was a grant in the same role’s identity-based policy. Thus, in the self-assumption case, roles behaved like most other resources in the same account: only a single permission was required to allow role self-assumption, either on the identity side or the resource side of their dual-sided nature. Because of a role’s implicit trust of itself as a resource, the role’s trust policy—which might otherwise limit assumption of the role with properties such as actions and conditions—was not applied, unless it contained an explicit deny of itself.

The following example is a role trust policy attached to the role named RoleA in account 123456789012. It grants explicit access only to the role named RoleB.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/RoleB"
            },
            "Action": ["sts:AssumeRole", "sts:TagSession"],
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalTag/project": "BlueSkyProject"
                }
            }
        }
    ]
}

Assuming that the corresponding identity-based policy for RoleA granted the sts:AssumeRole action with regard to RoleA, this role trust policy provided that there were two roles that could assume RoleA: RoleB (explicitly referenced in the trust policy) and RoleA (assuming it was explicitly referenced in its identity policy). RoleB could assume RoleA only if it had the principal tag project:BlueSkyProject because of the trust policy condition. (The sts:TagSession permission is needed here in case tags need to be added by the caller as part of the role assumption call.) RoleA, on the other hand, did not need to meet that condition because it relied on a different explicit permission: the one granted in the identity-based policy. RoleA would have needed the principal tag project:BlueSkyProject to meet the trust policy condition if and only if it was relying on the trust policy to gain access through the sts:AssumeRole action; that is, in the case where its identity-based policy did not provide the needed privilege.

As we previously noted, after considering feedback from customers on this topic, AWS has decided that requiring self-referential role trust policy grants even in the case where the identity-based policy also grants access is the better approach to delivering consistency and visibility with regard to role behavior and privileges. Therefore, as of today, role assumption behavior requires an explicit self-referential permission in the role trust policy, and the actions and conditions within that policy must also be satisfied, regardless of the permissions expressed in the role’s identity-based policy. (If permissions in the identity-based policy are present, they must also be satisfied.)
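The difference between the old and new behavior can be summarized as a small truth-table model. The following Python sketch is a deliberate simplification for illustration, not the actual IAM evaluation logic. The function and parameter names are hypothetical, and the model ignores explicit denies, permissions boundaries, service control policies, and allow-listed roles.

```python
def old_self_assumption_allowed(identity_allows: bool,
                                trust_allows_self: bool,
                                trust_conditions_met: bool) -> bool:
    """Old behavior: a grant on either side was enough for a role to assume itself."""
    # An identity-based grant bypassed the trust policy's conditions entirely.
    return identity_allows or (trust_allows_self and trust_conditions_met)


def new_self_assumption_allowed(trust_allows_self: bool,
                                trust_conditions_met: bool) -> bool:
    """New behavior: the trust policy must name the role and its conditions must hold."""
    return trust_allows_self and trust_conditions_met


# RoleA has an identity-based grant but is not named in its own trust policy.
print(old_self_assumption_allowed(True, False, False))
print(new_self_assumption_allowed(False, False))
```

In this model, the first call returns True and the second returns False, which corresponds to the RoleA case described above: the identity-based grant alone used to be enough, and now it is not.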

Requiring self-reference in the trust policy makes role trust policy evaluation consistent regardless of which role is seeking to assume the role. Improved consistency makes role permissions easier to understand and manage, whether through human inspection or security tooling. This change also eliminates the possibility of continuing the lifetime of an otherwise temporary credential without explicit, trackable grants of permission in trust policies. It also means that trust policy constraints and conditions are enforced consistently, regardless of which principal is assuming the role. Finally, as previously noted, this change allows customers to create and understand role assumption permissions in a single place (the role trust policy) rather than two places (the role trust policy and the role identity policy). It increases the simplicity of role trust permission management: “what you see [in the trust policy] is what you get.”

Continuing with the preceding example, if you need to allow a role to assume itself, you now must update the role trust policy to explicitly allow both RoleB and RoleA. The RoleA trust policy now looks like the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:role/RoleB",
                    "arn:aws:iam::123456789012:role/RoleA"
                ]
            },
            "Action": ["sts:AssumeRole", "sts:TagSession"],
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalTag/project": "BlueSkyProject"
                }
            }
        }
    ]
}

Without this new principal grant, the role can no longer assume itself. The trust policy conditions are also applied, even if the role still has unconditioned access to itself in its identity-based policy.
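If you manage trust policies in code, adding the self-referential principal can be scripted. The following Python sketch is a minimal illustration under stated assumptions: the function name is hypothetical, it operates on an already-parsed policy document, and it only touches Allow statements that grant sts:AssumeRole to AWS principals.

```python
import copy


def add_self_to_trust_policy(trust_policy: dict, role_arn: str) -> dict:
    """Return a copy of trust_policy in which role_arn is listed as a trusted principal."""
    updated = copy.deepcopy(trust_policy)
    for statement in updated.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        # "Action" may be a single string or a list of strings.
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal")
        if not isinstance(principal, dict) or "AWS" not in principal:
            continue
        # "AWS" may also be a single ARN or a list of ARNs.
        aws = principal["AWS"]
        arns = [aws] if isinstance(aws, str) else list(aws)
        if role_arn not in arns:
            arns.append(role_arn)
        principal["AWS"] = arns
    return updated
```

With boto3, the updated document could then be applied by passing `json.dumps(updated)` as the PolicyDocument argument of the IAM `update_assume_role_policy` call. Because the statement's conditions are left in place, the role must also satisfy them when it assumes itself.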

Conclusion

In this blog post, we’ve reviewed the old and new behavior of role assumption in the case where a role seeks to assume itself. We’ve seen that, according to our analysis of service metadata, the vast majority of role self-assumption behavior that relies solely on identity-based privileges is unnecessary, because the code (or human) that calls sts:AssumeRole is already, without realizing it, using the role’s credentials to call the AWS STS API. Eliminating that mistake will improve performance and decrease resource consumption. We’ve also explained in more depth the reasons for the old behavior and the reasons for making the change, and provided Athena and CloudTrail Lake queries that you can use to examine past or (in the case of allow-listed roles) current self-assumption behavior in your own environments. You can reach out to AWS Support or your customer account team if you need help in this effort.

If you currently use the old behavior and need to continue to do so, your primary option is to create an explicit allow for the role in its own trust policy. If that option doesn’t work due to operational constraints, you can reach out to AWS Support to seek an accommodation of your use case for new roles or new accounts. You can also ask AWS Support to remove roles from the allow-list if you want their behavior to follow the new model.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new IAM-tagged discussion on AWS re:Post or contact AWS Support.

AWS would like to thank several customers and partners who highlighted this behavior as something they found surprising and unhelpful, and asked us to consider making this change. We would also like to thank independent security researcher Ryan Gerstenkorn who engaged with AWS on this topic and worked with us prior to this update.

Want more AWS Security news? Follow us on Twitter.

Sign Amazon SNS messages with SHA256 hashing for HTTP subscriptions

2022-09-19 Daniel Caminhas

Post Syndicated from Daniel Caminhas original https://aws.amazon.com/blogs/security/sign-amazon-sns-messages-with-sha256-hashing-for-http-subscriptions/

Amazon Simple Notification Service (Amazon SNS) now supports message signatures based on Secure Hash Algorithm 256 (SHA256) hashing. Amazon SNS signs the messages that are delivered from your Amazon SNS topic so that subscribed HTTP endpoints can verify the authenticity of the messages. In this blog post, we will show you how to enable message signatures based on SHA256 for your Amazon SNS topics.

About message signing verification

To verify the authenticity of a message sent to your HTTP endpoint by Amazon SNS, you can verify the message signature. There are two cases where we recommend verifying the authenticity of the message. The first is when Amazon SNS sends a message to an HTTP endpoint that you subscribed to a topic. The second is when Amazon SNS sends a confirmation message to your HTTP endpoint after the Subscribe or the Unsubscribe API actions. For more information, see Verifying the signatures of Amazon SNS messages in the Amazon SNS Developer Guide.

Amazon SNS now supports two message signature versions:

Signature version 1 – Amazon SNS creates the signature based on the SHA1 hash of the message.
Signature version 2 – Amazon SNS creates the signature based on the SHA256 hash of the message.

Amazon SNS adds the SignatureVersion property to the JSON payload of messages delivered to HTTP endpoints, as shown in the following code snippet. For more information on the JSON payload format, see Parsing message formats in the Amazon SNS Developer Guide.

{
  "Type" : "Notification",
  "MessageId" : "22b80b92-fdea-4c2c-8f9d-bdfb0c7bf324",
  "TopicArn" : "arn:aws:sns:us-west-2:123456789012:MyTopic",
  "Subject" : "My First Message",
  "Message" : "Hello world!",
  "Timestamp" : "2022-08-02T00:54:06.655Z",
  "SignatureVersion" : "2",
  "Signature" : "EXAMPLEw6JRN...",
  "SigningCertURL" : "https://sns.us-west-2.amazonaws.com/SimpleNotificationService-f3ecfb7224c7233fe7bb5f59f96de52f.pem",
  "UnsubscribeURL" : "https://sns.us-west-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-west-2:123456789012:MyTopic:c9135db0-26c4-47ec-8998-413945fb5a96"
}
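An endpoint that verifies signatures can use the SignatureVersion property to decide which hash algorithm to apply. The following Python sketch shows only that selection step and is not a complete verifier. The function names are hypothetical, and the mapping comes from the two signature versions listed above.

```python
import hashlib

# Signature version 1 uses SHA1; signature version 2 uses SHA256.
HASH_BY_SIGNATURE_VERSION = {"1": "sha1", "2": "sha256"}


def hash_name_for_message(message: dict) -> str:
    """Return the hashlib algorithm name that matches the message's SignatureVersion."""
    version = str(message.get("SignatureVersion"))
    if version not in HASH_BY_SIGNATURE_VERSION:
        raise ValueError(f"Unsupported SignatureVersion: {version!r}")
    return HASH_BY_SIGNATURE_VERSION[version]


def new_hasher(message: dict):
    """Create a hashlib object for the algorithm the message was signed with."""
    return hashlib.new(hash_name_for_message(message))
```

Rejecting unknown versions instead of falling back to a default keeps the endpoint from checking a signature with the wrong algorithm.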

What to consider before you enable message signatures based on SHA256 for your Amazon SNS topic

As an Amazon SNS topic owner, before you enable SHA256 support for your topic, we recommend communicating with the owners of the HTTP endpoints that are subscribed to your topic. They might need to update their message signature verification logic to accommodate the new signature version. If the endpoint owners are using the AWS SDK feature for verifying the Amazon SNS message signatures, they need to make sure that they are using one of the following versions of the AWS SDK: Java 1.12.285, JavaScript 0.3.5, Ruby 1.54.0, PHP 1.8.0 or .NET 3.7.3.96.

How to enable message signatures based on SHA256 for your Amazon SNS topic

By default, Amazon SNS topics use SHA1 for hashing the message signature. You can enable SHA256 support for your topic by setting the topic attribute SignatureVersion to 2 by using the AWS Software Development Kit (AWS SDK) or the AWS Command Line Interface (AWS CLI).

The following code example shows how to set the topic attribute SignatureVersion by using the AWS CLI.

aws sns set-topic-attributes \
    --topic-arn arn:aws:sns:us-west-2:123456789012:MyTopic \
    --attribute-name SignatureVersion \
    --attribute-value 2

The following code example shows how to set the SignatureVersion attribute by using the AWS SDK for Java.

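import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.SetTopicAttributesRequest;
import software.amazon.awssdk.services.sns.model.SetTopicAttributesResponse;
import software.amazon.awssdk.services.sns.model.SnsException;

public class EnableSHA256SupportExample {

    // Sets the SignatureVersion topic attribute to "2" so that Amazon SNS
    // signs messages for this topic by using SHA256.
    public static void enableSHA256Support(SnsClient snsClient, String topicArn) {
        try {
            SetTopicAttributesRequest request = SetTopicAttributesRequest.builder()
                .attributeName("SignatureVersion")
                .attributeValue("2")
                .topicArn(topicArn)
                .build();

            SetTopicAttributesResponse result = snsClient.setTopicAttributes(request);
            System.out.println("\n\nStatus was " + result.sdkHttpResponse().statusCode() + "\n\nTopic " + request.topicArn()
                + " updated " + request.attributeName() + " to " + request.attributeValue());
        } catch (SnsException e) {
            // Print the service error message if the update fails.
            System.err.println(e.awsErrorDetails().errorMessage());
        }
    }
}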

Conclusion

Amazon SNS topic owners can now enable message signatures based on SHA256 hashing. In this post, you learned how to choose the hashing algorithm, either SHA256 or SHA1, for your SNS topic. For more information, see Verifying the signatures of Amazon SNS messages in the Amazon SNS Developer Guide, and SetTopicAttributes in the Amazon SNS API Reference.

For more serverless learning resources, visit Serverless Land.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

10 reasons to import a certificate into AWS Certificate Manager (ACM)

2022-09-16 Nicholas Doropoulos

Post Syndicated from Nicholas Doropoulos original https://aws.amazon.com/blogs/security/10-reasons-to-import-a-certificate-into-aws-certificate-manager-acm/

AWS Certificate Manager (ACM) is a service that lets you efficiently provision, manage, and deploy public and private SSL/TLS certificates for use with AWS services and your internal connected resources. The certificates issued by ACM can then be used to secure network communications and establish the identity of websites on the internet or resources on private networks.

So why might you want to import a certificate into ACM, rather than using a certificate issued by ACM? According to the AWS Certificate Manager User Guide topic Importing certificates into AWS Certificate Manager, “you might do this because you already have a certificate from a third-party certificate authority (CA), or because you have application-specific requirements that are not met by ACM issued certificates.”

In this blog post, I’ll list 10 reasons why you might want to import a certificate into ACM, including what specific requirements you might have, and why you might want to use a certificate signed by a third-party CA in the first place.

1. To use an ECDSA certificate for faster TLS connections

Imported Elliptic Curve Digital Signature Algorithm (ECDSA) certificates use smaller keys than ACM-issued public RSA certificates, allowing for TLS connections to be established faster. For this reason, ECDSA certificates are particularly useful for systems with limited processing resources, such as Internet of Things (IoT) devices. ACM supports imported certificates with ECDSA in 256-, 384-, and 521-bit variations. If you want to use an ECDSA certificate for your public-facing web application, you need to get a third-party certificate and then import it into ACM. For more information about supported cryptographic algorithms for imported certificates, see Prerequisites for importing certificates in the AWS Certificate Manager User Guide.

2. To control your certificate’s renewal cycle

When you import a certificate into ACM, you have greater control over its renewal cycle simply because you can re-import it as frequently as you want. You also have control over how often your imported certificate’s private key can be rotated. As a best practice, you should rotate your certificate’s private key based on your certificate’s usage frequency.

Note: When you re-import your certificate, to maintain the existing associations during renewal, ensure that you specify the existing certificate’s Amazon Resource Name (ARN). For more information and step-by-step instructions, see Reimporting a certificate in the AWS Certificate Manager User Guide.
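The difference between a first import and a re-import comes down to whether the existing certificate’s ARN is supplied. The following Python sketch builds the arguments for the ACM ImportCertificate API action. The helper name and its parameter names are hypothetical, and the certificate material is assumed to be PEM-encoded bytes that you have already loaded.

```python
from typing import Optional


def import_certificate_kwargs(certificate_pem: bytes,
                              private_key_pem: bytes,
                              chain_pem: Optional[bytes] = None,
                              existing_certificate_arn: Optional[str] = None) -> dict:
    """Build keyword arguments for the ACM ImportCertificate API action."""
    kwargs = {
        "Certificate": certificate_pem,
        "PrivateKey": private_key_pem,
    }
    if chain_pem:
        kwargs["CertificateChain"] = chain_pem
    if existing_certificate_arn:
        # Supplying the ARN replaces the certificate in place,
        # which keeps its existing service associations.
        kwargs["CertificateArn"] = existing_certificate_arn
    return kwargs
```

With boto3, the result could be passed as `boto3.client("acm").import_certificate(**kwargs)`. Omitting the ARN creates a new certificate with a new ARN, which would then need to be associated with your resources separately.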

3. To use certificate pinning

You might have an application that requires certificate pinning, which is the practice of bypassing the typical hierarchical model of trust that is governed by certificate authorities. With certificate pinning, a host’s identity is trusted based on a specific certificate or public key. As a certificate pinning best practice, AWS recommends that public certificates issued by ACM should not be pinned, because ACM will generate a new public/private key pair at the next renewal, which replaces the pinned certificate with a new one and causes service disruption in the process. If you want to use certificate pinning, you can pin an imported certificate because imported certificates are not subject to managed renewal, thereby reducing the risk of production impact.

4. To use a higher-assurance certificate

You might want to use a higher-assurance certificate, such as an organization validation (OV) or extended validation (EV) certificate. Certificates issued by ACM currently support only domain validation (DV). If the application on the domain you want to protect requires an OV or EV certificate, you can obtain a certificate of either type from a third-party CA and import it into ACM by using the ACM API action ImportCertificate.

5. To use a self-signed certificate

For internal testing environments where your developers want speed and flexibility, self-signed certificates can be issued quickly and with little effort. However, it’s important to know that self-signed certificates are not trusted by default, which means that they need to be installed in the trust stores of the intended clients, to avoid the risk of your users getting into the habit of ignoring browser warnings. For more information, see the additional requirements for self-signed certificates in Prerequisites for importing certificates in the AWS Certificate Manager User Guide.

6. To use an IP address for the certificate’s subject

By design, the subject field of an ACM certificate can only identify a fully qualified domain name (FQDN). If you want to use an IP address for the certificate’s subject, then you can create the certificate and import it to ACM.

7. To exceed the number of domains allowed by the ACM quotas

Certificates issued by ACM are subject to the ACM service quotas. The default quota for ACM is 10 domain names for each ACM certificate, and you can request an increase to the quota up to a maximum of 100 domain names for each certificate. However, if you import certificates, they are not subject to the quotas, and you can use a public certificate with more than 100 FQDNs in its domain scope without having to go through the process of requesting any limit increases.

8. To use a private certificate issued by ACM Private CA with the IssueCertificate API action

Certificates provisioned with the IssueCertificate API action have a private status and cannot be associated directly with an AWS integrated service, such as an internal Application Load Balancer. Instead, a private certificate issued by AWS Certificate Manager Private Certificate Authority (ACM Private CA) with the IssueCertificate API action needs to be exported and then imported into ACM before the association can be made. The same is true for certificate templates as well, which are configuration templates that can be passed as parameters to the IssueCertificate API action as a means to have greater control over the private certificate’s extensions.

9. To use a private certificate issued by your on-premises CA

You might want to use a private certificate issued by your on-premises CA instead of using ACM Private CA. To administer your internal public key infrastructure (PKI), AWS generally recommends that you use ACM Private CA. However, you might still come across scenarios where a certificate signed by your on-premises CA is better suited for your specific needs. For example, you might want to have a common root of trust, for consistency and interoperability purposes across a hybrid PKI solution. Furthermore, using an external parent CA with ACM Private CA also allows you to enforce CA name constraints. For more information, see Signing private CA certificates with an external CA in the AWS Certificate Manager Private Certificate Authority User Guide.

10. To use a certificate for something other than securing a public website

In addition to securing a public website, you can use certificates for other purposes. For example, you can import client and server certificates as part of an OpenVPN setup. For more information about this example, see How can I generate server and client certificates and their respective keys on a Windows server and upload them to AWS Certificate Manager (ACM)? In addition, you can import a code-signing certificate for use with AWS IoT Device Management. For more information about how to import a code-signing certificate, see (For IoT only) Obtain and import a code-signing certificate in the AWS Signer Developer Guide.

Conclusion

In this blog post, you learned about some of the reasons you might want to import a certificate into AWS Certificate Manager (ACM). For more information about importing certificates into ACM and step-by-step instructions, see Importing certificates into AWS Certificate Manager in the AWS Certificate Manager User Guide. For the latest pricing information, see the AWS Certificate Manager Pricing page on the AWS website. You can also use the AWS pricing calculator to estimate costs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Certificate Manager forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Hazard analysis and Chaos engineering at Vanguard Group

2022-09-16 Jason Barto

Post Syndicated from Jason Barto original https://aws.amazon.com/blogs/devops/hazard-analysis-and-chaos-engineering-at-vanguard-group/

Anticipating events that can cause a disruption to your system’s service is critical to building highly available, reliable systems. Hazard analysis gives you a method to identify such events. Chaos engineering gives you a method to confirm that a system behaves as expected in adverse conditions. By combining these methods, Vanguard is building reliability into their systems.

Vanguard engineering teams perform hazard analysis on their systems and capture the identified events as failure scenarios. They use the identified failure scenarios to create hypotheses to support chaos engineering experiments. These hypotheses predict how the system will respond to failures and each hypothesis is then confirmed through experimentation to increase the team’s confidence in the system’s reliability.

In this article we will walk you through how Vanguard uses hazard analysis and chaos engineering. We will also provide guidance on how you can employ these techniques on your applications.

Failure Mode & Effects Analysis

A hazard analysis can be performed using different methods. At Vanguard, they have adapted the failure mode & effects analysis (FMEA) method to support their important services.

FMEA is a bottom-up approach to analyse an architecture and focus on the impact to system functions when one or more components of the system are disrupted. Members of the engineering team and architects responsible for designing and building a system brainstorm possible failure scenarios or failure modes, and document the impact of these failures on the system. Combined with a quantitative method for ranking the failure modes, the analysis process produces a prioritised list of failure modes which describes how the system would respond to individual or combined failures in its component parts or dependencies.

For each failure mode the team conducting the analysis will highlight what protections exist within the system to guard against the failure mode. Sometimes, fault isolation boundaries have been put in place to prevent client impact in failure scenarios. In other scenarios, for one reason or another, there are hard dependencies in place for which the engineering team has decided not to build in fault tolerance. For example, a team responsible for a less-critical function may have architected its system to operate across multiple availability zones, but could decide not to implement other mitigations to prioritize cost over increased resilience.

The FMEA method has been in use by engineers in the automotive, aeronautical, healthcare, and military industries for more than 60 years. Over that time, FMEA has been modified to best suit the organization and the field in which it was applied. In many variations the FMEA measures each failure mode with a risk priority number (RPN), which is intended to quantitatively rank the failure mode based upon:

The failure mode’s impact to the system as a whole
The probability of the failure mode’s occurrence
How easily the failure mode can be detected
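In many FMEA variations, the RPN is calculated as the product of three ratings, one for each of the factors above, each commonly scored on a scale of 1 to 10. The following Python sketch shows that calculation and the resulting ranking. The class name, the rating values, and the second failure mode are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # impact on the system as a whole (higher is worse)
    occurrence: int  # probability of occurrence (higher is more likely)
    detection: int   # difficulty of detection (higher is harder to detect)

    @property
    def rpn(self) -> int:
        """Risk priority number: the product of the three ratings."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


modes = [
    FailureMode("Microservice is unavailable or returns an error", 8, 3, 2),
    FailureMode("Dependency responds slowly", 5, 6, 4),
]
for mode in rank_by_rpn(modes):
    print(mode.rpn, mode.name)
```

With these illustrative ratings, the slow dependency ranks first (5 × 6 × 4 = 120) ahead of the unavailable microservice (8 × 3 × 2 = 48), even though the latter has the higher severity rating.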

Vanguard have adapted the FMEA process to serve their own specific requirements and processes. Vanguard have decided not to adopt the RPN element of the FMEA process, as teams found they spent a lot of time debating the impact, probability, and detectability of individual failure modes. To perform an FMEA more quickly, teams instead focus on the failure modes and system impact only, documenting a mental model of system performance which can be tested through chaos engineering experiments.

An excerpt of a Vanguard FMEA output is provided as an example in the following table:

The “Process Step” in the table above refers to a business function of the system being analyzed, for example “Request to retrieve stored data”. As part of the analysis, the team identifies the system components needed to perform the Process Step and considers the interactions of those components. Focusing on a Process Step makes it easier to anticipate the failure scenarios that would affect the system in performing this particular business function. Also, the Process Step will imply an importance or criticality which can be a factor when prioritizing mitigations.

After selecting a Process Step, you walk through the system components involved and identify how component failures or disruptions will affect the wider system. Such component failures may involve individual components or a combination of components and are captured as “Failure Mode”. This identifies the component or components that are disrupted and their behaviour; for example, “Microservice is unavailable or returns an error”.

“Expected Behaviour” describes the effect of the failure mode on the wider system, in the context of the Process Step. This captures what other system components are affected by the Failure Mode and why, and how this impacts the Process Step as a whole.

Lastly, the “Hypothesis” column forms the basis for the chaos experiments that will follow from the FMEA to confirm that the system performs as expected.
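One way to keep such an FMEA output machine-readable is to store each row as a typed record, so that the hypotheses can later be handed to a chaos engineering workflow. The following is a minimal sketch under that assumption. The field names mirror the four columns described above, the first two example values are the ones quoted in the text, and the remaining values and the helper function are invented for illustration.

```typescript
// One row of the FMEA output, mirroring the four columns described in the text.
interface FmeaRow {
  processStep: string;
  failureMode: string;
  expectedBehaviour: string;
  hypothesis: string;
}

const fmeaRows: FmeaRow[] = [
  {
    processStep: 'Request to retrieve stored data',                 // quoted from the text
    failureMode: 'Microservice is unavailable or returns an error', // quoted from the text
    // The next two values are invented for illustration only.
    expectedBehaviour: 'Callers receive an error response; other process steps are unaffected',
    hypothesis: 'If the microservice is unavailable, other process steps stay at steady state',
  },
];

// Group hypotheses by process step so each business function has its own
// list of candidate chaos experiments.
function hypothesesByProcessStep(fmea: FmeaRow[]): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const row of fmea) {
    const list = grouped.get(row.processStep) ?? [];
    list.push(row.hypothesis);
    grouped.set(row.processStep, list);
  }
  return grouped;
}

console.log(hypothesesByProcessStep(fmeaRows));
```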

At Vanguard, all mission-critical product teams are conducting FMEAs for their production applications. The outputs of these sessions are maintained over time and serve multiple purposes:

When onboarding new team members, it is helpful to provide the FMEA document alongside an architecture diagram and narrative. It will paint a more robust picture of how the system is intended to operate in both “happy path” and “unhappy path” scenarios.
When troubleshooting incidents, an FMEA document can help on-call engineers – especially those less experienced with debugging – to match up the documented expectations to the observed system behavior.
Site Reliability Engineers (SREs) looking for opportunities to improve the resilience of a system might look to FMEA documentation to understand the existing fault isolation boundaries and introduce additional resilience mechanisms through automation and system changes.
Finally, when selecting scenarios for experimentation with Chaos Engineering, the FMEA document provides a list of conjectures that have been mapped to hypotheses, ready to be validated through experimentation. This input into the Chaos Engineering workflow is the primary use of FMEA documents for Vanguard product teams.

There are many resources available online to learn more about how FMEA is used and applied in other organisations. In Failure Modes and Continuous Resilience, Adrian Cockcroft introduces FMEA as a method for anticipating failure scenarios. The NASA Software Engineering Handbook details how FMEAs are conducted as part of their engineering process. The Automotive Industry Action Group has also formally documented the use of FMEA in the Automotive Industry Action Group FMEA Handbook.

Chaos Engineering

After failure modes have been identified and mitigated through system design, it’s time to understand how resilient the system’s implementation is to those failure modes. Chaos engineering can be used to explore a system and validate that a system’s implementation meets business resiliency objectives.

Chaos engineering helps to improve a team’s mental model about the system under experimentation and provides insights into how a complex system behaves under adverse conditions. It also enables an engineer to find the unknown unknowns and the known unknowns through experiments that are built on top of the hypothesis. These experiments should simulate real world events, such as network degradation and increased client requests, and the outcome of the experiment should not be known. In other words, an experiment is not an experiment if it’s known that the conditions will cause the system to fail.

Prerequisites to Chaos Experiments at Vanguard

At Vanguard, there are some necessary prerequisites to running a chaos experiment. Firstly, the system under experiment must be set up with some basic observability tooling that will allow teams to monitor the state of the application during the failure injection. This could be as simple as an Amazon CloudWatch dashboard and some associated alarms, or as elaborate as a dedicated dashboard set up in a vendor tool.

Secondly, teams must be able to drive load to the application during the experiment; depending on the experiment type, the level and type of load may vary. The load generator can be as simple as a script on someone’s machine, or a fully automated load test depending on the requirements of the hypothesis.

Finally, teams need to have a good understanding of what the application’s “steady state” looks like. Ideally, this takes the form of some metrics such as expected error rate, expected latency, and/or a service level objective (SLO) that can be monitored throughout the duration of the experiment. For example, a service level objective for a RESTful API might be that 90% of requests should receive a response within 100 milliseconds.
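To make the example SLO concrete, the following sketch checks a batch of latency samples against the target of 90% of requests within 100 milliseconds. The function names and sample values are hypothetical; in practice the samples would come from your observability tooling rather than a hard-coded array.

```typescript
// Fraction of requests that completed within the latency threshold.
function fractionWithin(latenciesMs: number[], thresholdMs: number): number {
  if (latenciesMs.length === 0) {
    throw new Error('no latency samples to evaluate');
  }
  const fast = latenciesMs.filter((ms) => ms <= thresholdMs).length;
  return fast / latenciesMs.length;
}

// Defaults encode the SLO from the text: 90% of requests answered within 100 ms.
function meetsSlo(latenciesMs: number[], thresholdMs = 100, target = 0.9): boolean {
  return fractionWithin(latenciesMs, thresholdMs) >= target;
}

// Hypothetical samples: nine of the ten requests finish within 100 ms.
const latencySamples = [42, 55, 61, 70, 73, 80, 88, 95, 99, 240];
console.log(meetsSlo(latencySamples));
```

Evaluating the same check before, during, and after the failure injection gives a simple signal for whether the system stayed at steady state.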

With the prerequisites met and a completed FMEA, teams can then experiment with their hypothesis using various experiment templates defined by Vanguard’s Climate of Chaos tooling.

Vanguard’s Climate of Chaos

At Vanguard, ensuring that software systems are resilient to adverse events is a critical part of the ongoing mission to provide world-class service to clients. Vanguard believes that in order to develop high quality software, one must plan for the inevitable “stormy weather” events that occur in a distributed system.

Over the past 2 years, as a response to this need, Vanguard has developed in-house tooling called “The Climate of Chaos” to give teams easy access to common experiment templates, along with a friendly user interface. The Climate of Chaos helps developers experiment on their systems and validate the hypotheses generated from FMEAs. It also provides the tooling for them to simulate the most common failure scenarios on Vanguard’s most commonly utilized AWS infrastructure, including Amazon Elastic Container Service (Amazon ECS), AWS Fargate, Amazon DynamoDB, Amazon Relational Database Service (Amazon RDS), AWS Lambda, and others.

The Climate of Chaos was created prior to Amazon’s release of the AWS Fault Injection Simulator (FIS), and today there is a lot of overlap with the experiment capabilities available in FIS. The Climate of Chaos has also been enhanced with company-specific features and integrations that make it easier for Vanguard developers to run chaos experiments in a controlled and predictable manner.

The Climate of Chaos includes important safety features such as an “emergency stop” function. This feature enables teams to terminate the experiment immediately if unintended side effects are encountered, rolling back the simulated events to resume steady state operation. The Climate of Chaos has been coupled with other systems, such as in-house load testing tooling, and has gained features like the ability to monitor CloudWatch alarms. Vanguard also offers teams the ability to schedule experiments to run at their convenience. Soon, Vanguard hopes to make running chaos experiments even smarter, introducing tools that will help teams run bulk experiments that systematically inject failures on a group of related applications to help pinpoint more complex failure modes.
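The general pattern behind an emergency stop can be sketched as a small experiment runner that always rolls back the injected failure, whichever way the experiment ends. This is not Vanguard’s implementation; the interface, function, and outcome names below are hypothetical, and a real runner would wait between checks and read actual alarms.

```typescript
// Hypothetical experiment shape; none of these names come from Vanguard's tooling.
interface ChaosExperiment {
  inject: () => Promise<void>;           // start the simulated failure
  rollback: () => Promise<void>;         // undo it and return to steady state
  steadyStateOk: () => Promise<boolean>; // for example, backed by an alarm or SLO check
}

type Outcome = 'completed' | 'stopped' | 'steady-state-violated';

async function runExperiment(
  exp: ChaosExperiment,
  checks: number,
  stopRequested: () => boolean, // the "emergency stop" switch
): Promise<Outcome> {
  await exp.inject();
  try {
    for (let i = 0; i < checks; i++) {
      if (stopRequested()) return 'stopped';
      if (!(await exp.steadyStateOk())) return 'steady-state-violated';
    }
    return 'completed';
  } finally {
    // Runs on every exit path after injection, so the fault is always rolled back.
    await exp.rollback();
  }
}
```

Placing the rollback in a `finally` block means that a stop request, a steady-state violation, and an unexpected error in a check all take the same cleanup path.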

Next Steps

Failure modes and effects analysis is a hazard analysis method which can help you identify single and combined points of failure in your system so you can prioritize the failure modes. To learn more about the FMEA process, you can read the NASA Software Engineering Handbook which outlines how they perform FMEA on their software-based systems. The AWS Whitepaper Building Mission-Critical Financial Services Applications on AWS provides example forms and suggestions for severity, probability, and detectability rankings. Appendix F in the whitepaper suggests a 1 to 10 ranking for each Risk Priority Number input, and the example spreadsheets recommend performing FMEAs for the application, platform, infrastructure, and operation layers of the system. Using these examples, you can perform an analysis of your own systems and generate hypotheses.

To experiment on your systems and validate your own hypotheses, you can use the AWS Fault Injection Simulator (FIS) mentioned earlier in this article. FIS provides you with a framework for performing controlled chaos experiments on your AWS workloads. It helps you to safely manage your experiments by providing tooling to monitor, roll back, and orchestrate chaos experiments. FIS provides the fault injection mechanisms that you will need to test your system’s implementation and its resilience to identified failure modes. You can start by running experiments in pre-production environments, and then step up to running them as part of your CI/CD workflow and ultimately in your production environment. To learn more about FIS, you can read the FIS User Guide and FIS tutorials.

By using FMEA to anticipate the failures and experimenting on your systems with chaos engineering, you will gain confidence in the reliability of your system.

The content and opinions in this post are those of The Vanguard Group and AWS is not responsible for the content or accuracy of this post.

Amazon introduces dynamic intermediate certificate authorities

2022-09-15 Adina Lozada

Post Syndicated from Adina Lozada original https://aws.amazon.com/blogs/security/amazon-introduces-dynamic-intermediate-certificate-authorities/

AWS Certificate Manager (ACM) is a managed service that lets you provision, manage, and deploy public and private Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates for use with Amazon Web Services (AWS) and your internal connected resources. Starting October 11, 2022, at 9:00 AM Pacific Time, public certificates obtained through ACM will be issued from one of the multiple intermediate certificate authorities (CAs) that Amazon manages. In this blog post, we share important details about this change and how you can prepare.

What is changing and why?

Public certificates that you request through ACM are obtained from Amazon Trust Services, which is a public certificate authority (CA) that Amazon manages. Like other public CAs, Amazon Trust Services CAs have a structured trust hierarchy. The public certificate issued to you, also known as the leaf certificate, can chain to one or more intermediate CAs and then to the Amazon Trust Services root CA. The Amazon Trust Services root CA is trusted by default by most web browsers and operating systems. This is why Amazon can issue public certificates that are trusted by these systems.

Starting October 11, 2022 at 9:00 AM Pacific Time, public certificates obtained through ACM will be issued from one of the multiple intermediate CAs that Amazon manages. These intermediate CAs chain to an existing Amazon Trust Services root CA. With this change, leaf certificates issued to you will be signed by different intermediate CAs. Before this change, Amazon maintained a limited number of intermediate CAs and issued and renewed certificates from the same intermediate CAs.

Amazon is making this change to create a more resilient and agile certificate infrastructure that will help us respond more quickly to future requirements. This change also presents an opportunity to correct a known issue related to delayed revocation of a subordinate CA and help minimize the scope of impact for new risks that might emerge in the future.

What can I do to prepare?

Most customers won’t experience an impact from this change. Browsers and most applications will continue to work just as they do now, because these services trust the Amazon Trust Services root CA and not a specific intermediate CA. If you’re using one of the standard operating systems and web browsers that are listed in the next section of this post, you don’t need to take any action.

If you use intermediate CA information through certificate pinning, you will need to make changes and pin to an Amazon Trust Services root CA instead of an intermediate CA or leaf certificate. Certificate pinning is a process in which your application that initiates the TLS connection only trusts a specific public certificate through one or more certificate variables that you define. If the pinned certificate is replaced, your application won’t initiate the connection. AWS recommends that you don’t use certificate pinning because it introduces an availability risk. However, if your use case requires certificate pinning, AWS recommends that you pin to an Amazon Trust Services root CA instead of an intermediate CA or leaf certificate. When you pin to an Amazon Trust Services root CA, you should pin to all of the root CAs shown in the following table.

Amazon Trust Services root CA certificates

Distinguished name	SHA-256 hash of subject public key information	Test URL
CN=Amazon Root CA 1,O=Amazon,C=US	fbe3018031f9586bcbf41727e417b7d1c45c2f47f93be372a17b96b50757d5a2	Test URL
CN=Amazon Root CA 2,O=Amazon,C=US	7f4296fc5b6a4e3b35d3c369623e364ab1af381d8fa7121533c9d6c633ea2461	Test URL
CN=Amazon Root CA 3,O=Amazon,C=US	36abc32656acfc645c61b71613c4bf21c787f5cabbee48348d58597803d7abc9	Test URL
CN=Amazon Root CA 4,O=Amazon,C=US	f7ecded5c66047d28ed6466b543c40e0743abe81d109254dcf845d4c2c7853c5	Test URL

To test that your trust store contains the Amazon Trust Services root CA, see the preceding table, which lists the Amazon Trust Services root CA certificates, and choose each test URL in the table. If the test URL works, you should see a message that says Expected Status: Good, along with the certificate chain. If the test URL doesn’t work, you will receive an error message that indicates the connection has failed.
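If your use case requires pinning, the comparison against the root CA hashes can be sketched in Node.js as follows. This is a minimal sketch that assumes the hashes in the preceding table are hex-encoded SHA-256 digests of the DER-encoded subject public key information. The function and constant names are hypothetical, and obtaining the root certificate from the peer’s chain is outside the scope of the sketch.

```typescript
import { createHash, X509Certificate } from 'node:crypto';

// SHA-256 hashes of subject public key information, copied from the table above.
const AMAZON_ROOT_SPKI_SHA256 = new Set<string>([
  'fbe3018031f9586bcbf41727e417b7d1c45c2f47f93be372a17b96b50757d5a2', // Amazon Root CA 1
  '7f4296fc5b6a4e3b35d3c369623e364ab1af381d8fa7121533c9d6c633ea2461', // Amazon Root CA 2
  '36abc32656acfc645c61b71613c4bf21c787f5cabbee48348d58597803d7abc9', // Amazon Root CA 3
  'f7ecded5c66047d28ed6466b543c40e0743abe81d109254dcf845d4c2c7853c5', // Amazon Root CA 4
]);

// Hex SHA-256 digest of a DER-encoded SubjectPublicKeyInfo structure.
function spkiSha256(spkiDer: Buffer): string {
  return createHash('sha256').update(spkiDer).digest('hex');
}

// Extract the public key of a PEM- or DER-encoded certificate and hash it.
function certificateSpkiSha256(cert: string | Buffer): string {
  const spkiDer = new X509Certificate(cert).publicKey.export({ type: 'spki', format: 'der' });
  return spkiSha256(spkiDer);
}

// True only if the certificate's public key matches one of the four pinned roots.
function isPinnedAmazonRoot(cert: string | Buffer): boolean {
  return AMAZON_ROOT_SPKI_SHA256.has(certificateSpkiSha256(cert));
}
```

Pinning to all four roots in one set, rather than a single root, follows the recommendation above and avoids an outage if issuance moves to a different root.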

What should I do if the Amazon Trust Services CAs are not in my trust store?

If your application is using a custom trust store, you must add the Amazon Trust Services root CAs to your application’s trust store. The instructions for doing this vary based on the application or service. Refer to the documentation for the application or service that you’re using.

If your tests of any of the test URLs failed, you must update your trust store. The simplest way to update your trust store is to upgrade the operating system or browser that you’re using.

The following operating systems use the Amazon Trust Services CAs:

Amazon Linux (all versions)
Microsoft Windows versions, with updates installed, from January 2005, Windows Vista, Windows 7, Windows Server 2008, and newer versions
Mac OS X 10.4 with Java for Mac OS X 10.4 Release 5, Mac OS X 10.5, and newer versions
Red Hat Enterprise Linux 5 (March 2007 release), 6, and 7, and CentOS 5, 6, and 7
Ubuntu 8.10
Debian 5.0
Java 1.4.2_12, Java 5 update 2, and all newer versions, including Java 6, Java 7, and Java 8

Modern browsers trust Amazon Trust Services CAs. To update the certificate bundle in your browser, update your browser. For instructions on how to update your browser, see the update page for your browser:

Chrome
Firefox
Safari
The Windows operating system manages certificate bundles for Internet Explorer and Microsoft Edge, so to update your browser, you must update Windows.

Where can I get help?

If you have questions, contact AWS Support or your technical account manager (TAM), or start a new thread on the AWS re:Post ACM Forum. If you have feedback about this post, submit comments in the Comments section below.

DevOps with serverless Jenkins and AWS Cloud Development Kit (AWS CDK)

2022-09-12 sangusah

Post Syndicated from sangusah original https://aws.amazon.com/blogs/devops/devops-with-serverless-jenkins-and-aws-cloud-development-kit-aws-cdk/

The objective of this post is to walk you through how to set up a completely serverless Jenkins environment on AWS Fargate using AWS Cloud Development Kit (AWS CDK).

Jenkins is a popular open-source automation server that provides hundreds of plugins to support building, testing, deploying, and automation. Jenkins uses a controller-agent architecture in which the controller is responsible for serving the web UI, stores the configurations and related data on disk, and delegates the jobs to the worker agents that run these jobs as their primary responsibility.

Amazon Elastic Container Service (Amazon ECS) using Fargate is a fully-managed container orchestration service that helps you easily deploy, manage, and scale containerized applications. It deeply integrates with the rest of the AWS platform to provide a secure and easy-to-use solution for running container workloads in the cloud and now on your infrastructure. Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. Fargate is compatible with both Amazon ECS and Amazon Elastic Kubernetes Service (Amazon EKS).

Solution overview

The following diagram illustrates the solution architecture. The dashed lines indicate the AWS CDK deployment.

Figure 1: This diagram shows AWS CDK and how it deploys using AWS CloudFormation to create the Elastic Load Balancer, AWS Fargate, and Amazon EFS

You’ll be using the following:

The Jenkins controller URL is backed by an Application Load Balancer (ALB).
The example uses the Amazon Virtual Private Cloud (Amazon VPC) that the Amazon ECS cluster construct creates when no VPC is supplied.
The Jenkins controller runs as a service in Amazon ECS using Fargate as the launch type. You’ll use Amazon Elastic File System (Amazon EFS) as the persistent backing store for the Jenkins controller task. The Jenkins controller and Amazon EFS are launched in private subnets.

Prerequisites

For this post, you’ll utilize AWS CDK using TypeScript.

Follow the guide on Getting Started for AWS CDK to:

Get your local environment set up
Bootstrap your development account

Code

Let’s review the code used to define the Jenkins environment in AWS using the AWS CDK.

Set up your imports

import { Duration, RemovalPolicy, Stack } from 'aws-cdk-lib';
import { Construct } from 'constructs';

import * as cdk from 'aws-cdk-lib';

import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as efs from 'aws-cdk-lib/aws-efs';
import { Port } from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

Set up your Amazon ECS cluster, which is a logical grouping of tasks or services, and get its VPC

export class AppStack extends Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const jenkinsHomeDir: string = 'jenkins-home';
    const appName: string = 'jenkins-cdk';

    const cluster = new ecs.Cluster(this, `${appName}-cluster`, {
      clusterName: appName,
    });

    const vpc = cluster.vpc;

Set up Amazon EFS to store the data

    const fileSystem = new efs.FileSystem(this, `${appName}-efs`, {
      vpc: vpc,
      fileSystemName: appName,
      removalPolicy: RemovalPolicy.DESTROY,
    });

Set up an Access Point, which is an application-specific entry point into an Amazon EFS file system that makes it easier to manage application access to shared datasets

    const accessPoint = fileSystem.addAccessPoint(`${appName}-ap`, {
      path: `/${jenkinsHomeDir}`,
      posixUser: {
        uid: '1000',
        gid: '1000',
      },
      createAcl: {
        ownerGid: '1000',
        ownerUid: '1000',
        permissions: '755',
      },
    });

Set up a Task Definition to run Docker containers in Amazon ECS

    const taskDefinition = new ecs.FargateTaskDefinition(
      this,
      `${appName}-task`,
      {
        family: appName,
        cpu: 1024,
        memoryLimitMiB: 2048,
      }
    );

Set up a Volume that maps the Amazon EFS file system from above to the Task Definition

    taskDefinition.addVolume({
      name: jenkinsHomeDir,
      efsVolumeConfiguration: {
        fileSystemId: fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
        authorizationConfig: {
          accessPointId: accessPoint.accessPointId,
          iam: 'ENABLED',
        },
      },
    });

Set up the Container using the Task Definition and the Jenkins image from the registry

    const containerDefinition = taskDefinition.addContainer(appName, {
      image: ecs.ContainerImage.fromRegistry('jenkins/jenkins:lts'),
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'jenkins' }),
      portMappings: [{ containerPort: 8080 }],
    });

Set up Mount Points to bind the Amazon EFS volume to the container

    containerDefinition.addMountPoints({
      containerPath: '/var/jenkins_home',
      sourceVolume: jenkinsHomeDir,
      readOnly: false,
    });

Set up a Fargate Service to run the container serverlessly

    const fargateService = new ecs.FargateService(this, `${appName}-service`, {
      serviceName: appName,
      cluster: cluster,
      taskDefinition: taskDefinition,
      desiredCount: 1,
      maxHealthyPercent: 100,
      minHealthyPercent: 0,
      healthCheckGracePeriod: Duration.minutes(5),
    });
    fargateService.connections.allowTo(fileSystem, Port.tcp(2049));

Set up the ALB and add a listener that checks for connection requests, using the protocol and port that you configure.

    const loadBalancer = new elbv2.ApplicationLoadBalancer(
      this,
      `${appName}-elb`,
      {
        loadBalancerName: appName,
        vpc: vpc,
        internetFacing: true,
      }
    );
    const lbListener = loadBalancer.addListener(`${appName}-listener`, {
      port: 80,
    });

Set up a Target to route requests to Jenkins running on Amazon ECS using Fargate

    const loadBalancerTarget = lbListener.addTargets(`${appName}-target`, {
      port: 8080,
      targets: [fargateService],
      deregistrationDelay: Duration.seconds(10),
      healthCheck: { path: '/login' },
    });
  }
}
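The stack class above still needs an app entry point before `cdk deploy` can synthesize it. A minimal sketch of such a file follows. The import path `../lib/app-stack` and the stack ID `JenkinsCdkStack` are assumptions for illustration, so adjust them to match your own project layout.

```typescript
#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
// Hypothetical path; point this at the file that exports the AppStack class above.
import { AppStack } from '../lib/app-stack';

const app = new cdk.App();

// 'JenkinsCdkStack' is an illustrative stack ID, not a name from the post.
new AppStack(app, 'JenkinsCdkStack', {
  // Deploy to the account and Region resolved from the current CLI credentials.
  env: {
    account: process.env.CDK_DEFAULT_ACCOUNT,
    region: process.env.CDK_DEFAULT_REGION,
  },
});
```

A project generated with `cdk init app --language typescript` already contains an equivalent file under `bin/`, in which case only the stack class needs to be replaced.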

Jenkins Deployment

Now that you have all the code, let’s deploy the AWS CDK definition:

Make sure that you have done the Prerequisite steps from earlier.
Install packages by running the following command in your IDE CLI:

npm i

Now you’ll deploy your AWS CDK definition to your dev account:

cdk deploy

Let’s now log in to Jenkins

In your browser, open the DNS name of the deployed load balancer.
In Amazon CloudWatch, find the log group that was created for the cluster service.
Open the log in that log group, where Jenkins outputs the password that you need to log in.

In Jenkins, follow the wizard to continue the setup

Cleaning up

To avoid incurring future charges, delete the resources.

Let’s destroy our deployed solution

In your IDE CLI:

cdk destroy

Conclusion

In this post, we covered the following:

Building an Application Load Balancer
Running Jenkins from a container image on AWS Fargate
Running all resources serverlessly
Building everything with the AWS CDK

About the author: