Tag Archives: Technical How-to

Multi-Region Terraform Deployments with AWS CodePipeline using Terraform Built CI/CD

Post Syndicated from Lerna Ekmekcioglu original https://aws.amazon.com/blogs/devops/multi-region-terraform-deployments-with-aws-codepipeline-using-terraform-built-ci-cd/

As of February 2022, the AWS Cloud spans 84 Availability Zones within 26 geographic Regions, with announced plans for more Availability Zones and Regions. Customers can leverage this global infrastructure to expand their presence to their primary target of users, satisfying data residency requirements, and implementing disaster recovery strategy to make sure of business continuity. Although leveraging multi-Region architecture would address these requirements, deploying and configuring consistent infrastructure stacks across multi-Regions could be challenging, as AWS Regions are designed to be autonomous in nature. Multi-region deployments with Terraform and AWS CodePipeline can help customers with these challenges.

In this post, we’ll demonstrate the best practice for multi-Region deployments using HashiCorp Terraform as infrastructure as code (IaC), and AWS CodeBuild , CodePipeline as continuous integration and continuous delivery (CI/CD) for consistency and repeatability of deployments into multiple AWS Regions and AWS Accounts. We’ll dive deep on the IaC deployment pipeline architecture and the best practices for structuring the Terraform project and configuration for multi-Region deployment of multiple AWS target accounts.

You can find the sample code for this solution here

Solutions Overview

Architecture

The following architecture diagram illustrates the main components of the multi-Region Terraform deployment pipeline with all of the resources built using IaC.

DevOps engineer initially works against the infrastructure repo in a short-lived branch. Once changes in the short-lived branch are ready, DevOps engineer gets them reviewed and merged into the main branch. Then, DevOps engineer git tags the repo. For any future changes in the infra repo, DevOps engineer repeats this same process.

Git tags named “dev_us-east-1/research/1.0”, “dev_eu-central-1/research/1.0”, “dev_ap-southeast-1/research/1.0”, “dev_us-east-1/risk/1.0”, “dev_eu-central-1/risk/1.0”, “dev_ap-southeast-1/risk/1.0” corresponding to the version 1.0 of the code to release from the main branch using git tagging. Short-lived branch in between each version of the code, followed by git tags corresponding to each subsequent version of the code such as version 1.1 and version 2.0.”

Fig 1. Tagging to release from the main branch.

  1. The deployment is triggered from DevOps engineer git tagging the repo, which contains the Terraform code to be deployed. This action starts the deployment pipeline execution.
    Tagging with ‘dev_us-east-1/research/1.0’ triggers a pipeline to deploy the research dev account to us-east-1. In our example git tag ‘dev_us-east-1/research/1.0’ contains the target environment (i.e., dev), AWS Region (i.e. us-east-1), team (i.e., research), and a version number (i.e., 1.0) that maps to an annotated tag on a commit ID. The target workload account aliases (i.e., research dev, risk qa) are mapped to AWS account numbers in the environment configuration files of the infra repo in AWS CodeCommit.
The central tooling account contains the CodeCommit Terraform infra repo, where DevOps engineer has git access, along with the pipeline trigger, the CodePipeline dev pipeline consisting of the S3 bucket with Terraform infra repo and git tag, CodeBuild terraform tflint scan, checkov scan, plan and apply. Terraform apply points using the cross account role to VPC containing an Application Load Balancer (ALB) in eu-central-1 in the dev target workload account. A qa pipeline, a staging pipeline, a prod pipeline are included along with a qa target workload account, a staging target workload account, a prod target workload account. EventBridge, Key Management Service, CloudTrail, CloudWatch in us-east-1 Region are in the central tooling account along with Identity Access Management service. In addition, the dev target workload account contains us-east-1 and ap-southeast-1 VPC’s each with an ALB as well as Identity Access Management.

Fig 2. Multi-Region AWS deployment with IaC and CI/CD pipelines.

  1. To capture the exact git tag that starts a pipeline, we use an Amazon EventBridge rule. The rule is triggered when the tag is created with an environment prefix for deploying to a respective environment (i.e., dev). The rule kicks off an AWS CodeBuild project that takes the git tag from the AWS CodeCommit event and stores it with a full clone of the repo into a versioned Amazon Simple Storage Service (Amazon S3) bucket for the corresponding environment.
  2. We have a continuous delivery pipeline defined in AWS CodePipeline. To make sure that the pipelines for each environment run independent of each other, we use a separate pipeline per environment. Each pipeline consists of three stages in addition to the Source stage:
    1. IaC linting stage – A stage for linting Terraform code. For illustration purposes, we’ll use the open source tool tflint.
    2. IaC security scanning stage – A stage for static security scanning of Terraform code. There are many tooling choices when it comes to the security scanning of Terraform code. Checkov, TFSec, and Terrascan are the commonly used tools. For illustration purposes, we’ll use the open source tool Checkov.
    3. IaC build stage – A stage for Terraform build. This includes an action for the Terraform execution plan followed by an action to apply the plan to deploy the stack to a specific Region in the target workload account.
  1. Once the Terraform apply is triggered, it deploys the infrastructure components in the target workload account to the AWS Region based on the git tag. In turn, you have the flexibility to point the deployment to any AWS Region or account configured in the repo.
  2. The sample infrastructure in the target workload account consists of an AWS Identity and Access Management (IAM) role, an external facing Application Load Balancer (ALB), as well as all of the required resources down to the Amazon Virtual Private Cloud (Amazon VPC). Upon successful deployment, browsing to the external facing ALB DNS Name URL displays a very simple message including the location of the Region.

Architectural considerations

Multi-account strategy

Leveraging well-architected multi-account strategy, we have a separate central tooling account for housing the code repository and infrastructure pipeline, and a separate target workload account to house our sample workload infra-architecture. The clean account separation lets us easily control the IAM permission for granular access and have different guardrails and security controls applied. Ultimately, this enforces the separation of concerns as well as minimizes the blast radius.

A dev pipeline, a qa pipeline, a staging pipeline and, a prod pipeline in the central tooling account, each targeting the workload account for the respective environment pointing to the Regional resources containing a VPC and an ALB.

Fig 3. A separate pipeline per environment.

The sample architecture shown above contained a pipeline per environment (DEV, QA, STAGING, PROD) in the tooling account deploying to the target workload account for the respective environment. At scale, you can consider having multiple infrastructure deployment pipelines for multiple business units in the central tooling account, thereby targeting workload accounts per environment and business unit. If your organization has a complex business unit structure and is bound to have different levels of compliance and security controls, then the central tooling account can be further divided into the central tooling accounts per business unit.

Pipeline considerations

The infrastructure deployment pipeline is hosted in a central tooling account and targets workload accounts. The pipeline is the authoritative source managing the full lifecycle of resources. The goal is to decrease the risk of ad hoc changes (e.g., manual changes made directly via the console) that can’t be easily reproduced at a future date. The pipeline and the build step each run as their own IAM role that adheres to the principle of least privilege. The pipeline is configured with a stage to lint the Terraform code, as well as a static security scan of the Terraform resources following the principle of shifting security left in the SDLC.

As a further improvement for resiliency and applying the cell architecture principle to the CI/CD deployment, we can consider having multi-Region deployment of the AWS CodePipeline pipeline and AWS CodeBuild build resources, in addition to a clone of the AWS CodeCommit repository. We can use the approach detailed in this post to sync the repo across multiple regions. This means that both the workload architecture and the deployment infrastructure are multi-Region. However, it’s important to note that the business continuity requirements of the infrastructure deployment pipeline are most likely different than the requirements of the workloads themselves.

A dev pipeline in us-east-1, a dev pipeline in eu-central-1, a dev pipeline in ap-southeast-1, all in the central tooling account, each pointing respectively to the regional resources containing a VPC and an ALB for the respective Region in the dev target workload account.

Fig 4. Multi-Region CI/CD dev pipelines targeting the dev workload account resources in the respective Region.

Deeper dive into Terraform code

Backend configuration and state

As a prerequisite, we created Amazon S3 buckets to store the Terraform state files and Amazon DynamoDB tables for the state file locks. The latter is a best practice to prevent concurrent operations on the same state file. For naming the buckets and tables, our code expects the use of the same prefix (i.e., <tf_backend_config_prefix>-<env> for buckets and <tf_backend_config_prefix>-lock-<env> for tables). The value of this prefix must be passed in as an input param (i.e., “tf_backend_config_prefix”). Then, it’s fed into AWS CodeBuild actions for Terraform as an environment variable. Separation of remote state management resources (Amazon S3 bucket and Amazon DynamoDB table) across environments makes sure that we’re minimizing the blast radius.


-backend-config="bucket=${TF_BACKEND_CONFIG_PREFIX}-${ENV}" 
-backend-config="dynamodb_table=${TF_BACKEND_CONFIG_PREFIX}-lock-${ENV}"
A dev Terraform state files bucket named 

<prefix>-dev, a dev Terraform state locks DynamoDB table named <prefix>-lock-dev, a qa Terraform state files bucket named <prefix>-qa, a qa Terraform state locks DynamoDB table named <prefix>-lock-qa, a staging Terraform state files bucket named <prefix>-staging, a staging Terraform state locks DynamoDB table named <prefix>-lock-staging, a prod Terraform state files bucket named <prefix>-prod, a prod Terraform state locks DynamoDB table named <prefix>-lock-prod, in us-east-1 in the central tooling account” width=”600″ height=”456″>
 <p id=Fig 5. Terraform state file buckets and state lock tables per environment in the central tooling account.

The git tag that kicks off the pipeline is named with the following convention of “<env>_<region>/<team>/<version>” for regional deployments and “<env>_global/<team>/<version>” for global resource deployments. The stage following the source stage in our pipeline, tflint stage, is where we parse the git tag. From the tag, we derive the values of environment, deployment scope (i.e., Region or global), and team to determine the Terraform state Amazon S3 object key uniquely identifying the Terraform state file for the deployment. The values of environment, deployment scope, and team are passed as environment variables to the subsequent AWS CodeBuild Terraform plan and apply actions.

-backend-config="key=${TEAM}/${ENV}-${TARGET_DEPLOYMENT_SCOPE}/terraform.tfstate"

We set the Region to the value of AWS_REGION env variable that is made available by AWS CodeBuild, and it’s the Region in which our build is running.

-backend-config="region=$AWS_REGION"

The following is how the Terraform backend config initialization looks in our AWS CodeBuild buildspec files for Terraform actions, such as tflint, plan, and apply.

terraform init -backend-config="key=${TEAM}/${ENV}-
${TARGET_DEPLOYMENT_SCOPE}/terraform.tfstate" -backend-config="region=$AWS_REGION"
-backend-config="bucket=${TF_BACKEND_CONFIG_PREFIX}-${ENV}" 
-backend-config="dynamodb_table=${TF_BACKEND_CONFIG_PREFIX}-lock-${ENV}"
-backend-config="encrypt=true"

Using this approach, the Terraform states for each combination of account and Region are kept in their own distinct state file. This means that if there is an issue with one Terraform state file, then the rest of the state files aren’t impacted.

In the central tooling account us-east-1 Region, Terraform state files named “research/dev-us-east-1/terraform.tfstate”, “risk/dev-ap-southeast-1/terraform.tfstate”, “research/dev-eu-central-1/terraform.tfstate”, “research/dev-global/terraform.tfstate” are in S3 bucket named 

<prefix>-dev along with DynamoDB table for Terraform state locks named <prefix>-lock-dev. The Terraform state files named “research/qa-us-east-1/terraform.tfstate”, “risk/qa-ap-southeast-1/terraform.tfstate”, “research/qa-eu-central-1/terraform.tfstate” are in S3 bucket named <prefix>-qa along with DynamoDB table for Terraform state locks named <prefix>-lock-qa. Similarly for staging and prod.” width=”600″ height=”677″>
 <p id=Fig 6. Terraform state files per account and Region for each environment in the central tooling account

Following the example, a git tag of the form “dev_us-east-1/research/1.0” that kicks off the dev pipeline works against the research team’s dev account’s state file containing us-east-1 Regional resources (i.e., Amazon S3 object key “research/dev-us-east-1/terraform.tfstate” in the S3 bucket <tf_backend_config_prefix>-dev), and a git tag of the form “dev_ap-southeast-1/risk/1.0” that kicks off the dev pipeline works against the risk team’s dev account’s Terraform state file containing ap-southeast-1 Regional resources (i.e., Amazon S3 object key “risk/dev-ap-southeast-1/terraform.tfstate”). For global resources, we use a git tag of the form “dev_global/research/1.0” that kicks off a dev pipeline and works against the research team’s dev account’s global resources as they are at account level (i.e., “research/dev-global/terraform.tfstate).

Git tag “dev_us-east-1/research/1.0” pointing to the Terraform state file named “research/dev-us-east-1/terraform.tfstate”, git tag “dev_ap-southeast-1/risk/1.0 pointing to “risk/dev-ap-southeast-1/terraform.tfstate”, git tag “dev_eu-central-1/research/1.0” pointing to ”research/dev-eu-central-1/terraform.tfstate”, git tag “dev_global/research/1.0” pointing to “research/dev-global/terraform.tfstate”, in dev Terraform state files S3 bucket named <prefix>-dev along with <prefix>-lock-dev DynamoDB dev Terraform state locks table.” width=”600″ height=”318″>
 <p id=Fig 7. Git tags and the respective Terraform state files.

This backend configuration makes sure that the state file for one account and Region is independent of the state file for the same account but different Region. Adding or expanding the workload to additional Regions would have no impact on the state files of existing Regions.

If we look at the further improvement where we make our deployment infrastructure also multi-Region, then we can consider each Region’s CI/CD deployment to be the authoritative source for its local Region’s deployments and Terraform state files. In this case, tagging against the repo triggers a pipeline within the local CI/CD Region to deploy resources in the Region. The Terraform state files in the local Region are used for keeping track of state for the account’s deployment within the Region. This further decreases cross-regional dependencies.

A dev pipeline in the central tooling account in us-east-1, pointing to the VPC containing ALB in us-east-1 in dev target workload account, along with a dev Terraform state files S3 bucket named <prefix>-use1-dev containing us-east-1 Regional resources “research/dev/terraform.tfstate” and “risk/dev/terraform.tfstate” Terraform state files along with DynamoDB dev Terraform state locks table named <prefix>-use1-lock-dev. A dev pipeline in the central tooling account in eu-central-1, pointing to the VPC containing ALB in eu-central-1 in dev target workload account, along with a dev Terraform state files S3 bucket named <prefix>-euc1-dev containing eu-central-1 Regional resources “research/dev/terraform.tfstate” and “risk/dev/terraform.tfstate” Terraform state files along with DynamoDB dev Terraform state locks table named <prefix>-euc1-lock-dev. A dev pipeline in the central tooling account in ap-southeast-1, pointing to the VPC containing ALB in ap-southeast-1 in dev target workload account, along with a dev Terraform state files S3 bucket named <prefix>-apse1-dev containing ap-southeast-1 Regional resources “research/dev/terraform.tfstate” and “risk/dev/terraform.tfstate” Terraform state files along with DynamoDB dev Terraform state locks table named <prefix>-apse1-lock-dev” width=”700″ height=”603″>
 <p id=Fig 8. Multi-Region CI/CD with Terraform state resources stored in the same Region as the workload account resources for the respective Region

Provider

For deployments, we use the default Terraform AWS provider. The provider is parametrized with the value of the region passed in as an input parameter.

provider "aws" {
  region = var.region
   ...
}

Once the provider knows which Region to target, we can refer to the current AWS Region in the rest of the code.

# The value of the current AWS region is the name of the AWS region configured on the provider
# https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/region
data "aws_region" "current" {} 

locals {
    region = data.aws_region.current.name # then use local.region where region is needed
}

Provider is configured to assume a cross account IAM role defined in the workload account. The value of the account ID is fed as an input parameter.

provider "aws" {
  region = var.region
  assume_role {
    role_arn     = "arn:aws:iam::${var.account}:role/InfraBuildRole"
    session_name = "INFRA_BUILD"
  }
}

This InfraBuildRole IAM role could be created as part of the account creation process. The AWS Control Tower Terraform Account Factory could be used to automate this.

Code

Minimize cross-regional dependencies

We keep the Regional resources and the global resources (e.g., IAM role or policy) in distinct namespaces following the cell architecture principle. We treat each Region as one cell, with the goal of decreasing cross-regional dependencies. Regional resources are created once in each Region. On the other hand, global resources are created once globally and may have cross-regional dependencies (e.g., DynamoDB global table with a replica table in multiple Regions). There’s no “global” Terraform AWS provider since the AWS provider requires a Region. This means that we pick a specific Region from which to deploy our global resources (i.e., global_resource_deploy_from_region input param). By creating a distinct Terraform namespace for Regional resources (e.g., module.regional) and a distinct namespace for global resources (e.g., module.global), we can target a deployment for each using pipelines scoped to the respective namespace (e.g., module.global or module.regional).

Deploying Regional resources: A dev pipeline in the central tooling account triggered via git tag “dev_eu-central-1/research/1.0” pointing to the eu-central-1 VPC containing ALB in the research dev target workload account corresponding to the module.regional Terraform namespace. Deploying global resources: a dev pipeline in the central tooling account triggered via git tag “dev_global/research/1.0” pointing to the IAM resource corresponding to the module.global Terraform namespace.

Fig 9. Deploying regional and global resources scoped to the Terraform namespace

As global resources have a scope of the whole account regardless of Region while Regional resources are scoped for the respective Region in the account, one point of consideration and a trade-off with having to pick a Region to deploy global resources is that this introduces a dependency on that region for the deployment of the global resources. In addition, in the case of a misconfiguration of a global resource, there may be an impact to each Region in which we deployed our workloads. Let’s consider a scenario where an IAM role has access to an S3 bucket. If the IAM role is misconfigured as a result of one of the deployments, then this may impact access to the S3 bucket in each Region.

There are alternate approaches, such as creating an IAM role per Region (myrole-use1 with access to the S3 bucket in us-east-1, myrole-apse1 with access to the S3 bucket in ap-southeast-1, etc.). This would make sure that if the respective IAM role is misconfigured, then the impact is scoped to the Region. Another approach is versioning our global resources (e.g., myrole-v1, myrole-v2) with the ability to move to a new version and roll back to a previous version if needed. Each of these approaches has different drawbacks, such as the duplication of global resources that may make auditing more cumbersome with the tradeoff of minimizing cross Regional dependencies.

We recommend looking at the pros and cons of each approach and selecting the approach that best suits the requirements for your workloads regarding the flexibility to deploy to multiple Regions.

Consistency

We keep one copy of the infrastructure code and deploy the resources targeted for each Region using this same copy. Our code is built using versioned module composition as the “lego blocks”. This follows the DRY (Don’t Repeat Yourself) principle and decreases the risk of code drift per Region. We may deploy to any Region independently, including any Regions added at a future date with zero code changes and minimal additional configuration for that Region. We can see three advantages with this approach.

  1. The total deployment time per Region remains the same regardless of the addition of Regions. This helps for restrictions, such as tight release windows due to business requirements.
  2. If there’s an issue with one of the regional deployments, then the remaining Regions and their deployment pipelines aren’t affected.
  3. It allows the ability to stagger deployments or the possibility of not deploying to every region in non-critical environments (e.g., dev) to minimize costs and remain in line with the Well Architected Sustainability pillar.

Conclusion

In this post, we demonstrated a multi-account, multi-region deployment approach, along with sample code, with a focus on architecture using IaC tool Terraform and CI/CD services AWS CodeBuild and AWS CodePipeline to help customers in their journey through multi-Region deployments.

Thanks to Welly Siauw, Kenneth Jackson, Andy Taylor, Rodney Bozo, Craig Edwards and Curtis Rissi for their contributions reviewing this post and its artifacts.

Author:

Lerna Ekmekcioglu

Lerna Ekmekcioglu is a Senior Solutions Architect with AWS where she helps Global Financial Services customers build secure, scalable and highly available workloads.
She brings over 17 years of platform engineering experience including authentication systems, distributed caching, and multi region deployments using IaC and CI/CD to name a few.
In her spare time, she enjoys hiking, sight seeing and backyard astronomy.

Jack Iu

Jack is a Global Solutions Architect at AWS Financial Services. Jack is based in New York City, where he works with Financial Services customers to help them design, deploy, and scale applications to achieve their business goals. In his spare time, he enjoys badminton and loves to spend time with his wife and Shiba Inu.

Build Health Aware CI/CD Pipelines

Post Syndicated from sangusah original https://aws.amazon.com/blogs/devops/build-health-aware-ci-cd-pipelines/

Everything fails all the time — Werner Vogels, AWS CTO

At the moment of imminent failure, you want to avoid an unlucky deployment. I’ll start here with a short story that demonstrates the purpose of this post.

The DevOps team has just started a database upgrade with a planned outage of 30 minutes. The team automated the entire upgrade flow, triggered a CI/CD pipeline with no human intervention, and the upgrade is progressing smoothly. Then, 20 minutes in, the pipeline is stuck, and your upgrade isn’t progressing. The maintenance window has expired and customers can’t transact. You’ve created a support case, and the AWS engineer confirmed that the upgrade is failing because of a running AWS Health incident in the us-west-2 Region. The engineer has directed the DevOps team to continue monitoring the status.aws.amazon.com page for updates regarding incident resolution. The event continued running for three hours, during which time customers couldn’t transact. Once resolved, the DevOps team retried the failed pipeline, and it completed successfully.

After the incident, the DevOps team explored the possibilities for avoiding these types of incidents in the future. The team was made aware of AWS Health API that provides programmatic access to AWS Health information. In this post, we’ll help the DevOps team make the most of the AWS Health API to proactively prevent unintended outages.

AWS provides Business and Enterprise Support customers with access to the AWS Health API. Customers can have access to running events in the AWS infrastructure that may impact their service usage. Incidents could be Regional, AZ-specific, or even account specific. During these incidents, it isn’t recommended to deploy or change services that are impacted by the event.

In this post, I will walk you through how to embed AWS Health API insights into your CI/CD pipelines to automatically stop deployments whenever an AWS Health event is reported in a Region that you’re operating in. Furthermore, I will demonstrate how you can automate detection and remediation.

The Demo

In this demo, I will use AWS CodePipeline to demonstrate the idea. I will build a simple pipeline that demonstrates the concept without going into the build, test, and deployment specifics.

CodePipeline Flow

The CodePipeline flow consists of three steps:

  1. Source stage that downloads a CloudFormation template from AWS CodeCommit. The template will be deployed in the last stage.
  2. Custom stage that invokes the AWS Lambda function to evaluate the AWS Health. The Lambda function calls the AWS Health API, evaluates the health risk, and calls back CodePipeline with the assessment result.
  3. Deploy stage that deploys the CloudFormation templates downloaded from CodeCommit in the first stage.
The CodePipeline flow consists of 3 steps. First, "source stage" that downloads a CloudFormation template from CodeCommit. The template will be deployed in the last stage. Step 2 is a "custom stage" that invokes the Lambda function to evaluate AWS Health. The Lambda function calls the AWS Health API, evaluates the health risk and calls back CodePipeline with the assessment result. Finally, step 3 is a "deploy stage" that deploys the CloudFormation template downloaded from CodeCommit in the first stage. If a health is detected in step 2, the workflow will retry after a predefined timeout.

Figure 1. CodePipeline workflow.

Lambda evaluation logic

The Lambda function evaluates whether or not a running AWS Health event may be impacted by the deployment. In this case, the following criteria must be met to consider it as safe to deploy:

  • Deployment will take place in the North Virginia Region and accordingly the Lambda function will filter on the us-east-1 Region.
  • A closed event is irrelevant. The Lambda function will filter events with only the open status.
  • AWS Health API can return different event types that may not be relevant, such as: Scheduled Maintenance, and Account and Billing notifications. The Lambda function will filter only “Issue” type events.

The AWS Health API follows a multi-Region application architecture and has two regional endpoints in an active-passive configuration. To support active-passive DNS failover, AWS Health provides a global endpoint. The Python code is available on GitHub with more information in the README on how to build the Lambda code package.

The Lambda function requires the following AWS Identity and Access Management (IAM) permissions to access AWS Health API, CodePipeline, and publish logs to CloudWatch:

{
  "Version": "2012-10-17", 
  "Statement": [
    {
      "Action": [ 
        "logs:CreateLogStream",
        "logs:CreateLogGroup",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow", 
      "Resource": "arn:aws:logs:us-east-1:replaceWithAccountNumber:*"
    },
    {
      "Action": [
        "codepipeline:PutJobSuccessResult",
        "codepipeline:PutJobFailureResult"
        ],
        "Effect": "Allow",
        "Resource": "*"
     },
     {
        "Effect": "Allow",
        "Action": "health:DescribeEvents",
        "Resource": "*"
    }
  ]
}

Solution architecture

This is the solution architecture diagram. It involved three entities: AWS Code Pipeline, AWS Lambda and the AWS Health API. First, AWS Code Pipeline invoke the Lambda function asynchronously. Second, the Lambda function call the AWS Health API, DescribeEvents. Third, the DescribeEvents API will respond back with a list of health events. Finally, the Lambda function will respond with either a success response or a failed one through calling PutJobSuccessResult and PutJobFailureResults consecutively.

Figure 2. Solution architecture diagram.

In CodePipeline, create a new stage with a single action to asynchronously invoke a Lambda function. The function will call AWS Health DescribeEvents API to retrieve the list of active health incidents. Then, the function will complete the event analysis and decide whether or not it may impact the running deployment. Finally, the function will call back CodePipeline with the evaluation results through either PutJobSuccessResult or PutJobFailureResult API operations.

If the Lambda evaluation succeeds, then it will call back the pipeline with a PutJobSuccessResult API. In turn, the pipeline will mark the step as successful and complete the execution.

AWS Code Pipeline workflow execution snapshot from the AWS Console. The first step, Source is a success after completing source code download from AWS CodeCommit service. The second step, check the AWS service health is a success as well.

Figure 3. AWS Code Pipeline workflow successful execution.

If the Lambda evaluation fails, then it will call back the pipeline with a PutJobFailureResult API specifying a failure message. Once the DevOps team is made aware that the event has been resolved, select the Retry button to re-evaluate the health status.

AWS CodePipeline workflow execution snapshot from the AWS Console. The first step, Source is a success after completing source code download from AWS CodeCommit service. The second step, check the AWS service health has failed after detecting a running health event/incident in the operating AWS region.

Figure 4. AWS CodePipeline workflow failed execution.

Your DevOps team must be aware of failed deployments. Therefore, it’s a good idea to configure alerts to notify concerned stakeholders with failed stage executions. Create a notification rule that posts a Slack message if a stage fails. For detailed steps, see Create a notification rule – AWS CodePipeline. In case of failure, a Slack notification will be sent through AWS Chatbot.

A Slack UI snapshot showing the notification to be sent if a deployment fails to execute. The notification shows a title of "AWS CodePipeline Notification". The notification indicates that one action has failed in the stage aws-health-check. The notification also shows that the failure reason is that there is an Incident In Progress. The notification also mentions the Pipeline name as well as the failed stage name.

Figure 5. Slack UI snapshot notification for a failed deployment.

A more elegant solution involves pushing the notification to an SNS topic that in turns calls a Lambda function to retry the failed stage. The Lambda function extracts the pipeline failed stage identifier, and then calls the RetryStageExecution CodePipeline API.

Conclusion

We’ve learned how to create an automation that evaluates the risk associated with proceeding with a deployment in conjunction with a running AWS Health event. Then, the automation decides whether to proceed with the deployment or block the progress to avoid unintended downtime. Accordingly, this results in the improved availability of your application.

This solution isn’t exclusive to CodePipeline. However, the pattern can be applied to other CI/CD tools that your DevOps team uses.

Author:

Islam Ghanim

Islam Ghanim is a Senior Technical Account Manager at Amazon Web Services in Melbourne, Australia. He enjoys helping customers build resilient and cost-efficient architectures. Outside work, he plays squash, tennis and almost any other racket sport.

Implementing Attribute-Based Instance Type Selection using Terraform

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/implementing-attribute-based-instance-type-selection-using-terraform/

This blog post is written by Christian Melendez, Senior Specialist Solutions Architect, Flexible Compute – EC2 Spot and Carlos Manzanedo Rueda, WW SA Leader, Flexible Compute – EC2 Spot.

In this blog post we will cover the release of Terraform support for Attribute-Based Instance Type Selection (ABS). ABS simplifies the configuration required to acquire compute capacity for Instance Flexible workloads. Terraform  is an open-source infrastructure as code software tool by HashiCorp. Hashicorp is an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency.

Introduction

Amazon EC2 provides a wide selection of instance types optimized to fit different use cases. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity and give you the flexibility to choose the appropriate mix of resources for your applications.

Workloads such as continuous integration, analytics, microservices on containers, etc., can use multiple instance types. Customers have been telling us that simplifying the configuration of instance flexible workloads is important. For workloads that are instance flexible, AWS released ABS to express workload requirements as a set of instance attributes such as: vCPU, memory, type of processor, etc. ABS translates these requirements and selects all matching instance types that meet the criteria. To select which instance to launch, Amazon EC2 Auto Scaling Groups and EC2 Fleet chose instances based on the allocation strategy configured. Lowest-price allocation strategy is supported on both Amazon EC2 On-Demand Instances and Amazon EC2 Spot Instances. The recommendation for Spot Instances is to use capacity-optimized, which select the optimal instances that reduce the frequency of interruptions. ABS does also future-proof EC2 Auto Scaling Group and EC2 Fleet configurations: any new instance type we launch that matches the selected attributes, will be included in the list automatically. No need to update your EC2 Auto Scaling Group or EC2 Fleet configuration.

Following our commitment to open-source projects, AWS has added support for ABS in the AWS Terraform provider. You can use ABS for launch templates, EC2 Auto Scaling Group, and EC2 Fleet resources. The minimum required version of the AWS provider is v4.16.0.

Applying instance flexibility is key for running fault-tolerant, elastic, reliable, and cost optimized workloads. By selecting a diversified choice of instances that qualify for your workload, your application will be better prepared to avoid scenarios where lack of capacity on a specific instance type could be an issue. This applies both to On-Demand and Spot Instance-based workloads. For Spot Instances, applying diversification is key. Spot Instances are spare capacity that can be reclaimed by EC2 when it is required. ABS allows you to specify diversification in simple terms, allowing EC2 Auto Scaling Group and EC2 Fleet’s allocation strategy to replace reclaimed Spot Instances with instances from other pools where capacity is available.

Instance Requirement Attributes

To represent the instance requirements for your workload using ABS, there are a set of attributes you can use within the instance_requirements block. When using Terraform, the only two required attributes are memory_mib and vcpu_count. The rest of the attributes provide default values that adhere to Instance Flexible workloads best practices. For example, bare_metal attribute is by default excluded. You can see the full list of ABS attributes in the Terraform docs site.

Once ABS attributes are configured, ABS picks a list of instance types that match the criteria. This list is especially important when you’re using Spot Instances. One of Spot Instances best practices is to diversify the instance types which, in combination with the capacity-optimized allocation strategy, gives you access to the highest amount of Spot capacity pools. For On-Demand Instances, the instance types list is important as well. There might be scenarios where On-Demand Instance pools lack capacity. By applying instance flexibility using ABS, you can avoid the InsufficientInstanceCapacity error. And in combination with the lowest-price allocation strategy, you get the lowest price instance types from your diversified selection.

There are different places where we can specify ABS attributes. We can specify them at the launch templates level and declare a base mechanism to select instances. In most cases, the recommendation is to configure ABS attributes at the EC2 Auto Scaling Group and EC2 Fleet level. Let’s explore each of these options.

Configuring instance requirements within launch templates

Launch templates are instance configuration templates where you specify parameters like AMI ID, instance type, key pair, and security groups to launch instances. You can use ABS attributes in a launch template when you need to be prescriptive, and define sane defaults or guardrails for your workloads. This way, EC2 Auto Scaling Group or EC2 Fleet simply reference and use the launch template.

You should use ABS attributes in launch templates when you want to prevent users from overriding the resources specified by a launch template. Note that is still possible to override those requirements.

Let’s say that we have a Java application that requires a minimum of 4 vCPUs and 8 GiB of memory, and has been using the c5.xlarge instance type. After performance testing we’ve identified that runs better with current instance types generations. The following code snippet represents how to define these requirements in a Launch Template. To see the full list of attributes, visit the launch template doc site.

resource "aws_launch_template" "abs" {
  name_prefix = "abs"
  image_id    = data.aws_ami.abs.id

  instance_requirements {
    memory_mib {
      min = 8192
    }
    vcpu_count {
      min = 4
    }
    instance_generations = ["current"]
  }
}

Note by using instance_requirements block in a LaunchTemplate, you’ll need to use the mixed_instances_policy block in the EC2 Auto Scaling Group.

resource "aws_autoscaling_group" "on_demand" {
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  max_size           = 1
  min_size           = 1
  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.abs.id
      }
    }
  }
}

The EC2 Auto Scaling Group will use instance types that match the requirements in the launch template.

You can preview what are the instances that the EC2 Auto Scaling Group will select. The section “How to Preview Matching Instances without Launching Them” of the ABS blog post, describes how to preview the instances that will be selected.

Launching EC2 Spot Instances with EC2 Auto Scaling

Launch templates are very powerful. They allow you to decouple the attributes such as user_data from the actual instance management. They are idempotent and can be versioned which is key for rolling out changes in the configuration and applying EC2 Auto Scaling Group Instance Refresh.

Our recommendation is to define ABS attributes as overrides within the mixed_instances_policy block in EC2 Auto Scaling Groups. For most of the applications, we recommend using EC2 Auto Scaling Group to provision EC2 instances.

Let’s get back to our previous example. Now we want to be more prescriptive for the instance requirements our Java application uses.

Let’s assume that the Java application is memory intensive, requires a vCPU to Memory ratio of 4GB for every vCPU, and has been using the m5.large instance type. Additionally, it does not need hardware accelerators like GPUs, FPGAs, etc. Our application does also require a minimum of 2 vCPUs, and the range of memory has been reduced to 32GB to avoid large garbage collection scenarios. This time, we’d like to launch only Spot Instances. As mentioned earlier in this blog post, diversification is key for Spot. To improve the experience with Spot, let’s enable the capacity rebalance feature from the EC2 Auto Scaling Group to proactively replace instances that are at an elevated risk of being interrupted. The following code snippet represents the ABS attributes we need for the more prescriptive workload:

resource “aws_autoscaling_group” “spot” {
  availability_zones = [“us-east-1a”, “us-east-1b”, “us-east-1c”]
  desired_capacity   = 1
  max_size           = 1
  min_size           = 1
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      spot_allocation_strategy = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.x86.id
      }

      override {
        instance_requirements {
          memory_mib {
            min = 4096
            max = 32768
          }

          vcpu_count {
            min = 2
          }

          memory_gib_per_vcpu {
            min = 4
            max = 4
          }

          accelerator_count {
            max = 0
          }
        }
      }
    }
  }
}

Launching EC2 Spot Instances with EC2 Fleet

Another method we have to launch EC2 instances is the EC2 Fleet API. We recommend to use EC2 Fleet for workloads that need granular controls to provision capacity. For example, tight HPC workloads where instances must be close together (single Availability Zone and within the same placement group) and need similar instance types. EC2 Fleet is also used by capacity orchestrators such as Karpenter or Atlassian Escalator that implement tuned up and optimized logic to provision capacity.

Let’s say that this time the workload is a CPU bound workload and has been using the c5.9xlarge instance type. The workload can be retried and the application supports checkpointing, so it qualifies to use Spot Instances. Given we’ll be using Spot Instances, we would like to benefit from the capacity rebalance feature as we did before in the EC2 Auto Scaling Group example. The application requires very prescriptive ranges of vCPU and memory and we also need a minimum of 100 GB SSD local storage. While in most cases EC2 Auto Scaling Groups are appropriate solution to procure and maintain capacity, in this case we will use EC2 Fleet.

The following code snippet represents the ABS attributes we need for this workload:

resource "aws_ec2_fleet" "spot" {
  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity        = 5
  }

  spot_options {
    allocation_strategy = "capacity-optimized"
    maintenance_strategies {
      capacity_rebalance {
        replacement_strategy = "launch"
      }
    }
  }

  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.x86.id
      version            = aws_launch_template.x86.latest_version
    }

    override {
      instance_requirements {
        memory_mib {
          min = 65536
          max = 73728
        }

        vcpu_count {
          min = 32
          max = 36
        }

        cpu_manufacturers   = ["intel"]
        local_storage       = "required"
        local_storage_types = ["ssd"]

        total_local_storage_gb {
          min = 100
        }
      }
    }
  }
}

Multi-Architecture workloads using Graviton and x86 with EC2 Auto Scaling Groups

Another recent feature from EC2 Auto Scaling Groups is that you can build multi-architecture workloads. EC2 Auto Scaling Group allows you to mix Graviton and x86 instance types in the same EC2 Auto Scaling Group. Unlike the x86_64 instances we have used so far, AWS Graviton processors are custom built by AWS using 64-bit Arm. You need to use different launch templates as each architecture needs to use a different AMI. This can be defined in within the override block.

In the example below, we use ABS to define different attributes depending on the CPU architecture. And what’s great about doing this is that we don’t need to exclude instance types. Instances will be launched with a compatible CPU architecture based on the AMI that you specify in our launch template.

Besides supporting mixing architectures, EC2 Auto Scaling Groups allows to combine purchase models. For our example, this time we’ll use a more complex scenario to showcase how powerful and feature rich EC2 Auto Scaling Group has become. The following code snippet applies many of the configurations we’ve seen before, but the key difference here is that we have two overrides. One is for Graviton instances, and another one for x86 instances.

resource "aws_autoscaling_group" "on_demand_spot" {
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  desired_capacity   = 4
  max_size           = 10
  min_size           = 2
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.arm.id
      }

      override {
        launch_template_specification {
          launch_template_id = aws_launch_template.arm.id
        }

        instance_requirements {
          memory_mib {
            min = 16384
            max = 16384
          }

          vcpu_count {
            max = 4
          }
        }
      }

      override {
        launch_template_specification {
          launch_template_id = aws_launch_template.x86.id
        }

        instance_requirements {
          memory_mib {
            min = 16384
          }

          vcpu_count {
            min = 4
          }
        }
      }
    }
  }
}

Multi-architecture workloads can also be applied to container orchestration. Thanks to the Amazon Elastic Container Registry (Ama­zon ECR) support for multi-architecture container images, you can use manifests to push both container images, and Amazon ECR will pull the proper image based on the CPU architecture. We have a workshop for Elastic Kubernetes Service (Amazon EKS) where you can learn more about how to deploy a multi-architecture workload in Amazon EKS.

Conclusion

In this post you’ve learned how to configure ABS to launch instances using Auto Scaling Groups and EC2 Fleet. You’ve learned and how to mix CPU architectures along with purchase models in the same EC2 Auto Scaling Group using Terraform while simplifying the configuration.

Our commitment with open-source projects such as Terraform is to help customers implement AWS best practices in a larger ecosystem easily. ABS support allows customers to get access to compute capacity by simply specifying the resource requirement attributes of their workloads rather than the instance names.

ABS simplifies the configuration for instance flexible workloads and removes the need to list the instances that qualify for your workload. Instead, it simplifies the configuration and future proof for scenarios where AWS includes new instances that qualify for the workload. For Spot workloads where instance diversification is key, ABS simplifies the selection of instances and helps to increase the total number of pools. For more information, visit the ABS user guide and Terraform documentation for Auto Scaling Groups and EC2 Fleet.

Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/big-data/build-an-apache-iceberg-data-lake-using-amazon-athena-amazon-emr-and-aws-glue/

Most businesses store their critical data in a data lake, where you can bring data from various sources to a centralized storage. The data is processed by specialized big data compute engines, such as Amazon Athena for interactive queries, Amazon EMR for Apache Spark applications, Amazon SageMaker for machine learning, and Amazon QuickSight for data visualization.

Apache Iceberg is an open-source table format for data stored in data lakes. It is optimized for data access patterns in Amazon Simple Storage Service (Amazon S3) cloud object storage. Iceberg helps data engineers tackle complex challenges in data lakes such as managing continuously evolving datasets while maintaining query performance. Iceberg allows you to do the following:

  • Maintain transactional consistency where files can be added, removed, or modified atomically with full read isolation and multiple concurrent writes
  • Implement full schema evolution to process safe table schema updates as the table data evolves
  • Organize tables into flexible partition layouts with partition evolution, enabling updates to partition schemes as queries and data volume changes without relying on physical directories
  • Perform row-level update and delete operations to satisfy new regulatory requirements such as the General Data Protection Regulation (GDPR)
  • Provide versioned tables and support time travel queries to query historical data and verify changes between updates
  • Roll back tables to prior versions to return tables to a known good state in case of any issues

In 2021, AWS teams contributed the Apache Iceberg integration with the AWS Glue Data Catalog to open source, which enables you to use open-source compute engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support of Iceberg and Amazon EMR added support of Iceberg starting with version 6.5.0.

In this post, we show you how to use Amazon EMR Spark to create an Iceberg table, load sample books review data, and use Athena to query, perform schema evolution, row-level update and delete, and time travel, all coordinated through the AWS Glue Data Catalog.

Solution overview

We use the Amazon Customer Reviews public dataset as our source data. The dataset contains data files in Apache Parquet format on Amazon S3. We load all the book-related Amazon review data as an Iceberg table to demonstrate the advantages of using the Iceberg table format on top of raw Parquet files. The following diagram illustrates our solution architecture.

Architecture that shows the flow from Amazon EMR loading data into Amazon S3, and queried by Amazon Athena through AWS Glue Data Catalog.

To set up and test this solution, we complete the following high-level steps:

  1. Create an S3 bucket.
  2. Create an EMR cluster.
  3. Create an EMR notebook.
  4. Configure a Spark session.
  5. Load data into the Iceberg table.
  6. Query the data in Athena.
  7. Perform a row-level update in Athena.
  8. Perform a schema evolution in Athena.
  9. Perform time travel in Athena.
  10. Consume Iceberg data across Amazon EMR and Athena.

Prerequisites

To follow along with this walkthrough, you must have the following:

  • An AWS Account with a role that has sufficient access to provision the required resources.

Create an S3 bucket

To create an S3 bucket that holds your Iceberg data, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a name (for this post, we enter aws-lake-house-iceberg-blog-demo).

Because S3 bucket names are globally unique, choose a different name when you create your bucket.

  1. For AWS Region, choose your preferred Region (for this post, we use us-east-1).

Create a new Amazon S3 bucket. Choose us-east-1 as region

  1. Complete the remaining steps to create your bucket.
  2. If this is the first time that you’re using Athena to run queries, create another globally unique S3 bucket to hold your Athena query output.

Create an EMR cluster

Now we’re ready to start an EMR cluster to run Iceberg jobs using Spark.

  1. On the Amazon EMR console, choose Create cluster.
  2. Choose Advanced options.
  3. For Software Configuration, choose your Amazon EMR release version.

Iceberg requires release 6.5.0 and above.

  1. Select JupyterEnterpriseGateway and Spark as the software to install.
  2. For Edit software settings, select Enter configuration and enter [{"classification":"iceberg-defaults","properties":{"iceberg.enabled":true}}].
  3. Leave other settings at their default and choose Next.

Choose Amazon EMR release 6.6.0 and JupyterEnterpriseGateway and Spark. Enter configuration information.

  1. You can change the hardware used by the Amazon EMR cluster in this step. In this demo, we use the default setting.
  2. Choose Next.
  3. For Cluster name, enter Iceberg Spark Cluster.
  4. Leave the remaining settings unchanged and choose Next.

Provide Iceberg Spark Cluster as the Cluster name

  1. You can configure security settings such as adding an EC2 key pair to access your EMR cluster locally. In this demo, we use the default setting.
  2. Choose Create cluster.

You’re redirected to the cluster detail page, where you wait for the EMR cluster to transition from Starting to Waiting.

Create an EMR notebook

When the cluster is active and in the Waiting state, we’re ready to run Spark programs in the cluster. For this demo, we use an EMR notebook to run Spark commands.

  1. On the Amazon EMR console, choose Notebooks in the navigation pane.
  2. Choose Create notebook.
  3. For Notebook name, enter a name (for this post, we enter iceberg-spark-notebook).
  4. For Cluster, select Choose an existing cluster and choose Iceberg Spark Cluster.
  5. For AWS service role, choose Create a new role to create EMR_Notebook_DefaultRole or choose a different role to access resources in the notebook.
  6. Choose Create notebook.

Create an Amazon EMR notebook. Use EMR_Notebooks_DefaultRole

You’re redirected to the notebook detail page.

  1. Choose Open in JupyterLab next to your notebook.
  2. Choose to create a new notebook.
  3. Under Notebook, choose Spark.

Choose Spark from the options provided in the Launcher

Configure a Spark session

In your notebook, run the following code:

%%configure -f
{
  "conf": {
    "spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.demo.warehouse": "s3://<your-iceberg-blog-demo-bucket>",
    "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  }
}

This sets the following Spark session configurations:

  • spark.sql.catalog.demo – Registers a Spark catalog named demo, which uses the Iceberg Spark catalog plugin
  • spark.sql.catalog.demo.catalog-impl – The demo Spark catalog uses AWS Glue as the physical catalog to store Iceberg database and table information
  • spark.sql.catalog.demo.warehouse – The demo Spark catalog stores all Iceberg metadata and data files under the root path s3://<your-iceberg-blog-demo-bucket>
  • spark.sql.extensions – Adds support to Iceberg Spark SQL extensions, which allows you to run Iceberg Spark procedures and some Iceberg-only SQL commands (you use this in a later step)

Load data into the Iceberg table

In our Spark session, run the following commands to load data:

// create a database in AWS Glue named reviews if not exist
spark.sql("CREATE DATABASE IF NOT EXISTS demo.reviews")

// load reviews related to books
val book_reviews_location = "s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet"
val book_reviews = spark.read.parquet(book_reviews_location)

// write book reviews data to an Iceberg v2 table
book_reviews.writeTo("demo.reviews.book_reviews").tableProperty("format-version", "2").createOrReplace()

Iceberg format v2 is needed to support row-level updates and deletes. See Format Versioning for more details.

It may take up to 15 minutes for the commands to complete. When it’s complete, you should be able to see the table on the AWS Glue console, under the reviews database, with the table_type property shown as ICEBERG.

Shows the table properties for book_reviews table

The table schema is inferred from the source Parquet data files. You can also create the table with a specific schema before loading data using Spark SQL, Athena SQL, or Iceberg Java and Python SDKs.

Query in Athena

Navigate to the Athena console and choose Query editor. If this is your first time using the Athena query editor, you need to configure to use the S3 bucket you created earlier to store the query results.

The table book_reviews is available for querying. Run the following query:

SELECT * FROM reviews.book_reviews LIMIT 5;

The following screenshot shows the first five records from the table being displayed.

Amazon Athena query the first 5 rows and show the results

Perform a row-level update in Athena

In the next few steps, let’s focus on a record in the table with review ID RZDVOUQG1GBG7. Currently, it has no total votes when we run the following query:

SELECT total_votes FROM reviews.book_reviews 
WHERE review_id = 'RZDVOUQG1GBG7'

Query total_votes for a particular review which shows a value of 0

Let’s update the total_votes value to 2 using the following query:

UPDATE reviews.book_reviews
SET total_votes = 2
WHERE review_id = 'RZDVOUQG1GBG7'

Update query to set the total_votes for the previous review_id to 2

After your update command runs successfully, run the below query and note the updated result showing a total of two votes:

SELECT total_votes FROM reviews.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'

Athena enforces ACID transaction guarantee for all the write operations against an Iceberg table. This is done through the Iceberg format’s optimistic locking specification. When concurrent attempts are made to update the same record, a commit conflict occurs. In this scenario, Athena displays a transaction conflict error, as shown in the following screenshot.

Concurrent updates causes a failure. This shows the TRANSACTION_CONFLICT error during this scenario.

Delete queries work in a similar way; see DELETE for more details.

Perform a schema evolution in Athena

Suppose the review suddenly goes viral and gets 10 billion votes:

UPDATE reviews.book_reviews
SET total_votes = 10000000000
WHERE review_id = 'RZDVOUQG1GBG7'

Based on the AWS Glue table information, the total_votes is an integer column. If you try to update a value of 10 billion, which is greater than the maximum allowed integer value, you get an error reporting a type mismatch.

Updating to a very large value greater than maximum allowed integer value results in an error

Iceberg supports most schema evolution features as metadata-only operations, which don’t require a table rewrite. This includes add, drop, rename, reorder column, and promote column types. To solve this issue, you can change the integer column total_votes to a BIGINT type by running the following DDL:

ALTER TABLE reviews.book_reviews
CHANGE COLUMN total_votes total_votes BIGINT;

You can now update the value successfully:

UPDATE reviews.book_reviews
SET total_votes = 10000000000
WHERE review_id = 'RZDVOUQG1GBG7'

Querying the record now gives us the expected result in BIGINT:

SELECT total_votes FROM reviews.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'

Perform time travel in Athena

In Iceberg, the transaction history is retained, and each transaction commit creates a new version. You can perform time travel to look at a historical version of a table. In Athena, you can use the following syntax to travel to a time that is after when the first version was committed:

SELECT total_votes FROM reviews.book_reviews
FOR SYSTEM_TIME AS OF localtimestamp + interval '-20' minute
WHERE review_id = 'RZDVOUQG1GBG7'

Query an earlier snapshot using time travel feature

Consume Iceberg data across Amazon EMR and Athena

One of the most important features of a data lake is for different systems to seamlessly work together through the Iceberg open-source protocol. After all the operations are performed in Athena, let’s go back to Amazon EMR and confirm that Amazon EMR Spark can consume the updated data.

First, run the same Spark SQL and see if you get the same result for the review used in the example:

val select_votes = """SELECT total_votes FROM demo.reviews.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'"""

spark.sql(select_votes).show()

Spark shows 10 billion total votes for the review.

Shows the latest value of total_votes when querying using the Amazon EMR notebook

Check the transaction history of the operation in Athena through Spark Iceberg’s history system table:

val select_history = "SELECT * FROM demo.reviews.book_reviews.history"

spark.sql(select_history).show()

This shows three transactions corresponding to the two updates you ran in Athena.

Shows snapshots corresponding to the two updates you ran in Athena

Iceberg offers a variety of Spark procedures to optimize the table. For example, you can run an expire_snapshots procedure to remove old snapshots, and free up storage space in Amazon S3:

import java.util.Calendar
import java.text.SimpleDateFormat

val now = Calendar.getInstance().getTime()
val form = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val now_formatted = form.format(now.getTime())
val procedure = s"""CALL demo.system.expire_snapshots(
  table => 'reviews.book_reviews',
  older_than => TIMESTAMP '$now_formatted',
  retain_last => 1)"""

spark.sql(procedure)

Note that, after running this procedure, time travel can no longer be performed against expired snapshots.

Examine the history system table again and notice that it shows you only the most recent snapshot.

Running the following query in Athena results in an error “No table snapshot found before timestamp…” as older snapshots were deleted, and you are no longer able to time travel to the older snapshot:

SELECT total_votes FROM reviews.book_reviews
FOR SYSTEM_TIME AS OF localtimestamp + interval '-20' minute
WHERE review_id = 'RZDVOUQG1GBG7'

Clean up

To avoid incurring ongoing costs, complete the following steps to clean up your resources:

  1. Run the following code in your notebook to drop the AWS Glue table and database:
// DROP the table 
spark.sql("DROP TABLE demo.reviews.book_reviews") 
// DROP the database 
spark.sql("DROP DATABASE demo.reviews")
  1. On the Amazon EMR console, choose Notebooks in the navigation pane.
  2. Select the notebook iceberg-spark-notebook and choose Delete.
  3. Choose Clusters in the navigation pane.
  4. Select the cluster Iceberg Spark Cluster and choose Terminate.
  5. Delete the S3 buckets and any other resources that you created as part of the prerequisites for this post.

Conclusion

In this post, we showed you an example of using Amazon S3, AWS Glue, Amazon EMR, and Athena to build an Iceberg data lake on AWS. An Iceberg table can seamlessly work across two popular compute engines, and you can take advantage of both to design your customized data production and consumption use cases.

With AWS Glue, Amazon EMR, and Athena, you can already use many features through AWS integrations, such as SageMaker Athena integration for machine learning, or QuickSight Athena integration for dashboard and reporting. AWS Glue also offers the Iceberg connector, which you can use to author and run Iceberg data pipelines.

In addition, Iceberg supports a variety of other open-source compute engines that you can choose from. For example, you can use Apache Flink on Amazon EMR for streaming and change data capture (CDC) use cases. The strong transaction guarantee and efficient row-level update, delete, time travel, and schema evolution experience offered by Iceberg offers a sound foundation and infinite possibilities for users to unlock the power of big data.


About the Authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Jack Ye is a software engineer of the Athena Data Lake and Storage team. He is an Apache Iceberg Committer and PMC member.

Mohit Mehta is a Principal Architect at AWS with expertise in AI/ML and data analytics. He holds 12 AWS certifications and is passionate about helping customers implement cloud enterprise strategies for digital transformation. In his free time, he trains for marathons and plans hikes across major peaks around the world.

Giovanni Matteo Fumarola is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. He has been focusing in the big data analytics space since 2013.

Jared Keating is a Senior Cloud Consultant with AWS Professional Services. Jared assists customers with their cloud infrastructure, compliance, and automation requirements, drawing from his 20+ years of IT experience.

Adding approval notifications to EC2 Image Builder before sharing AMIs

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/adding-approval-notifications-to-ec2-image-builder-before-sharing-amis/

This blog post was written by, Glenn Chia Jin Wee, Associate Cloud Architect at AWS and Randall Han, Associate Professional Services Consultant at AWS.

In some situations, you may be required to manually validate the Amazon Machine Image (AMI) built from an Amazon Elastic Compute Cloud (Amazon EC2) Image Builder pipeline before sharing this AMI to other AWS accounts or to an AWS Organization. Currently, Image Builder provides an end-to-end pipeline that automatically shares AMIs after they’ve been built.

In this post, we will walk through the steps to enable approval notifications before AMIs are shared with other AWS accounts. Having a manual approval step could be useful if you would like to verify the AMI configurations before it is shared to other AWS accounts or an AWS Organization. This reduces the possibility of incorrectly configured AMIs being shared to other teams which in turn could lead to downstream issues if applications are installed using this AMI. This solution uses serverless resources to send an email with a link that automatically shares the AMI with the specified AWS accounts. Users select this link after they’ve verified that the AMI is built according to specifications.

Overview

Architecture Diagram

  1. In this solution, an Image Builder Pipeline is run that builds a Golden AMI in Account A. After the AMI is built, Image Builder publishes data about the AMI to an Amazon Simple Notification Service (Amazon SNS) topic.
  2. This SNS Topic passes the data to an AWS Lambda function that subscribes to it.
  3. The Lambda function that subscribes to this topic retrieves the data, formats it, and sends a customized email to another SNS Topic.
  4. The second SNS Topic has an email subscription with the Approver’s email. The approver will receive the customized email with a URL that interacts with the next set of Serverless resources.
  5. Selecting the URL makes a GET request to Amazon API Gateway, thereby passing the AMI ID in the query string.
  6. API Gateway then triggers another Lambda function and passes the AMI ID to it.
  7. The Lambda function obtains the AMI ID from the query string parameter of the API Gateway request, and then shares it with the provided target account.

Prerequisites

For this walkthrough, you will need the following:

Walkthrough

In this section, we will guide you through the steps required to deploy the Image Builder solution that utilizes Serverless resources. The solution is deployed with AWS SAM.

In this scenario, we deploy the solution within the approver’s account. The approval email will be sent to a predefined email address for manual approval, before the newly created AMI is shared to target accounts.

Once the approver selects the approval link, an email notification will be sent to the predefined target account email address, notifying that the AMI has been successfully shared.

The high-level steps we will follow are:

  1. In Account A, deploy the provided AWS SAM template. This includes an example Image Builder Pipeline, Amazon SNS topics, API Gateway, and Lambda functions.
  2. Approve the SNS subscription from your supplied email address.
  3. Run the pipeline from the Amazon EC2 Image Builder Console.
  4. [Optional] After the pipeline runs, launch an Amazon EC2 instance from the built AMI to conduct manual tests
  5. An Amazon SNS email will be sent to you with an API Gateway URL. When clicked, an AWS Lambda function shares the AMI to the Account B.
  6. Log in to Account B and verify that the AMI has been shared.

Step 1: Launch the AWS SAM template

  1. Clone the SAM templates from this GitHub repository.
  2. Run the following command to deploy the templates via SAM. Replace <approver email> with the Approver’s email and <AWS Account B ID> with the AWS Account ID of your second AWS Account.

sam deploy \

–template-file template.yaml \

–stack-name ec2-image-builder-approver-notifications \

–capabilities CAPABILITY_IAM \

–resolve-s3 \

–parameter-overrides \

ApproverEmail=<approver email> \

TargetAccountEmail=<target account email> \

TargetAccountlds=<AWS Account B ID>

Step 2: Verify your email address

  1. After running the deployment, you will receive an email prompting you to confirm the Subscription at the approver email address. Choose Confirm subscription.

Email to confirm SNS topic subscription

  1. This leads to the following screen, which shows that your subscription is confirmed.

SNS topic subscription confirmation

  1. Repeat the previous 2 steps for the target email address.

Step 3: Run the pipeline from the Image Builder console

  1. In the Image Builder console, under Image pipelines, select the checkbox next to the Pipeline created, choose Actions, and select Run pipeline.

Run the Image Builder Pipeline

Note that the pipeline takes approximately 20 to 30 minutes to complete.

Step 4: [Optional] Launch an Amazon EC2 instance from the built AMI

There could be a requirement to manually validate the AMI before sharing it to other AWS accounts or to the AWS organization. With this requirement, approvers will launch an Amazon EC2 instance from the built AMI and conduct manual tests on the EC2 instance to make sure that it is functional.

  1. In the Amazon EC2 console, under Images, choose AMIs. Validate that the AMI is created.

Validate the AMI has been built

  1. Follow AWS docs: Launching an EC2 instances from a custom AMI for steps on how to launch an Amazon EC2 instance from the AMI.

Step 5: Select the approval URL in the email sent

  1. When the pipeline is run successfully, you will receive another email with a URL to share the AMI.

Approval link to share the AMI to Account B

2. Selecting this URL results in the following screen which shows that the AMI share is successful.

Result showing the AMI was successfully shared after selecting the approval link

Step 6: Verify that the AMI is shared to Account B

  1. Log in to Account B.
  2. In the Amazon EC2 console, under Images, choose AMIs. Then, in the dropdown, choose Private images. Validate that the AMI is shared.

AMI is shared when Private images are selected from the dropdown

3. Verify that a success email notification was sent to the target account email address provided.

Successful AMI share email notification sent to Target Account Email Address

Clean up

This section provides the necessary information for deleting various resources created as part of this post.

1. Deregister the AMIs created and shared.

a. Log in to Account A and follow the steps at AWS documentation: Deregister your Linux AMI.

2. Delete the SAM stack with the following command. Replace <region> with the Region of choice.

sam delete –stack-name ec2-image-builder-approver-notifications –no-prompts –region <region>

3. Delete the CloudWatch log groups for the Lambda functions. You’ll identify it with the name `/aws/lambda/ec2-image-builder-approve*`.

4. Consider deleting the Amazon S3 bucket used to store the packaged Lambda artifact.

Conclusion

In this post, we explained how to use Serverless resources to enable approval notifications for an Image Builder pipeline before AMIs are shared to other accounts. This solution can be extended to share to more than one AWS account or even to an AWS organization. With this solution, you will be notified when new golden images are created, allowing you to verify the correctness of their configuration before sharing them to for wider use. This reduces the possibility of sharing AMIs with misconfigurations that the written tests may not have identified.

We invite you to experiment with different AMIs created using Image Builder, and with different Image Builder components. Check out this GitHub repository for various examples that use Image Builder. Also check out this blog on Image builder integrations with EC2 Auto Scaling Instance Refresh. Let us know your questions and findings in the comments, and have fun!

Capturing GPU Telemetry on the Amazon EC2 Accelerated Computing Instances

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/capturing-gpu-telemetry-on-the-amazon-ec2-accelerated-computing-instances/

This post is written by Amr Ragab, Principal Solutions Architect EC2.

AWS is excited to announce the native integration of monitoring GPU metrics through the CloudWatch Agent. Customers can now easily monitor GPU utilization and its memory to scale their workloads more effectively without custom scripts. In this post, we’ll describe how to allow GPU monitoring and integrate it into your Amazon Machine Images (AMI). Furthermore, we’ll extend this to include the monitoring of GPU hardware events utilizing CloudWatch Log Streams. By combining this telemetry into the Amazon CloudWatch Console, customers can have a complete picture of GPU activity across their fleets.

Capturing GPU metrics

There is an extensive list of NVIDIA accelerator metrics that can be captured. Depending on the workload type, it may be unnecessary to capture all of the metrics at all times. The following table breaks down the suggested metrics to collect by workload type. This considers a balance of cost and impactful metrics at scale.

Compute (Machine Learning (ML), High Performance Computing (HPC)) Graphics/Gaming
utilization_gpu
power_draw
utilization_memory
memory_total
memory_used
memory_free
pcie_link_gen_current
pcie_link_width_current
clocks_current_smclocks_current_memory
utilization_gpu
utilization_memory
memory_total
memory_usedmemory_free
pcie_link_gen_current
pcie_link_width_current
encoder_stats_session_count
encoder_stats_average_fps
encoder_stats_average_latency
clocks_current_graphics
clocks_current_memory
clocks_current_video

Moreover, this is supported through custom AMIs that are deployed with managed service offerings, including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Services (Amazon ECS), and AWS ParallelCluster w/ SLURM for HPC clusters.

The following is an example screenshot from the CloudWatch Console showcasing the telemetry captured for a P4d instance. You can see that we captured the preceding metrics on a per-GPU index. Each Amazon Elastic Compute Cloud (Amazon EC2) P4d instance has 8x A100 GPUs.

Cloudwatch Console

Capturing GPU Xid events

Xid events are a reporting mechanism from GPU hardware vendors that emit notable events from the device to the OS in this case we are capturing the events through the NVRM kernel module. Current GPU architecture requires that the full GPU with protections are passed into the running instance. Thus, most errors that manifest inside of the customer instance aren’t directly visible to the Amazon EC2 virtualization stack. Although some of these errors are benign, others indicate problems with the customer application, the NVIDIA driver, and under rare circumstances a defect in the GPU hardware.

For NVIDIA-based Amazon EC2 instances, these errors will be logged in the system journal with an “NVRM:” regular expression.

These events can be collected and pushed to Amazon CloudWatch Logs as a stream. When an Xid event occurs on the GPU, it will parse the event and push it the log stream for that instance ID in the Region in which the instance is running. The following steps are required to get started capturing those events.

Deployment

We’ll cover the deployment in two different use-cases: 1. You have an existing instance running and you want to start to capture metrics and XID events. 2. You want to build and an AMI and use it within Amazon EC2 or additional services.

I. On a running Amazon EC2 instance

Step 1. Attach an IAM Role to the EC2 instance that has permission to CloudWatch Metrics/Logs. The following is an IAM policy that you can attach to your IAM Role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricStream",
                "logs:CreateLogDelivery",
                "logs:CreateLogStream",
                "cloudwatch:PutMetricData",
                "logs:UpdateLogDelivery",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        }
    ]
}

Step 2. Connect to a shell on the EC2 instance (through SSM or SSH). Install the CloudWatch Agent following the instructions here. There is support across architectures and distributions.

Step 3. Next, we can create our CloudWatch Agent JSON configuration file. The following JSON snippet will capture the logs from gpuerrors.log and push to CloudWatch Logs. Save the contents of the following JSON snippet to a file on the instance at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.

{
     "agent": {
         "run_as_user": "root"
     },
     "metrics": {
         "append_dimensions": {
             "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
             "ImageId": "${aws:ImageId}",
             "InstanceId": "${aws:InstanceId}",
             "InstanceType": "${aws:InstanceType}"
         },
         "aggregation_dimensions": [["InstanceId"]],
         "metrics_collected": {
            "nvidia_gpu": {
                "measurement": [
                    "utilization_gpu",
                    "utilization_memory",
                    "memory_total",
                    "memory_used",
                    "memory_free",
                    "clocks_current_graphics",
                    "clocks_current_sm",
                    "clocks_current_memory"
                ]
            }
         }
     },
     "logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/gpuevent.log",
                         "log_group_name": "/ec2/accelerated/accel-event-log",
                         "log_stream_name": "{instance_id}"
                     }
                 ]
             }
         }
     }
 }

Step 4. To start capturing the logs, restart the aws cloudwatch systemd service.

sudo systemctl restart amazon-cloudwatch-agent.service

At this point, if you navigate to the CloudWatch Console in the Region that the instance is running, – All metrics – CWAgent, you should see a table of metrics similar to the following screenshot.

Cloudwatch Agent Metrics

Step 5. To capture the XID events it’s possible to use the same CloudWatch Log directive used in the preceding image were set the GPU metrics to capture. The JSON following snippet defines that we will stream the log in /var/log/gpuevent.log to CloudWatch.

"logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/gpuevent.log",
                         "log_group_name": "/ec2/accelerated/accel-event-log",
                         "log_stream_name": "{instance_id}"
                     }
                 ]
             }
         }
     }

The GitHub project is an open source reference design for capturing these errors in the CloudWatch agent.

https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline

Step 6. Save the following file as /opt/aws/aws-hwaccel-event-parser.py|.go with the following contents, which will write the Xid errors parsed to /var/log/gpuevent.log:

The code is available in either Python3 or Go (> 1.16).

Golang code of the hwaccel-event-parser: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-ami_base/cloudwatch/nvidia/aws-hwaccel-event-parser.go

Python3 code: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-ami_base/cloudwatch/nvidia/aws-hwaccel-event-parser.py

As you can see from the code, this is a blocking thread, and it will be running during the lifetime of the instance or container.

Step 7. For ease of deployment, you can also create a systemd service (aws-hw-monitor.service), which will run at startup before the CloudWatch agent.

[Unit]
Description=HW Error Monitor
Before=amazon-cloudwatch-agent.service
After=syslog.target network-online.target

[Service]
Type=simple
ExecStart=/opt/aws/cloudwatch/aws-cloudwatch-wrapper.sh
RemainAfterExit=1
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

Where /opt/aws/cloudwatch/aws-cloudwatch-wrapper.sh is a script which contains:

#!/bin/bash
python3 /opt/aws/aws-hwaccel-event-parser.py &

Finally, enable and start the hw monitor service

sudo systemctl enable aws-hw-monitor.service –now

II. Building an AMI

For convenience, the following repo has what is needed to build the AMI for Amazon EC2, Amazon EKS, Amazon ECS, Amazon Linux 2, and Ubuntu 18.04/20.04 distributions. You must have Packer installed on your machine, and it must be authenticated to make API calls on your behalf to AWS. Generally you need to modify the variables:{} json and execute the packer build.

"variables": {
    "region": "us-east-1",
    "flag": "<flag>",
    "subnet_id": "<subnetid>",
    "security_groupids": "<security_group_id,security_group_id",
    "build_ami": "<buildami>",
    "efa_pkg": "aws-efa-installer-latest.tar.gz",
    "intel_mkl_version": "intel-mkl-2020.0-088",
    "nvidia_version": "510.47.03",
    "cuda_version": "cuda-toolkit-11-6 nvidia-gds-11-6",
    "cudnn_version": "libcudnn8",
    "nccl_version": "v2.12.7-1"
  },

After filling in the variables, check that the packer script is validated.

packer validate nvidia-efa-ml-al2.yml
packer build nvidia-efa-ml-al2.yml

The log group namespace is /ec2/accelerated/accel-event-log. However, you may change this to the namespace of your preference in the CloudWatch Agent config file created earlier.

Navigate to the CloudWatch Console – Logs – Log groups – /ec2/accelerated/accel-event-log. It’s sorted by instance ID, where the instance ID of the latest stream is on top.

CloudWatch Log-events

We can see in the preceding screenshot that an example workload ran on instance i-03a7b66de3198977e, which was a p4d.24xlarge triggered a Xid 63 event. Capturing these events is the first step. Next, we must interpret what these events mean. With each Xid error, there is a number associated with each event. As previously mentioned, these can be hardware errors, driver, and/or application errors. If you’re running on an Amazon EC2 accelerated instance, and after code execution run into one of these errors, contact AWS Support with the instance ID and Xid error. The following is a list of the more common Xid errors that you may encounter.

Xid Error Name Description Action
48 Double Bit ECC error Hardware memory error Contact AWS Support with Xid error and instance ID
74 GPU NVLink error Further SXid errors should also be populated which will inform on the error seen with the NVLink fabric Get information on which links are causing the issue by running nvidia-smi nvlink -e
63 GPU Row Remapping Event Specific to Ampere architecture –- a row bank is pending a memory remap Stop all CUDA processes, and reset the GPU (nvidia-smi -r), and make sure thatensure the remap is cleared in nvidia-smi -q
13 Graphics Engine Exception User application fault , illegal instruction or register Rerun the application with CUDA_LAUNCH_BLOCKING=1 enabled which should determine if it’s a NVIDIA driver or hardware issue
31 GPU memory page fault Illegal memory address access error Rerun the application with CUDA_LAUNCH_BLOCKING=1 enabled which should determine if it’s a NVIDIA driver or hardware issue

A quick way to check for row remapping failures is to run the below command on the instance.

nvidia-smi --query-remapped-
rows=gpu_name,gpu_bus_id,remapped_rows.failure,remapped_rows.pending,remapped_rows.correctable,remapped_rows.uncorrectable --format=csv
gpu_name, gpu_bus_id, remapped_rows.failure, remapped_rows.pending, remapped_rows.correctable, remapped_rows.uncorrectable
NVIDIA A100-SXM4-40GB, 00000000:10:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:10:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:20:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:20:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:90:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:90:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:A0:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:A0:1D.0, 0, 0, 0, 0

This isn’t an exhaustive list of Xid events, but it provides some of the more common ones that you may come across as you develop your accelerated workload. You can find a more complete table of events here. Furthermore, if you have questions, you can reach out to AWS Support with the output of the tar ball created by executing the nvidia-bug-report.sh script included with the NVIDIA driver.

Conclusion

Get started with integrating this monitoring into your AMIs if you use custom AMIs specifically for key services, such as Amazon EKS, Amazon ECS, or Amazon EC2 with AWS ParallelCluster. This will help you discover utilization metrics for your accelerated computing workloads. If you have any questions about this post, then reach out to your account team.

Combining Amazon AppFlow with AWS Step Functions to maximize application integration benefits

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/combining-amazon-appflow-with-aws-step-functions-to-maximize-application-integration-benefits/

This post is written by Ahmad Aboushady, Senior Technical Account Manager and Kamen Sharlandjiev, Senior Specialist Solution Architect, Integration.

In this blog post, you learn how to orchestrate AWS service integrations to reduce the manual steps in your workflow. The example uses AWS Step Functions SDK integration to integrate Amazon AppFlow and AWS Glue catalog without writing custom code. It automatically uses Amazon EventBridge to trigger Step Functions every time a new Amazon AppFlow flow finishes running.

Amazon AppFlow enables customers to transfer data securely between software as a service (SaaS) applications, like Salesforce, SAP, Zendesk, Slack, ServiceNow, and multiple AWS services.

An everyday use case of Amazon AppFlow is creating a customer-360 by integrating marketing, customer support, and sales data. For example, analyze the revenue impact of different marketing channels by synchronizing the revenue data from Salesforce with marketing data from Adobe Marketo.

This involves setting up flows to ingest data from different data sources and SaaS applications to AWS Data Lake based on Amazon S3. It uses AWS Glue to crawl and catalog this data. Customers use this catalog to access data quickly in several ways.

For example, they query the data using Amazon Athena or Amazon QuickSight for visualizations, business intelligence and anomaly detection. You can create those data flows quickly with no code required. However, to complete the next set of requirements, customers often go through multiple manual steps of provisioning and configuring different AWS resources. One such step requires creating AWS Glue crawler and running it with every Amazon AppFlow flow execution.

Step Functions can help us automate this process. This is a low-code workflow orchestration service that offers a visual workflow designer. You can quickly build workflows using the built-in drag-and-drop interface available in the AWS Management Console.

You can follow this blog and build your end-to-end state machine using the Step Functions Workflow Studio, or use the AWS Serverless Application Model (AWS SAM) template to deploy the example. The Step Functions state machine uses SDK integration with other AWS Services, so you don’t need to write any custom integration code.

Overview

The following diagram depicts the workflow with the different states in the state machine. You can group these into three phases: preparation, processing, and configuration.

  • The preparation phase captures all the configuration parameters and collects information about the metadata of the data, ingested by Amazon AppFlow.
  • The processing phase generates the AWS Glue table definition and sets the required parameters based on the destination file type. It iterates through the different columns and adds them as part of the table definition.
  • The last phase provides the Glue Catalog resources by creating or updating an existing AWS Glue table. With each Amazon AppFlow flow execution, the state machine determines if a new Glue table partition is required.

Workflow architecture

Preparation phase

The first state, “SetDatabaseAndContext”, is a pass state where you set the configuration parameters used in later states. Set the AWS Glue database and table name and capture the details of the data flow. You can do this by using the parameters filter to build a new JSON payload using parts of the state input similar to:

"Parameters": {
        "Config": {
          "Database": "<Glue-Database-Name>",
          "TableName.$": "$.detail['flow-name']",
          "detail.$": "$.detail"
        }
}

The following state, “DatabaseExist?” is an AWS SDK integration using a “GetDatabase” call to AWS Glue to ensure that the database exists. Here, the state uses error handling to intercept exception messages from the SDK call. This feature splits the workflow and adds an extra step if needed.

In this case, the SDK call returns an exception if the database does not exist, and the workflow invokes the “CreateDatabase” state. It moves to the “CleanUpError” state to clean up any errors and set the configuration parameters accordingly. Afterwards, with the database in place, the workflow continues to the next state: “DescribeFlow”. This returns the metadata of the Amazon AppFlow flow. Part of this metadata is the list of the object fields, which you must create in the Glue table and partitions.

Here is an error handling state that catches exceptions and routes the flow to execute an extra step:

"Catch": [
  {
    "ErrorEquals": [
      "States.ALL"
    ],
    "Comment": "Create Glue Database",
    "Next": "CreateDatabase",
    "ResultPath": "$.error"
  }
]

In the next state, “DescribeFlow”, you use the AWS SDK integration to get the Amazon AppFlow flow configuration. This uses the Amazon AppFlow “DescribeFlow API call. It moves to “S3AsDestination?”, which is a choice state to check if S3 is a destination for the flow. Amazon AppFlow allows you to bring data into different purpose-built data stores, such as S3, Amazon Redshift, or external SaaS or data warehouse applications. This automation can only continue if the configured destination is S3.

The choice definition is:

"Choices": [
  {
    "Variable": "$.FlowConfig.DestinationFlowConfigList[0].ConnectorType",
    "StringEquals": "S3",
    "Next": "GenerateTableDefinition"
  }
],
"Default": "S3NotDestination"

Processing phase

The following state generates the base AWS Glue table definition based on the destination file type. Then it uses a map state to iterate and transform the Amazon AppFlow schema output into what the AWS Glue Data Catalog expects as input.

Next, add the “GenerateTableDefinition” state and use the parameters filter to build a new JSON payload output. Finally, use the information from the “DescribeFlow” state similar to:

"Parameters": {
  "Config.$": "$.Config",
  "FlowConfig.$": "$.FlowConfig",
  "TableInput": {
    "Description": "Created by AmazonAppFlow",
    "Name.$": "$.Config.TableName",
    "PartitionKeys": [
      {
        "Name": "partition_0",
        "Type": "string"
      }
    ],
    "Retention": 0,
    "Parameters": {
      "compressionType": "none",
      "classification.$": "$.FlowConfig.DestinationFlowConfigList[0].DestinationConnectorProperties['S3'].S3OutputFormatConfig.FileType",
      "typeOfData": "file"
    },
    "StorageDescriptor": {
      "BucketColumns": [],
      "Columns.$": "$.FlowConfig.Tasks[?(@.TaskType == 'Map')]",
      "Compressed": false,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location.$": "States.Format('{}/{}/', $.Config.detail['destination-object'], $.FlowConfig.FlowName)",
      "NumberOfBuckets": -1,
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "TableType": "EXTERNAL_TABLE"
  }
}

The following state, “DestinationFileFormatEvaluator”, is a choice state to change the JSON payload according to the destination file type. Amazon AppFlow supports different file type conversions when S3 is the destination of your data. These formats are CSV, Parquet, and JSON Lines. AWS Glue uses various serialization libraries according to the file type.

You iterate within a map state to transform the AWS Glue table schema and set the column type to a known AWS Glue format. If the file type is unrecognized or does not have an equivalent in Glue, default this field to string. The map state configuration is defined as:

"Iterator": {
        "StartAt": "KnownFIleFormat?",
        "States": {
          "KnownFIleFormat?": {
            "Type": "Choice",
            "Choices": [
              {
                "Or": [
                  {
                    "Variable": "$.TaskProperties.SOURCE_DATA_TYPE",
                    "StringEquals": "boolean"
                  },
                  {
                    "Variable": "$.TaskProperties.SOURCE_DATA_TYPE",
                    "StringEquals": "double"
                  },
                  .
                  .
                  .
                  .
                  {
                    "Variable": "$.TaskProperties.SOURCE_DATA_TYPE",
                    "StringEquals": "timestamp"
                  }
                ],
                "Next": "1:1 mapping"
              }
            ],
            "Default": "Cast to String"
          },
          "1:1 mapping": {
            "Type": "Pass",
            "End": true,
            "Parameters": {
              "Name.$": "$.DestinationField",
              "Type.$": "$.TaskProperties.SOURCE_DATA_TYPE"
            }
          },
          "Cast to String": {
            "Type": "Pass",
            "End": true,
            "Parameters": {
              "Name.$": "$.DestinationField",
              "Type": "string"
            }
          }
        }
      },
"ItemsPath": "$.TableInput.StorageDescriptor.Columns",
"ResultPath": "$.TableInput.StorageDescriptor.Columns",

Configuration phase

The next stage in the workflow is “TableExist?”, which checks if the Glue table exists. If the state machine detects any error because the table does not exist, it moves to the “CreateTable” state. Alternatively, it goes to the “UpdateTable” state.

Both states use the AWS SDK integration to create or update the AWS Glue table definition using the “TableInput” parameter. AWS Glue operates with partitions. Every time you have new data stored in a new S3 prefix, you must update the table and add a new partition showing where the data sits.

You need an extra step to check if Amazon AppFlow has stored the data into a new S3 prefix or an existing one. In the “AddPartition?” State, you must review and determine the next step of your workflow. For example, you must validate that the flow executed successfully and processed data.

A choice state helps with those checks:

"And": [
            {
              "Variable": "$.Config.detail['execution-id']",
              "IsPresent": true
            },
            {
              "Variable": "$.Config.detail['status']",
              "StringEquals": "Execution Successful"
            },
            {
              "Not": {
                "Variable": "$.Config.detail['num-of-records-processed']",
                "StringEquals": "0"
              }
            }
          ]

Amazon AppFlow supports different types of flow execution. With scheduled flows, you can regularly configure Amazon AppFlow to hydrate a data lake by bringing only new data since its last execution. Sometimes, after a successful flow execution, there is no new data to ingest. The workflow concludes and moves to the success state in such cases. However, if there is new data, the state machine continues to the next state, “SingleFileAggregation?”.

Amazon AppFlow supports different file aggregation strategies and allows you to aggregate all ingested records into a single or multiple files. Depending on your flow configuration, it may store your data in a different S3 prefix with each flow execution.

In this state, you check this configuration to decide if you need a new partition for your AWS Glue table.

"Variable": "$.FlowConfig.DestinationFlowConfigList[0].DestinationConnectorProperties.S3.S3OutputFormatConfig.AggregationConfig.AggregationType",
"StringEquals": "SingleFile"

If the data flow aggregates all records into a single file per flow execution, it stores all data into a single S3 prefix. In this case, there is a single partition in your AWS Glue table. You must create that single partition the first time this state machine executes for a specific flow.

Use the AWS SDK integration to get the table partition from the AWS Glue in the “IsPartitionExist?” state. Conclude the workflow and move to the “Success” state if the partition exists. Otherwise, create that single partition in another state, “CreateMainPartition”.

If the flow run does not aggregate files, every flow run generates multiple files into a new S3 prefix. In this case, you add a new partition to the AWS Glue table. A pass state, “ConfigureDestination”, configures the required parameters for the partition creation:

"Parameters": {
        "InputFormat.$": "$.TableInput.StorageDescriptor.InputFormat",
        "OutputFormat.$": "$.TableInput.StorageDescriptor.OutputFormat",
        "Columns.$": "$.TableInput.StorageDescriptor.Columns",
        "Compressed.$": "$.TableInput.StorageDescriptor.Compressed",
        "SerdeInfo.$": "$.TableInput.StorageDescriptor.SerdeInfo",
        "Location.$": "States.Format('{}{}', $.TableInput.StorageDescriptor.Location, $.Config.detail['execution-id'])"
      },
 "ResultPath": "$.TableInput.StorageDescriptor"

Next, move to the “CreateNewPartition” state to use the AWS SDK integration to create a new partition to the Glue table similar to:

"Parameters": {
        "DatabaseName.$": "$.Config.Database",
        "TableName.$": "$.Config.TableName",
        "PartitionInput": {
          "Values.$": "States.Array($.Config.detail['execution-id'])",
          "StorageDescriptor.$": "$.TableInput.StorageDescriptor"
        }
      },
"Resource": "arn:aws:states:::aws-sdk:glue:createPartition"

This concludes the workflow with a “Succeed” state after configuring the AWS Glue table in response to the new Amazon AppFlow flow run.

Conclusion

This blog post explores how to integrate Amazon AppFlow and AWS Glue using Step Functions to automate your business requirements. You can use AWS Lambda to simplify the configuration phase and reduce state transitions or create complex checks, filters, or even data cleansing and preparation.

This approach allows you to tailor the schema conversion to your business requirements. Use this AWS SAM template, to deploy this example. This provides the Step Functions workflow described in this post and the EventBridge rule to trigger the state machine after each Amazon AppFlow flow run. The template also includes all required IAM roles and permissions.

For more serverless learning resources, visit Serverless Land.

Introducing a new AWS whitepaper: Does data localization cause more problems than it solves?

Post Syndicated from Jana Kay original https://aws.amazon.com/blogs/security/introducing-a-new-aws-whitepaper-does-data-localization-cause-more-problems-than-it-solves/

Amazon Web Services (AWS) recently released a new whitepaper, Does data localization cause more problems than it solves?, as part of the AWS Innovating Securely briefing series. The whitepaper draws on research from Emily Wu’s paper Sovereignty and Data Localization, published by Harvard University’s Belfer Center, and describes how countries can realize similar data localization objectives through AWS services without incurring the unintended effects highlighted by Wu.

Wu’s research analyzes the intent of data localization policies, and compares that to the reality of the policies’ effects, concluding that data localization policies are often counterproductive to their intended goals of data security, economic competitiveness, and protecting national values.

The new whitepaper explains how you can use the security capabilities of AWS to take advantage of up-to-date technology and help meet your data localization requirements while maintaining full control over the physical location of where your data is stored.

AWS offers robust privacy and security services and features that let you implement your own controls. AWS uses lessons learned around the globe and applies them at the local level for improved cybersecurity against security events. As an AWS customer, after you pick a geographic location to store your data, the cloud infrastructure provides you greater resiliency and availability than you can achieve by using on-prem infrastructure. When you choose an AWS Region, you maintain full control to determine the physical location of where your data is stored. AWS also provides you with resources through the AWS compliance program, to help you understand the robust controls in place at AWS to maintain security and compliance in the cloud.

An important finding of Wu’s research is that localization constraints can deter innovation and hurt local economies because they limit which services are available, or increase costs because there are a smaller number of service providers to choose from. Wu concludes that data localization can “raise the barriers [to entrepreneurs] for market entry, which suppresses entrepreneurial activity and reduces the ability for an economy to compete globally.” Data localization policies are especially challenging for companies that trade across national borders. International trade used to be the remit of only big corporations. Current data-driven efficiencies in shipping and logistics mean that international trade is open to companies of all sizes. There has been particular growth for small and medium enterprises involved in services trade (of which cross-border data flows are a key element). In a 2016 worldwide survey conducted by McKinsey, 86 percent of tech-based startups had at least one cross-border activity. The same report showed that cross-border data flows added some US$2.8 trillion to world GDP in 2014.

However, the availability of cloud services supports secure and efficient cross-border data flows, which in turn can contribute to national economic competitiveness. Deloitte Consulting’s report, The cloud imperative: Asia Pacific’s unmissable opportunity, estimates that by 2024, the cloud will contribute $260 billion to GDP across eight regional markets, with more benefit possible in the future. The World Trade Organization’s World Trade Report 2018 estimates that digital technologies, which includes advanced cloud services, will account for a 34 percent increase in global trade by 2030.

Wu also cites a link between national data governance policies and a government’s concerns that movement of data outside national borders can diminish their control. However, the technology, storage capacity, and compute power provided by hyperscale cloud service providers like AWS, can empower local entrepreneurs.

AWS continually updates practices to meet the evolving needs and expectations of both customers and regulators. This allows AWS customers to use effective tools for processing data, which can help them meet stringent local standards to protect national values and citizens’ rights.

Wu’s research concludes that “data localization is proving ineffective” for meeting intended national goals, and offers practical alternatives for policymakers to consider. Wu has several recommendations, such as continuing to invest in cybersecurity, supporting industry-led initiatives to develop shared standards and protocols, and promoting international cooperation around privacy and innovation. Despite the continued existence of data localization policies, countries can currently realize similar objectives through cloud services. AWS implements rigorous contractual, technical, and organizational measures to protect the confidentiality, integrity, and availability of customer data, regardless of which AWS Region you select to store their data. As an AWS customer, this means you can take advantage of the economic benefits and the support for innovation provided by cloud computing, while improving your ability to meet your core security and compliance requirements.

For more information, see the whitepaper Does data localization cause more problems than it solves?, or contact AWS.

If you have feedback about this post, submit comments in the Comments section below.

Author

Jana Kay

Since 2018, Jana Kay has been a cloud security strategist with the AWS Security Growth Strategies team. She develops innovative ways to help AWS customers achieve their objectives, such as security table top exercises and other strategic initiatives. Previously, she was a cyber, counter-terrorism, and Middle East expert for 16 years in the Pentagon’s Office of the Secretary of Defense.

Arturo Cabanas

Arturo Cabanas

Arturo joined Amazon in 2017 and is AWS Security Assurance Principal for the Public Sector in Latin America, Canada, and the Caribbean. In this role, Arturo creates programs that help governments move their workloads and regulated data to the cloud by meeting their specific security, data privacy regulation, and compliance requirements.

Simplify and optimize Python package management for AWS Glue PySpark jobs with AWS CodeArtifact

Post Syndicated from Ashok Padmanabhan original https://aws.amazon.com/blogs/big-data/simplify-and-optimize-python-package-management-for-aws-glue-pyspark-jobs-with-aws-codeartifact/

Data engineers use various Python packages to meet their data processing requirements while building data pipelines with AWS Glue PySpark Jobs. Languages like Python and Scala are commonly used in data pipeline development. Developers can take advantage of their open-source packages or even customize their own to make it easier and faster to perform use cases, such as data manipulation and analysis. However, managing standardized packages can be cumbersome with multiple teams using different versions of packages, installing non-approved packages, and causing duplicate development effort due to the lack of visibility of what is available at the enterprise level. This can be especially challenging in large enterprises with multiple data engineering teams.

ETL Developers have requirements to use additional packages for their AWS Glue ETL jobs. With security being job zero for customers, many will restrict egress traffic from their VPC to the public internet, and they need a way to manage the packages used by applications including their data processing pipelines.

Our proposed solution will enable you with network egress restrictions to manage packages centrally with AWS CodeArtifact and use their favorite libraries in their AWS Glue ETL PySpark code. In this post, we’ll describe how CodeArtifact can be used for managing packages and modules for AWS Glue ETL jobs, and we’ll demo a solution using Glue PySpark jobs that run within VPC Subnets that have no internet access.

Solution overview

The solution uses CodeArtifact as a tool to make it easier for organizations of any size to securely store, publish, and share software packages used in their ETL with AWS Glue. VPC Endpoints will be enabled for CodeArtifact and Glue to enable private link connections. AWS Step Functions makes it easy to coordinate the orchestration of components used in the data processing pipeline. Native integrations with both CodeArtifact and AWS Glue enable the workflow to both authenticate the request to CodeArtifact and start the AWS Glue ETL job.

The following architecture shows an implementation of a solution using AWS Glue, CodeArtifact, and Step Functions to use additional Python modules without egress internet access. The solution is deployed using AWS Cloud Development Kit (AWS CDK), an open-source software development framework to define your cloud application resources using familiar programming languages.

Solution Architecture for the blog post

Fig 1: Architecture Diagram for the Solution

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploying an AWS CDK stack to provision the following AWS Resources
    1. CodeArtifact
    2. An AWS Glue job
    3. Step Functions workflow
    4. Amazon Simple Storage Service (Amazon S3) bucket
    5. A VPC with a private Subnet and VPC Endpoints to Amazon S3 and CodeArtifact
  2. Validate the Deployment.
  3. Run a Sample Workflow – This workflow will run an AWS Glue PySpark job that uses a custom Python library, and an upgraded version of boto3.
  4. Cleaning up your resources.

Prerequisites

Make sure that you complete the following steps as prerequisites:

The solution

Launching your AWS CDK Stack

Step 1: Using your device’s command line, check out our Git repository to a local directory on your device:

git clone https://github.com/aws-samples/python-lib-management-without-internet-for-aws-glue-in-private-subnets.git

Step 2: Change directories to the new directory Amazon S3 script location:

cd python-lib-management-without-internet-for-aws-glue-in-private-subnets/scripts/s3

Step 3: Download the following CSV, which contains New York City Taxi and Limousine Commission (TLC) Trip weekly trips. This will serve as the input source for the AWS Glue Job:

aws s3 cp s3://nyc-tlc/misc/FOIL_weekly_trips_apps.csv .

Step 4: Change the directories to the path where the app.py file is located (in reference to the previous step, execute the following step):

cd ../..

Step 5: Create a virtual environment:

macOS/Linux:
python3 -m venv .env

Windows:
python -m venv .env

Step 6: Activate the virtual environment after the init process completes and the virtual environment is created:

macOS/Linux:
source .env/bin/activate

Windows:
.env\Scripts\activate.bat

Step 7: Install the required dependencies:

pip3 install -r requirements.txt

Step 8: Make sure that your AWS profile is setup along with the region that you want to deploy as mentioned in the prerequisite. Synthesize the templates. AWS CDK apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:

cdk synthesize

Step 9: BootStrap the cdk app using the following command:

cdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>

Replace the place holder AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the region to be deployed.

This step provisions the initial resources, including an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.

Step 10: Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the --require-approval never flag:

cdk deploy "*" --require-approval never

While the AWS CDK deploys the CloudFormation stacks, you can follow the deployment progress in your terminal:

AWS CDK Deployment progress in terminal

Fig 2: AWS CDK Deployment progress in terminal

Once the deployment is successful, you’ll see the successful status as follows:

AWS CDK Deployment completion success

Fig 3: AWS CDK Deployment completion success

Step 11: Log in to the AWS Console, go to CloudFormation, and see the output of the ApplicationStack stack:

AWS CloudFormation stack output

Fig 4: AWS CloudFormation stack output

Note the values of the DomainName and RepositoryName variables. We’ll use them in the next step to upload our artifacts

Step 12: We will upload a custom library into the repo that we created. This will be used by our Glue ETL job.

  • Install twine using pip:
python3 -m pip install twine

The custom python package glueutils-0.2.0.tar.gz can be found under this folder of the cloned repo:

cd scripts/custom_glue_library
  • Configure twine with the login command (additional details here ). Refer to step 11 for the DomainName and RepositoryName from the CloudFormation output:
aws codeartifact login --tool twine --domain <DomainName> --domain-owner <AWS_ACCOUNTID> --repository <RepositoryName>
  • Publish Python package assets:
twine upload --repository codeartifact glueutils-0.2.0.tar.gz
Python package publishing using twine

Fig 5: Python package publishing using twine

Validate the Deployment

The AWS CDK stack will deploy the following AWS resources:

  1. Amazon Virtual Private Cloud (Amazon VPC)
    1. One Private Subnet
  2. AWS CodeArtifact
    1. CodeArtifact Repository
    2. CodeArtifact Domain
    3. CodeArtifact Upstream Repository
  3. AWS Glue
    1. AWS Glue Job
    2. AWS Glue Database
    3. AWS Glue Connection
  4. AWS Step Function
  5. Amazon S3 Bucket for AWS CDK and also for storing scripts and CSV file
  6. IAM Roles and Policies
  7. Amazon Elastic Compute Cloud (Amazon EC2) Security Group

Step 1: Browse to the AWS account and region via the AWS Console to which the resources are deployed.

Step 2: Browse the Subnet page (https://<region> .console.aws.amazon.com/vpc/home?region=<region> #subnets:) (*Replace region with actual AWS Region to which your resources are deployed)

Step 3: Select the Subnet with name as ApplicationStack/enterprise-repo-vpc/Enterprise-Repo-Private-Subnet1

Step 4: Select the Route Table and validate that there are no Internet Gateway or NAT Gateway for routes to Internet, and that it’s similar to the following image:

Route table validation

Fig 6: Route table validation

Step 5: Navigate to the CodeArtifact console and review the repositories created. The enterprise-repo is your local repository, and pypi-store is the upstream repository connected to the PyPI, providing artifacts from pypi.org.

AWS CodeArifact repositories created

Fig 7: AWS CodeArifact repositories created

Step 6: Navigate to enterprise-repo and search for glueutils. This is the custom python package that we published.

AWS CodeArifact custom python package published

Fig 8: AWS CodeArifact custom python package published

Step 7: Navigate to Step Functions Console and review the enterprise-repo-step-function as follows:

AWS Step Functions workflow

Fig 9: AWS Step Functions workflow

The diagram shows how the Step Functions workflow will orchestrate the pattern.

  1. The first step CodeArtifactGetAuthorizationToken calls the getAuthorizationToken API to generate a temporary authorization token for accessing repositories in the domain (this token is valid for 15 mins.).
  2. The next step GenerateCodeArtifactURL takes the authorization token from the response and generates the CodeArtifact URL.
  3. Then, this will move into the GlueStartJobRun state, which makes a synchronous API call to run the AWS Glue job.

Step 8: Navigate to the AWS Glue Console and select the Jobs tab, then select enterprise-repo-glue-job.

The AWS Glue job is created with the following script and AWS Glue Connection enterprise-repo-glue-connection. The AWS Glue connection is a Data Catalog object that enables the job to connect to sources and APIs from within the VPC. The network type connection runs the job from within the private subnet to make requests to Amazon S3 and CodeArtifact over the VPC endpoint connection. This enables the job to run without any traffic through the internet.

Note the connections section in the AWS Glue PySpark Job, which makes the Glue job run on the private subnet in the VPC provisioned.

AWS Glue network connections

Fig 10: AWS Glue network connections

The job takes an Amazon S3 bucket, Glue Database, Python Job Installer Option, and Additional Python Modules as job parameters. The parameters --additional-python-modules and --python-modules-installer-option are passed to install the selected Python module from a PyPI repository hosted in AWS CodeArtifact.

The script itself first reads the Amazon S3 input path of the taxi data in the CSV format. A light transformation to sum the total trips by year, week, and app is performed. Then the output is written to an Amazon S3 path as parquet . A partitioned table in the AWS Glue Data Catalog will either be created or updated if it already exists .

You can find the Glue PySpark script here.

Run a sample workflow

The following steps will demonstrate how to run a sample workflow:

Step 1: Navigate to the Step Functions Console and select the enterprise-repo-step-function.

Step 2: Select Start execution and input the following: We’re including the glueutils and latest boto3 libraries as part of the job run. It is always recommended to pin your python dependencies to avoid any breaking change due to a future version of dependency . In the below example, the latest available version of boto3, and the 0.2.0 version of glueutils will be installed. To pin it to a specific release you may add  boto3==1.24.2   (Current latest release at the time of publishing this post).

{"pythonmodules": "boto3,glueutils==0.2.0"}

Step 3: Select Start execution and wait until Execution Status is Succeeded. This may take a few minutes.

Step 4: Navigate to the CodeArtifact Console to review the enterprise-repo repository. You’ll see the cached PyPi packages and all of their dependencies pulled down from PyPi.

Step 5: In the Glue Console under the Runs section of the enterprise-glue-job, you’ll see the parameters passed:

Fig 11 : AWS Glue job execution history

Fig 11 : AWS Glue job execution history

Note the --index-url which was passed as a parameter to the glue ETL job. The token is valid only for 15 minutes.

Step 6: Navigate to the Amazon CloudWatch Console and go to the /aws/glue-jobs log group to verify that the packages were installed from the local repo.

You will see that the 2 package names passed as parameters are installed with the corresponding versions.

Fig 12 : Amazon CloudWatch logs details for the Glue job

Fig 12 : Amazon CloudWatch logs details for the Glue job

Step 7: Navigate to the Amazon Athena console and select Query Editor.

Step 8: Run the following query to validate the output of the AWS Glue job:

SELECT year, app, SUM(total_trips) as sum_of_total_trips 
FROM 
"codeartifactblog_glue_db"."taxidataparquet" 
GROUP BY year, app;

Clean up

Make sure that you clean up all of the other AWS resources that you created in the AWS CDK Stack deployment. You can delete these resources via the AWS CDK Destroy command as follows or the CloudFormation console.

To destroy the resources using AWS CDK, follow these steps:

  1. Follow Steps 1-6 from the ‘Launching your CDK Stack’ section.
  2. Destroy the app by executing the following command:
    cdk destroy

Conclusion

In this post, we demonstrated how CodeArtifact can be used for managing Python packages and modules for AWS Glue jobs that run within VPC Subnets that have no internet access. We also demonstrated how the versions of existing packages can be updated (i.e., boto3) and a custom Python library (glueutils) that is developed locally is also managed through CodeArtifact.

This post enables you to use your favorite Python packages with AWS Glue ETL PySpark jobs by modifying the input to the AWS StepFunctions workflow (Step 2 in the Run a Sample workflow section).


About the Authors

Bret Pontillo is a Data & ML Engineer with AWS Professional Services. He works closely with enterprise customers building data lakes and analytical applications on the AWS platform. In his free time, Bret enjoys traveling, watching sports, and trying new restaurants.

Gaurav Gundal is a DevOps consultant with AWS Professional Services, helping customers build solutions on the customer platform. When not building, designing, or developing solutions, Gaurav spends time with his family, plays guitar, and enjoys traveling to different places.

Ashok Padmanabhan is a Sr. IOT Data Architect with AWS Professional Services, helping customers build data and analytics platform and solutions. When not helping customers build and design data lakes, Ashok enjoys spending time at the beach near his home in Florida.

Build a multilingual dashboard with Amazon Athena and Amazon QuickSight

Post Syndicated from Francesco Marelli original https://aws.amazon.com/blogs/big-data/build-a-multilingual-dashboard-with-amazon-athena-and-amazon-quicksight/

Amazon QuickSight is a serverless business intelligence (BI) service used by organizations of any size to make better data-driven decisions. QuickSight dashboards can also be embedded into SaaS apps and web portals to provide interactive dashboards, natural language query or data analysis capabilities to app users seamlessly. The QuickSight Demo Central contains many dashboards, feature showcase and tips and tricks that you can use; in the QuickSight Embedded Analytics Developer Portal you can find details on how to embed dashboards in your applications.

The QuickSight user interface currently supports 15 languages that you can choose on a per-user basis. The language selected for the user interface localizes all text generated by QuickSight with respect to UI components and isn’t applied to the data displayed in the dashboards.

This post describes how to create multilingual dashboards at the data level by creating new columns that contain the translated text and providing a language selection parameter and associated control to display data in the selected language in a QuickSight dashboard. You can create new columns with the translated text in several ways; in this post we create new columns using Amazon Athena user-defined functions implemented in the GitHub project sample Amazon Athena UDFs for text translation and analytics using Amazon Comprehend and Amazon Translate. This approach makes it easy to automatically create columns with translated text using neural machine translation provided by Amazon Translate.

Solution overview

The following diagram illustrates the architecture of this solution.

Architecture

For this post, we use the sample SaaS-Sales.csv dataset and follow these steps:

  1. Copy the dataset to a bucket in Amazon Simple Storage Service (Amazon S3).
  2. Use Athena to define a database and table to read the CSV file.
  3. Create a new table in Parquet format with the columns with the translated text.
  4. Create a new dataset in QuickSight.
  5. Create the parameter and control to select the language.
  6. Create dynamic multilingual calculated fields.
  7. Create an analysis with calculated multilingual calculated fields.
  8. Publish the multilingual dashboard.
  9. Create parametric headers and titles for visuals for use in an embedded dashboard.

An alternative approach might be to directly upload the CSV dataset to QuickSight and create the new columns with translated text as QuickSight calculated fields, for example using the ifelse() conditional function to directly assign the translated values.

Prerequisites

To follow the steps in this post, you need to have an AWS account with an active QuickSight Standard Edition or Enterprise Edition subscription.

Copy the dataset to a bucket in Amazon S3

Use the AWS Command Line Interface (AWS CLI) to create the S3 bucket qs-ml-blog-data and copy the dataset under the prefix saas-sales in your AWS account. You must follow the bucket naming rules to create your bucket. See the following code:

$ MY_BUCKET=qs-ml-blog-data
$ PREFIX=saas-sales

$ aws s3 mb s3://${MY_BUCKET}/

$ aws s3 cp \
    "s3://ee-assets-prod-us-east-1/modules/337d5d05acc64a6fa37bcba6b921071c/v1/SaaS-Sales.csv" \
    "s3://${MY_BUCKET}/${PREFIX}/SaaS-Sales.csv" 

Define a database and table to read the CSV file

Use the Athena query editor to create the database qs_ml_blog_db:

CREATE DATABASE IF NOT EXISTS qs_ml_blog_db;

Then create the new table qs_ml_blog_db.saas_sales:

CREATE EXTERNAL TABLE IF NOT EXISTS qs_ml_blog_db.saas_sales (
  row_id bigint, 
  order_id string, 
  order_date string, 
  date_key bigint, 
  contact_name string, 
  country_en string, 
  city_en string, 
  region string, 
  subregion string, 
  customer string, 
  customer_id bigint, 
  industry_en string, 
  segment string, 
  product string, 
  license string, 
  sales double, 
  quantity bigint, 
  discount double, 
  profit double)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<MY_BUCKET>/saas-sales/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'skip.header.line.count'='1', 
  'typeOfData'='file')

Create a new table in Parquet format with the columns with the translated text

We want to translate the columns country_en, city_en, and industry_en to German, Spanish, and Italian. To do this in a scalable and flexible way, we use the GitHub project sample Amazon Athena UDFs for text translation and analytics using Amazon Comprehend and Amazon Translate.

After you set up the user-defined functions following the instructions in the GitHub repo, run the following SQL query in Athena to create the new table qs_ml_blog_db.saas_sales_ml with the translated columns using the translate_text user-defined function and some other minor changes:

CREATE TABLE qs_ml_blog_db.saas_sales_ml WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://<MY_BUCKET>/saas-sales-ml/'
) AS 
USING EXTERNAL FUNCTION translate_text(text_col VARCHAR, sourcelang VARCHAR, targetlang VARCHAR, terminologyname VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT 
row_id,
order_id,
date_parse("order_date",'%m/%d/%Y') as order_date,
date_key,
contact_name,
country_en,
translate_text(country_en, 'en', 'de', NULL) as country_de,
translate_text(country_en, 'en', 'es', NULL) as country_es,
translate_text(country_en, 'en', 'it', NULL) as country_it,
city_en,
translate_text(city_en, 'en', 'de', NULL) as city_de,
translate_text(city_en, 'en', 'es', NULL) as city_es,
translate_text(city_en, 'en', 'it', NULL) as city_it,
region,
subregion,
customer,
customer_id,
industry_en,
translate_text(industry_en, 'en', 'de', NULL) as industry_de,
translate_text(industry_en, 'en', 'es', NULL) as industry_es,
translate_text(industry_en, 'en', 'it', NULL) as industry_it,
segment,
product,
license,
sales,
quantity,
discount,
profit
FROM qs_ml_blog_db.saas_sales
;

Run three simple queries, one per column, to check the generation of the new columns with the translation was successful. We include a screenshot after each query showing its results.

SELECT 
distinct(country_en),
country_de,
country_es,
country_it
FROM qs_ml_blog_db.saas_sales_ml 
ORDER BY country_en
limit 10
;

Original and translated values for column Country

SELECT 
distinct(city_en),
city_de,
city_es,
city_it
FROM qs_ml_blog_db.saas_sales_ml 
ORDER BY city_en
limit 10
;

Original and translated values for column City

SELECT 
distinct(industry_en),
industry_de,
industry_es,
industry_it
FROM qs_ml_blog_db.saas_sales_ml 
ORDER BY industry_en
limit 10
;

Original and translated values for column Industry

Now you can use the new table saas_sales_ml as input to create a dataset in QuickSight.

Create a dataset in QuickSight

To create your dataset in QuickSight, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose Create a dataset.
  3. Choose Athena.
  4. For Data source name¸ enter athena_primary.
  5. For Athena workgroup¸ choose primary.
  6. Choose Create data source.
    New Athena data source
  7. Select the saas_sales_ml table previously created and choose Select.
    Choose your table
  8. Choose to import the table to SPICE and choose Visualize to start creating the new dashboard.
    Finish dataset creation

In the analysis section, you receive a message that informs you that the table was successfully imported to SPICE.

SPICE import complete

Create the parameter and control to select the language

To create the parameter and associate the control that you use to select the language for the dashboard, complete the following steps:

  1. In the analysis section, choose Parameters and Create one.
  2. For Name, enter Language.
  3. For Data type, choose String.
  4. For Values, select Single value.
  5. For Static default value, enter English.
  6. Choose Create.
    Create new parameter
  7. To connect the parameter to a control, choose Control.
    Connect parameter to control
  8. For Display name, choose Language.
  9. For Style, choose Dropdown.
  10. For Values, select Specific values.
  11. For Define specific values, enter English, German, Italian, and French (one value per line).
  12. Select Hide Select all option from the control values if the parameter has a default configured.
  13. Choose Add.
    Define control properties

The control is now available, linked to the parameter and displayed in the Controls section of the current sheet in the analysis.

Language control preview

Create dynamic multilingual calculated fields

You’re now ready to create the calculated fields whose value will change based on the currently selected language.

  1. In the menu bar, choose Add and choose Add calculated field.
    Add calculated field
  2. Use the ifelse conditional function to evaluate the value of the Language parameter and select the correct column in the dataset to assign the value to the calculated field.
  3. Create the Country calculated field using the following expression:
    ifelse(
        ${Language} = 'English', {country_en},
        ${Language} = 'German', {country_de},
        ${Language} = 'Italian', {country_it},
        ${Language} = 'Spanish', {country_es},
        {country_en}
    )

  4. Choose Save.
    Calculated field definition in Amazon QuickSight
  5. Repeat the process for the City calculated field:
    ifelse(
        ${Language} = 'English', {city_en},
        ${Language} = 'German', {city_de},
        ${Language} = 'Italian', {city_it},
        ${Language} = 'Spanish', {city_es},
        {city_en}
    )

  6. Repeat the process for the Industry calculated field:
    ifelse(
        ${Language} = 'English', {industry_en},
        ${Language} = 'German', {industry_de},
        ${Language} = 'Italian', {industry_it},
        ${Language} = 'Spanish', {industry_es},
        {industry_en}
    )

The calculated fields are now available and ready to use in the analysis.

Calculated fields available in analysis

Create an analysis with calculated multilingual calculated fields

Create an analysis with two donut charts and a pivot table that use the three multilingual fields. In the subtitle of the visuals, use the string Language: <<$Language>> to display the currently selected language. The following screenshot shows our analysis.

Analysis with Language control - English

If you choose a new language from the Language control, the visuals adapt accordingly. The following screenshot shows the analysis in Italian.

Analysis with Language control - Italian

You’re now ready to publish the analysis as a dashboard.

Publish the multilingual dashboard

In the menu bar, choose Share and Publish dashboard.

Publish dashboard menu

Publish the new dashboard as “Multilingual dashboard,” leave the advanced publish options at their default values, and choose Publish dashboard.

Publish dashboard with name

The dashboard is now ready.

Published dashboard

We can take the multilingual features one step further by embedding the dashboard and controlling the parameters in the external page using the Amazon QuickSight Embedding SDK.

Create parametric headers and titles for visuals for use in an embedded dashboard

When embedding an QuickSight dashboard, the locale and parameters’ values can be set programmatically from JavaScript. This can be useful to set default values and change the settings for localization and the default data language. The following steps show how to use these features by modifying the dashboard we have created so far, embedding it in an HTML page, and using the Amazon QuickSight Embedding SDK to dynamically set the value of parameters used to display titles, legends, headers, and more in translated text. The full code for the HTML page is also provided in the appendix of this post.

Create new parameters for the titles and the headers of the visuals in the analysis, the sheet name, visuals legends, and control labels as per the following table.

Name Data type Values Static default value
city String Single value City
country String Single value Country
donut01title String Single value Sales by Country
donut02title String Single value Quantity by Industry
industry String Single value Industry
Language String Single value English
languagecontrollabel String Single value Language
pivottitle String Single value Sales by Country, City and Industy
sales String Single value Sales
sheet001name String Single value Summary View

The parameters are now available on the Parameters menu.

You can now use the parameters inside each sheet title, visual title, legend title, column header, axis label, and more in your analysis. The following screenshots provide examples that illustrate how to insert these parameters into each title.

First, we insert the sheet name.

Then we add the language control name.

We edit the donut charts’ titles.

Donut chart title

We also add the donut charts’ legend titles.

Donut chart legend title

In the following screenshot, we specify the pivot table row names.

Pivot table row names

We also specify the pivot table column names.

Pivot table column names

Publish the analysis to a new dashboard and follow the steps in the post Embed interactive dashboards in your apps and portals in minutes with Amazon QuickSight’s new 1-click embedding feature to embed the dashboard in an HTML page hosted in your website or web application.

The example HTML page provided in the appendix of this post contains one control to switch among the four languages you created in the dataset in the previous sections with the option to automatically sync the QuickSight UI locale when changing the language, and one control to independently change the UI locale as required.

The following screenshots provide some examples of combinations of data language and QuickSight UI locale.

The following is an example of English data language with the English QuickSight UI locale.

Embedded dashboard with English language and English locale

The following is an example of Italian data language with the synchronized Italian QuickSight UI locale.

Embedded dashboard with Italian language and synced Italian locale

The following screenshot shows German data language with the Japanese QuickSight UI locale.

Embedded dashboard with German language and Japanese locale

Conclusion

This post demonstrated how to automatically translate data using machine learning and build a multilingual dashboard with Athena, QuickSight, and Amazon Translate, and how to add advanced multilingual features with QuickSight embedded dashboards. You can use the same approach to display different values for dimensions as well as metrics depending on the values of one or more parameters.

QuickSight provides a 30-day free trial subscription for four users; you can get started immediately. You can learn more and ask questions about QuickSight in the Amazon QuickSight Community.

Appendix: Embedded dashboard host page

The full code for the HTML page is as follows:

<!DOCTYPE html>
<html>
    <head>
        <title>Amazon QuickSight Multilingual Embedded Dashboard</title>
        <script src="https://unpkg.com/[email protected]/dist/quicksight-embedding-js-sdk.min.js"></script>
        <script type="text/javascript">

            var url = "https://<<YOUR_AMAZON_QUICKSIGHT_REGION>>.quicksight.aws.amazon.com/sn/embed/share/accounts/<<YOUR_AWS_ACCOUNT_ID>>/dashboards/<<DASHBOARD_ID>>?directory_alias=<<YOUR_AMAZON_QUICKSIGHT_ACCOUNT_NAME>>"
            var defaultLanguageOptions = 'en_US'
            var dashboard

            var trns = {
                en_US: {
                    locale: "en-US",
                    language: "English",
                    languagecontrollabel: "Language",
                    sheet001name: "Summary View",
                    sales: "Sales",
                    country: "Country",
                    city: "City",
                    industry: "Industry",
                    quantity: "Quantity",
                    by: "by",
                    and: "and"
                },
                de_DE: {
                    locale: "de-DE",
                    language: "German",
                    languagecontrollabel: "Sprache",
                    sheet001name: "Zusammenfassende Ansicht",
                    sales: "Umsätze",
                    country: "Land",
                    city: "Stadt",
                    industry: "Industrie",
                    quantity: "Anzahl",
                    by: "von",
                    and: "und"
                },
                it_IT: {
                    locale: "it-IT",
                    language: "Italian",
                    languagecontrollabel: "Lingua",
                    sheet001name: "Prospetto Riassuntivo",
                    sales: "Vendite",
                    country: "Paese",
                    city: "Città",
                    industry: "Settore",
                    quantity: "Quantità",
                    by: "per",
                    and: "e"
                },
                es_ES: {
                    locale: "es-ES",
                    language: "Spanish",
                    languagecontrollabel: "Idioma",
                    sheet001name: "Vista de Resumen",
                    sales: "Ventas",
                    country: "Paìs",
                    city: "Ciudad",
                    industry: "Industria",
                    quantity: "Cantidad",
                    by: "por",
                    and: "y"
                }
            }

            function setLanguageParameters(l){

                return {
                            Language: trns[l]['language'],
                            languagecontrollabel: trns[l]['languagecontrollabel'],
                            sheet001name: trns[l]['sheet001name'],
                            donut01title: trns[l]['sales']+" "+trns[l]['by']+" "+trns[l]['country'],
                            donut02title: trns[l]['quantity']+" "+trns[l]['by']+" "+trns[l]['industry'],
                            pivottitle: trns[l]['sales']+" "+trns[l]['by']+" "+trns[l]['country']+", "+trns[l]['city']+" "+trns[l]['and']+" "+trns[l]['industry'],
                            sales: trns[l]['sales'],
                            country: trns[l]['country'],
                            city: trns[l]['city'],
                            industry: trns[l]['industry'],
                        }
            }

            function embedDashboard(lOpts, forceLocale) {

                var languageOptions = defaultLanguageOptions
                if (lOpts) languageOptions = lOpts

                var containerDiv = document.getElementById("embeddingContainer");
                containerDiv.innerHTML = ''

                parameters = setLanguageParameters(languageOptions)

                if(!forceLocale) locale = trns[languageOptions]['locale']
                else locale = forceLocale

                var options = {
                    url: url,
                    container: containerDiv,
                    parameters: parameters,
                    scrolling: "no",
                    height: "AutoFit",
                    loadingHeight: "930px",
                    width: "1024px",
                    locale: locale
                };

                dashboard = QuickSightEmbedding.embedDashboard(options);
            }

            function onLangChange(langSel) {

                var l = langSel.value

                if(!document.getElementById("changeLocale").checked){
                    dashboard.setParameters(setLanguageParameters(l))
                }
                else {
                    var selLocale = document.getElementById("locale")
                    selLocale.value = trns[l]['locale']
                    embedDashboard(l)
                }
            }

            function onLocaleChange(obj) {

                var locl = obj.value
                var lang = document.getElementById("lang").value

                document.getElementById("changeLocale").checked = false
                embedDashboard(lang,locl)

            }

            function onSyncLocaleChange(obj){

                if(obj.checked){
                    var selLocale = document.getElementById('locale')
                    var selLang = document.getElementById('lang').value
                    selLocale.value = trns[selLang]['locale']
                    embedDashboard(selLang, trns[selLang]['locale'])
                }            
            }

        </script>
    </head>

    <body onload="embedDashboard()">

        <div style="text-align: center; width: 1024px;">
            <h2>Amazon QuickSight Multilingual Embedded Dashboard</h2>

            <span>
                <label for="lang">Language</label>
                <select id="lang" name="lang" onchange="onLangChange(this)">
                    <option value="en_US" selected>English</option>
                    <option value="de_DE">German</option>
                    <option value="it_IT">Italian</option>
                    <option value="es_ES">Spanish</option>
                </select>
            </span>

            &nbsp;-&nbsp;
            
            <span>
                <label for="changeLocale">Sync UI Locale with Language</label>
                <input type="checkbox" id="changeLocale" name="changeLocale" onchange="onSyncLocaleChange(this)">
            </span>

            &nbsp;|&nbsp;

            <span>
                <label for="locale">QuickSight UI Locale</label>
                <select id="locale" name="locale" onchange="onLocaleChange(this)">
                    <option value="en-US" selected>English</option>
                    <option value="da-DK">Dansk</option>
                    <option value="de-DE">Deutsch</option>
                    <option value="ja-JP">日本語</option>
                    <option value="es-ES">Español</option>
                    <option value="fr-FR">Français</option>
                    <option value="it-IT">Italiano</option>
                    <option value="nl-NL">Nederlands</option>
                    <option value="nb-NO">Norsk</option>
                    <option value="pt-BR">Português</option>
                    <option value="fi-FI">Suomi</option>
                    <option value="sv-SE">Svenska</option>
                    <option value="ko-KR">한국어</option>
                    <option value="zh-CN">中文 (简体)</option>
                    <option value="zh-TW">中文 (繁體)</option>            
                </select>
            </span>
        </div>

        <div id="embeddingContainer"></div>

    </body>

</html>

About the Author

Author Francesco MarelliFrancesco Marelli is a principal solutions architect at Amazon Web Services. He is specialized in the design and implementation of analytics, data management, and big data systems. Francesco also has a strong experience in systems integration and design and implementation of applications. He is passionate about music, collecting vinyl records, and playing bass.

Running AWS Lambda functions on AWS Outposts using AWS IoT Greengrass

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/running-aws-lambda-functions-on-aws-outposts-using-aws-iot-greengrass/

This blog post is written by Adam Imeson, Sr. Hybrid Edge Specialist Solution Architect.

Today, AWS customers can deploy serverless applications in AWS Regions using a variety of AWS services. Customers can also use AWS Outposts to deploy fully managed AWS infrastructure at virtually any datacenter, colocation space, or on-premises facility.

AWS Outposts extends the cloud by bringing AWS services to customers’ premises to support their hybrid and edge workloads. This post will describe how to deploy Lambda functions on an Outpost using AWS IoT Greengrass.

Consider a customer who has built an application that runs in an AWS Region and depends on AWS Lambda. This customer has a business need to enter a new geographic market, but the nearest AWS Region is not close enough to meet application latency or data residency requirements. AWS Outposts can help this customer extend AWS infrastructure and services to their desired geographic region. This blog post will explain how a customer can move their Lambda-dependent application to an Outpost.

Overview

In this walkthrough you will create a Lambda function that can run on AWS IoT Greengrass and deploy it on an Outpost. This architecture results in an AWS-native Lambda function running on the Outpost.

Architecture overview - Lambda functions on AWS Outposts

Deploying Lambda functions on Outposts rack

Prerequisites: Building a VPC

To get started, build a VPC in the same Region as your Outpost. You can do this with the create VPC option in the AWS console. The workflow allows you to set up a VPC with public and private subnets, an internet gateway, and NAT gateways as necessary. Do not consume all of the available IP space in the VPC with your subnets in this step, because you will still need to create Outposts subnets after this.

Now, build a subnet on your Outpost. You can do this by selecting your Outpost in the Outposts console and choosing Create Subnet in the drop-down Actions menu in the top right.

Confirm subnet details

Choose the VPC you just created and select a CIDR range for your new subnet that doesn’t overlap with the other subnets that are already in the VPC. Once you’ve created the subnet, you need to create a new subnet route table and associate it with your new subnet. Go into the subnet route tables section of the VPC console and create a new route table. Associate the route table with your new subnet. Add a 0.0.0.0/0 route pointing at your VPC’s internet gateway. This sets the subnet up as a public subnet, which for the purposes of this post will make it easier to access the instance you are about to build for Greengrass Core. Depending on your requirements, it may make more sense to set up a private subnet on your Outpost instead. You can also add a route pointing at your Outpost’s local gateway here. Although you won’t be using the local gateway during this walkthrough, adding a route to the local gateway makes it possible to trigger your Outpost-hosted Lambda function with on-premises traffic.

Create a new route table

Associate the route table with the new subnet

Add a 0.0.0.0/0 route pointing at your VPC’s internet gateway

Setup: Launching an instance to run Greengrass Core

Create a new EC2 instance in your Outpost subnet. As long as your Outpost has capacity for your desired instance type, this operation will proceed the same way as any other EC2 instance launch. You can check your Outpost’s capacity in the Outposts console or in Amazon CloudWatch:

I used a c5.large instance running Amazon Linux 2 with 20 GiB of Amazon EBS storage for this walkthough. You can pick a different instance size or a different operating system in accordance with your application’s needs and the AWS IoT Greengrass documentation. For the purposes of this tutorial, we assign a public IP address to the EC2 instance on creation.

Step 1: Installing the AWS IoT Greengrass Core software

Once your EC2 instance is up and running, you will need to install the AWS IoT Greengrass Core software on the instance. Follow the AWS IoT Greengrass documentation to do this. You will need to do the following:

  1. Ensure that your EC2 instance has appropriate AWS permissions to make AWS API calls. You can do this by attaching an instance profile to the instance, or by providing AWS credentials directly to the instance as environment variables, as in the Greengrass documentation.
  2. Log in to your instance.
  3. Install OpenJDK 11. For Amazon Linux 2, you can use sudo amazon-linux-extras install java-openjdk11 to do this.
  4. Create the default system user and group that runs components on the device, with
    sudo useradd —system —create-home ggc_user
    sudo groupadd —system ggc_group
  5. Edit the /etc/sudoers file with sudo visudosuch that the entry for the root user looks like root ALL=(ALL:ALL) ALL
  6. Enable cgroups and enable and mount the memory and devices cgroups. In Amazon Linux 2, you can do this with the grubby utility as follows:
    sudo grubby --args="cgroup_enable=memory cgroup_memory=1 systemd.unified_cgroup_hierarchy=0" --update-kernel /boot/vmlinuz-$(uname -r)
  7. Type sudo reboot to reboot your instance with the cgroup boot parameters enabled.
  8. Log back in to your instance once it has rebooted.
  9. Use this command to download the AWS IoT Greengrass Core software to the instance:
    curl -s https://d2s8p88vqu9w66.cloudfront.net/releases/greengrass-nucleus-latest.zip > greengrass-nucleus-latest.zip
  10. Unzip the AWS IoT Greengrass Core software:
    unzip greengrass-nucleus-latest.zip -d GreengrassInstaller && rm greengrass-nucleus-latest.zip
  11. Run the following command to launch the installer. Replace each argument with appropriate values for your particular deployment, particularly the aws-region and thing-name arguments.
    sudo -E java -Droot="/greengrass/v2" -Dlog.store=FILE \
    -jar ./GreengrassInstaller/lib/Greengrass.jar \
    --aws-region region \
    --thing-name MyGreengrassCore \
    --thing-group-name MyGreengrassCoreGroup \
    --thing-policy-name GreengrassV2IoTThingPolicy \
    --tes-role-name GreengrassV2TokenExchangeRole \
    --tes-role-alias-name GreengrassCoreTokenExchangeRoleAlias \
    --component-default-user ggc_user:ggc_group \
    --provision true \
    --setup-system-service true \
    --deploy-dev-tools true
  12. You have now installed the AWS IoT Greengrass Core software on your EC2 instance. If you type sudo systemctl status greengrass.service then you should see output similar to this:

Step 2: Building and deploying a Lambda function

Now build a Lambda function and deploy it to the new Greengrass Core instance. You can find example local Lambda functions in the aws-greengrass-lambda-functions GitHub repository. This example will use the Hello World Python 3 function from that repo.

  1. Create the Lambda function. Go to the Lambda console, choose Create function, and select the Python 3.8 runtime:

  1. Choose Create function at the bottom of the page. Once your new function has been created, copy the code from the Hello World Python 3 example into your function:

  1. Choose Deploy to deploy your new function’s code.
  2. In the top right, choose Actions and select Publish new version. For this particular function, you would need to create a deployment package with the AWS IoT Greengrass SDK for the function to work on the device. I’ve omitted this step for brevity as it is not a main focus of this post. Please reference the Lambda documentation on deployment packages and the Python-specific deployment package docs if you want to pursue this option.

  1. Go to the AWS IoT Greengrass console and choose Components in the left-side pop-in menu.
  2. On the Components page, choose Create component, and then Import Lambda function. If you prefer to do this programmatically, see the relevant AWS IoT Greengrass documentation or AWS CloudFormation documentation.
  3. Choose your new Lambda function from the drop-down.

Create component

  1. Scroll to the bottom and choose Create component.
  2. Go to the Core devices menu in the left-side nav bar and select your Greengrass Core device. This is the Greengrass Core EC2 instance you set up earlier. Make a note of the core device’s name.

  1. Use the left-side nav bar to go to the Deployments menu. Choose Create to create a new deployment, which will place your Lambda function on your Outpost-hosted core device.
  2. Give the deployment a name and select Core device, providing the name of your core device. Choose Next.

  1. Select your Lambda function and choose Next.

  1. Choose Next again, on both the Configure components and Configure advanced settings On the last page, choose Deploy.

You should see a green message at the top of the screen indicating that your configuration is now being deployed.

Clean up

  1. Delete the Lambda function you created.
  2. Terminate the Greengrass Core EC2 instance.
  3. Delete the VPC.

Conclusion

Many customers use AWS Outposts to expand applications into new geographies. Some customers want to run Lambda-based applications on Outposts. This blog post shows how to use AWS IoT Greengrass to build Lambda functions which run locally on Outposts.

To learn more about Outposts, please contact your AWS representative and visit the Outposts homepage and documentation.

Extending PowerShell on AWS Lambda with other services

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/extending-powershell-on-aws-lambda-with-other-services/

This post expands on the functionality introduced with the PowerShell custom runtime for AWS Lambda. The previous blog explains how the custom runtime approach makes it easier to run Lambda functions written in PowerShell.

You can add additional functionality to your PowerShell serverless applications by importing PowerShell modules, which are shareable packages of code. Build your own modules or import from the wide variety of existing vendor modules to manage your infrastructure and applications.

You can also take advantage of the event-driven nature of Lambda, which allows you to run Lambda functions in response to events. Events can include an object being uploaded to Amazon S3, a message placed on an Amazon SQS queue, a scheduled task using Amazon EventBridge, or an HTTP request from Amazon API Gateway. Lambda functions support event triggers from over 200 AWS services and software as a service (SaaS) applications.

Adding PowerShell modules

You can add PowerShell modules from a number of locations. These can include modules from the AWS Tools for PowerShell, from the PowerShell Gallery, or your own custom modules. Lambda functions access these PowerShell modules within specific folders within the Lambda runtime environment.

You can include PowerShell modules via Lambda layers, within your function code package, or container image. When using .zip archive functions, you can use layers to package and share modules to use with your functions. Layers reduce the size of uploaded deployment archives and can make it faster to deploy your code. You can attach up to five layers to your function, one of which must be the PowerShell custom runtime layer. You can include multiple modules per layer.

The custom runtime configures PowerShell’s PSModulePath environment variable, which contains the list of folder locations to search to find modules. The runtime searches the folders in the following order:

1. User supplied modules as part of function package

You can include PowerShell modules inside the published Lambda function package in a /modules subfolder.

2. User supplied modules as part of Lambda layers

You can publish Lambda layers that include PowerShell modules in a /modules subfolder. This allows you to share modules across functions and accounts. Lambda extracts layers to /opt within the Lambda runtime environment so the modules are located in /opt/modules. This is the preferred solution to use modules with multiple functions.

3. Default/user supplied modules supplied with PowerShell

You can also include additional default modules and add them within a /modules folder within the PowerShell custom runtime layer.

For example, the following function includes four Lambda layers. One layer includes the custom runtime. Three additional layers include further PowerShell modules; the AWS Tools for PowerShell, your own custom modules, and third-party modules. You can also include additional modules with your function code.

Lambda layers

Lambda layers

Within your PowerShell code, you can load modules during the function initialization (init) phase. This initializes the modules before the handler function runs, which speeds up subsequent warm-start invocations.

Adding modules from the AWS Tools for PowerShell

This post shows how to use the AWS Tools for PowerShell to manage your AWS services and resources. The tools are packaged as a set of PowerShell modules that are built on the functionality exposed by the AWS SDK for .NET. You can follow similar packaging steps to add other modules to your functions.

The AWS Tools for PowerShell are available as three distinct packages:

The AWS.Tools package is the preferred modularized version, which allows you to load only the modules for the services you want to use. This reduces package size and function memory usage. The AWS.Tools cmdlets support auto-importing modules without having to call Import-Module first. However, specifically importing the modules during the function init phase is more efficient and can reduce subsequent invoke duration. The AWS.Tools.Common module is required and provides cmdlets for configuration and authentication that are not service specific.

The accompanying GitHub repository contains the code for the custom runtime, along with a number of example applications. There are also module build instructions for adding a number of common PowerShell modules as Lambda layers, including AWS.Tools.

Building an event-driven PowerShell function

The repository contains an example of an event-driven demo application that you can build using serverless services.

A clothing printing company must manage its t-shirt size and color inventory. The printers store t-shirt orders for each day in a CSV file. The inventory service is one service that must receive the CSV file. It parses the file and, for each order, records the details to manage stock deliveries.

The stores upload the files to S3. This automatically invokes a PowerShell Lambda function, which is configured to respond to the S3 ObjectCreated event. The Lambda function receives the S3 object location as part of the $LambdaInput event object. It uses the AWS Tools for PowerShell to download the file from S3. It parses the contents and, for each line in the CSV file, sends the individual order details as an event to an EventBridge event bus.

In this example, there is a single rule to log the event to Amazon CloudWatch Logs to show the received event. However, you could route each order, depending on the order details, to different targets. For example, you can send different color combinations to SQS queues, which the dyeing service can use to order dyes. You could send particular size combinations to another Lambda function that manages cloth orders.

Example event-driven application

Example event-driven application

The previous blog post shows how to use the AWS Serverless Application Model (AWS SAM) to build a Lambda layer, which includes only the AWS.Tools.Common module to run Get-AWSRegion. To build a PowerShell application to process objects from S3 and send events to EventBridge, you can extend this functionality by also including the AWS.Tools.S3 and AWS.Tools.EventBridge modules in a Lambda layer.

Lambda layers, including S3 and EventBridge

Lambda layers, including S3 and EventBridge

Building the AWS Tools for PowerShell layer

You could choose to add these modules and rebuild the existing layer. However, the example in this post creates a new Lambda layer to show how you can have different layers for different module combinations of AWS.Tools. The example also adds the Lambda layer Amazon Resource Name (ARN) to AWS Systems Manager Parameter Store to track deployed layers. This allows you to reference them more easily in infrastructure as code tools.

The repository includes build scripts for both Windows and non-Windows developers. Windows does not natively support Makefiles. When using Windows, you can use either Windows Subsystem for Linux (WSL)Docker Desktop, or native PowerShell.

When using Linux, macOS, WSL, or Docker, the Makefile builds the Lambda layers. After downloading the modules, it also extracts the additional AWS.Tools.S3 and AWS.Tools.EventBridge modules.

# Download AWSToolsLayer module binaries
curl -L -o $(ARTIFACTS_DIR)/AWS.Tools.zip https://sdk-for-net.amazonwebservices.com/ps/v4/latest/AWS.Tools.zip
mkdir -p $(ARTIFACTS_DIR)/modules

# Extract select AWS.Tools modules (AWS.Tools.Common required)
unzip $(ARTIFACTS_DIR)/AWS.Tools.zip 'AWS.Tools.Common/**/*' -d $(ARTIFACTS_DIR)/modules/
unzip $(ARTIFACTS_DIR)/AWS.Tools.zip 'AWS.Tools.S3/**/*' -d $(ARTIFACTS_DIR)/modules/
unzip $(ARTIFACTS_DIR)/AWS.Tools.zip 'AWS.Tools.EventBridge/**/*' -d $(ARTIFACTS_DIR)/modules/

When using native PowerShell on Windows to build the layer, the build-AWSToolsLayer.ps1 script performs the same file copy functionality as the Makefile. You can use this option for Windows without WSL or Docker.

### Extract entire AWS.Tools modules to stage area but only move over select modules
…
Move-Item "$PSScriptRoot\stage\AWS.Tools.Common" "$PSScriptRoot\modules\" -Force
Move-Item "$PSScriptRoot\stage\AWS.Tools.S3" "$PSScriptRoot\modules\" -Force
Move-Item "$PSScriptRoot\stage\AWS.Tools.EventBridge" "$PSScriptRoot\modules\" -Force

The Lambda function code imports the required modules in the function init phase.

Import-Module "AWS.Tools.Common"
Import-Module "AWS.Tools.S3"
Import-Module "AWS.Tools.EventBridge"

For other combinations of AWS.Tools, amend the example build-AWSToolsLayer.ps1 scripts to add the modules you require. You can use a similar download and copy process, or PowerShell’s Save-Module to build layers for modules from other locations.

Building and deploying the event-driven serverless application

Follow the instructions in the GitHub repository to build and deploy the application.

The demo application uses AWS SAM to deploy the following resources:

  1. PowerShell custom runtime.
  2. Additional Lambda layer containing the AWS.Tools.Common, AWS.Tools.S3, and AWS.Tools.EventBridge modules from AWS Tools for PowerShell. The layer ARN is stored in Parameter Store.
  3. S3 bucket to store CSV files.
  4. Lambda function triggered by S3 upload.
  5. Custom EventBridge event bus and rule to send events to CloudWatch Logs.

Testing the event-driven application

Use the AWS CLI or AWS Tools for PowerShell to copy the sample CSV file to S3. Replace BUCKET_NAME with your S3 SourceBucket Name from the AWS SAM outputs.

AWS CLI

aws s3 cp .\test.csv s3://BUCKET_NAME

AWS Tools for PowerShell

Write-S3Object -BucketName BUCKET_NAME -File .\test.csv

The S3 file copy action generates an S3 notification event. This invokes the PowerShell Lambda function, passing the S3 file location details as part of the function $LambdaInput event object.

The function downloads the S3 CSV file, parses the contents, and sends the individual lines to EventBridge, which logs the events to CloudWatch Logs.

Navigate to the CloudWatch Logs group /aws/events/demo-s3-lambda-eventbridge.

You can see the individual orders logged from the CSV file.

EventBridge logs showing CSV lines

EventBridge logs showing CSV lines

Conclusion

You can extend PowerShell Lambda applications to provide additional functionality.

This post shows how to import your own or vendor PowerShell modules and explains how to build Lambda layers for the AWS Tools for PowerShell.

You can also take advantage of the event-driven nature of Lambda to run Lambda functions in response to events. The demo application shows how a clothing printing company builds a PowerShell serverless application to manage its t-shirt size and color inventory.

See the accompanying GitHub repository, which contains the code for the custom runtime, along with additional installation options and additional examples.

Start running PowerShell on Lambda today.

For more serverless learning resources, visit Serverless Land.

Automating detection of security vulnerabilities and bugs in CI/CD pipelines using Amazon CodeGuru Reviewer CLI

Post Syndicated from Akash Verma original https://aws.amazon.com/blogs/devops/automating-detection-of-security-vulnerabilities-and-bugs-in-ci-cd-pipelines-using-amazon-codeguru-reviewer-cli/

Watts S. Humphrey, the father of Software Quality, had famously quipped, “Every business is a software business”. Software is indeed integral to any industry. The engineers who create software are also responsible for making sure that the underlying code adheres to industry and organizational standards, are performant, and are absolved of any security vulnerabilities that could make them susceptible to attack.

Traditionally, security testing has been the forte of a specialized security testing team, who would conduct their tests toward the end of the Software Development lifecycle (SDLC). The adoption of DevSecOps practices meant that security became a shared responsibility between the development and security teams. Now, development teams can, on their own or as advised by their security team, setup and configure various code scanning tools to detect security vulnerabilities much earlier in the software delivery process (aka “Shift Left”). Meanwhile, the practice of Static code analysis and security application testing (SAST) has become a standard part of the SDLC. Furthermore, it’s imperative that the development teams expect SAST tools that are easy to set-up, seamlessly fit into their DevOps infrastructure, and can be configured without requiring assistance from security or DevOps experts.

In this post, we’ll demonstrate how you can leverage Amazon CodeGuru Reviewer Command Line Interface (CLI) to integrate CodeGuru Reviewer into your Jenkins Continuous Integration & Continuous Delivery (CI/CD) pipeline. Note that the solution isn’t limited to Jenkins, and it would be equally useful with any other build automation tool. Moreover, it can be integrated at any stage of your SDLC as part of the White-box testing. For example, you can integrate the CodeGuru Reviewer CLI as part of your software development process, as well as run it on your dev machine before committing the code.

Launched in 2020, CodeGuru Reviewer utilizes machine learning (ML) and automated reasoning to identify security vulnerabilities, inefficient uses of AWS APIs and SDKs, as well as other common coding errors. CodeGuru Reviewer employs a growing set of detectors for Java and Python to provide recommendations via the AWS Console. Customers that leverage the CodeGuru Reviewer CLI within a CI/CD pipeline also receive recommendations in a machine-readable JSON format, as well as HTML.

CodeGuru Reviewer offers native integration with Source Code Management (SCM) systems, such as GitHub, BitBucket, and AWS CodeCommit. However, it can be used with any SCM via its CLI. The CodeGuru Reviewer CLI is a shim layer on top of the AWS Command Line Interface (AWS CLI) that simplifies the interaction with the tool by handling the uploading of artifacts, triggering of the analysis, and fetching of the results, all in a single command.

Many customers, including Mastercard, are benefiting from this new CodeGuru Reviewer CLI.

“During one of our technical retrospectives, we noticed the need to integrate Amazon CodeGuru recommendations in our build pipelines hosted on Jenkins. Not all our developers can run or check CodeGuru recommendations through the AWS console. Incorporating CodeGuru CLI in our build pipelines acts as an important quality gate and ensures that our developers can immediately fix critical issues.”
                                           Claudio Frattari, Lead DevOps at Mastercard

Solution overview

The application deployment workflow starts by placing the application code on a GitHub SCM. To automate the scenario, we have added GitHub to the Jenkins project under the “Source Code” section. We chose the GitHub option, which would clone the chosen GitHub repository in the Jenkins local workspace directory.

In the build stage of the pipeline (see Figure 1), we configure the appropriate build tool to perform the code build and security analysis. In this example, we will be using Maven as the build tool.

Figure 1: Jenkins pipeline with Amazon CodeGuru Reviewer

Figure 1: Jenkins pipeline with Amazon CodeGuru Reviewer

In the post-build stage, we configure the CodeGuru Reviewer CLI to generate the recommendations based on the review.

Lastly, in the concluding stage of the pipeline, we’ll be analyzing the JSON results using jq – a lightweight and flexible command-line JSON processor, and then failing the Jenkins job if we encounter observations that are of a “Critical” severity.

Jenkins will trigger the “CodeGuru Reviewer” (see Figure 1) based review process in the post-build stage, i.e., after the build finishes. Furthermore, you can configure other stages, such as automated testing or deployment, after this stage. Additionally, passing the location of the build artifacts to the CLI lets CodeGuru Reviewer perform a more in-depth security analysis. Build artifacts are either directories containing jar files (e.g., build/lib for Gradle or /target for Maven) or directories containing class hierarchies (e.g., build/classes/java/main for Gradle).

Walkthrough

Now that we have an overview of the workflow, let’s dive deep and walk you through the following steps in detail:

  1. Installing the CodeGuru Reviewer CLI
  2. Creating a Jenkins pipeline job
  3. Reviewing the CodeGuru Reviewer recommendations
  4. Configuring CodeGuru Reviewer CLI’s additional options

1. Installing the CodeGuru CLI Wrapper

a. Prerequisites

To run the CLI, we must have Git, Java, Maven, and the AWS CLI installed. Verify that they’re installed on our machine by running the following commands:

java -version 
mvn --version 
aws --version 
git –-version

If they aren’t installed, then download and install Java here (Amazon Corretto is a no-cost, multiplatform, production-ready distribution of the Open Java Development Kit), Maven from here, and Git from here. Instructions for installing AWS CLI are available here.

We would need to create an Amazon Simple Storage Service (Amazon S3) bucket with the prefix codeguru-reviewer-. Note that the bucket name must begin with the mentioned prefix, since we have used the name pattern in the following AWS Identity and Access Management (IAM) permissions, and CodeGuru Reviewer expects buckets to begin with this prefix. Refer to the following section 4(a) “Specifying S3 bucket name” for more details.

Furthermore, we’ll need working credentials on our machine to interact with our AWS account. Learn more about setting up credentials for AWS here. You can find the minimal permissions to run the CodeGuru Reviewer CLI as follows.

b. Required Permissions

To use the CodeGuru Reviewer CLI, we need at least the following AWS IAM permissions, attached to an AWS IAM User or an AWS IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeguru-reviewer:ListRepositoryAssociations",
                "codeguru-reviewer:AssociateRepository",
                "codeguru-reviewer:DescribeRepositoryAssociation",
                "codeguru-reviewer:CreateCodeReview",
                "codeguru-reviewer:DescribeCodeReview",
                "codeguru-reviewer:ListRecommendations",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucket*",
                "s3:List*",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::codeguru-reviewer-*",
                "arn:aws:s3:::codeguru-reviewer-*/*"
            ],
            "Effect": "Allow"
        }
    ]
}

c.  CLI installation

Please download the latest version of the CodeGuru Reviewer CLI available at GitHub. Then, run the following commands in sequence:

curl -OL https://github.com/aws/aws-codeguru-cli/releases/download/0.0.1/aws-codeguru-cli.zip
unzip aws-codeguru-cli.zip
export PATH=$PATH:./aws-codeguru-cli/bin

d. Using the CLI

The CodeGuru Reviewer CLI only has one required parameter –root-dir (or just -r) to specify to the local directory that should be analyzed. Furthermore, the –src option can be used to specify one or more files in this directory that contain the source code that should be analyzed. In turn, for Java applications, the –build option can be used to specify one or more build directories.

For a demonstration, we’ll analyze the demo application. This will make sure that we’re all set for when we leverage the CLI in Jenkins. To proceed, first we download and install the sample application, as follows:

git clone https://github.com/aws-samples/amazon-codeguru-reviewer-sample-app
cd amazon-codeguru-reviewer-sample-app
mvn clean compile

Now that we have built our demo application, we can use the aws-codeguru-cli CLI command that we added to the path to trigger the code scan:

aws-codeguru-cli --root-dir ./ --build target/classes --src src --output ./output

For additional assistance on the CLI command, reference the readme here.

2.  Creating a Jenkins Pipeline job

CodeGuru Reviewer can be integrated in a Jenkins Pipeline as well as a Freestyle project. In this example, we’re leveraging a Pipeline.

a. Pipeline Job Configuration

  1.  Log in to Jenkins, choose “New Item”, then select “Pipeline” option.
  2. Enter a name for the project (for example, “CodeGuruPipeline”), and choose OK.
Figure 2: Creating a new Jenkins pipeline

Figure 2: Creating a new Jenkins pipeline

  1. On the “Project configuration” page, scroll down to the bottom and find your pipeline. In the pipeline script, paste the following script (or use your own Jenkinsfile). The following example is a valid Jenkinsfile to integrate CodeGuru Reviewer with a project built using Maven.
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                // Get code from a GitHub repository
                git clone https://github.com/aws-samples/amazon-codeguru-reviewer-java-detectors.git

                // Run Maven on a Unix agent
                sh "mvn clean compile"

                // To run Maven on a Windows agent, use following
                // bat "mvn -Dmaven.test.failure.ignore=true clean package"
            }
        }
        stage('CodeGuru Reviewer') {
            steps{
                sh 'ls -lsa *'
                sh 'pwd'
                // Here we’re setting an absolute path, but we can 
                // also use JENKINS environment variables
                sh '''
                    export BASE=/var/jenkins_home/workspace/CodeGuruPipeline/amazon-codeguru-reviewer-java-detectors
                    export SRC=${BASE}/src
                    export OUTPUT = ./output
                    /home/codeguru/aws-codeguru-cli/bin/aws-codeguru-cli --root-dir $BASE --build $BASE/target/classes --src $SRC --output $OUTPUT -c $GIT_PREVIOUS_COMMIT:$GIT_COMMIT --no-prompt
                    '''
            }
        }    
        stage('Checking findings'){
            steps{
                // In this example we are stopping our pipline on  
                // detecting Critical findings. We are using jq 
                // to count occurrences of Critical severity 
                sh '''
                CNT = $(cat ./output/recommendations.json |jq '.[] | select(.severity=="Critical")|.severity' | wc -l)'
                if (( $CNT > 0 )); then
                  echo "Critical findings discovered. Failing."
                  exit 1
                fi
                '''
            }
        }
    }
}
  1. Save the configuration and select “Build now” on the side bar to trigger the build process (see Figure 3).
Figure 3: Jenkins pipeline in triggered state

Figure 3: Jenkins pipeline in triggered state

3. Reviewing the CodeGuru Reviewer recommendations

Once the build process is finished, you can view the review results from CodeGuru Reviewer by selecting the Jenkins build history for the most recent build job. Then, browse to Workspace output. The output is available in JSON and HTML formats (Figure 4).

Figure 4: CodeGuru CLI Output

Figure 4: CodeGuru CLI Output

Snippets from the HTML and JSON reports are displayed in Figure 5 and 6 respectively.

In this example, our pipeline analyzes the JSON results with jq based on severity equal to critical and failing the job if there are any critical findings. Note that this output path is set with the –output option. For instance, the pipeline will fail on noticing the “critical” finding at Line 67 of the EventHandler.java class (Figure 5), flagged due to use of an insecure code. Till the time the code is remediated, the pipeline would prevent the code deployment. The vulnerability could have gone to production undetected, in absence of the tool.

Figure 5: CodeGuru HTML Report

Figure 5: CodeGuru HTML Report

Figure 6: CodeGuru JSON recommendations

Figure 6: CodeGuru JSON recommendations

4.  Configuring CodeGuru Reviewer CLI’s additional options

a.  Specifying Amazon S3 bucket name and policy

CodeGuru Reviewer needs one Amazon S3 bucket for the CLI to store the artifacts while the analysis is running. The artifacts are deleted after the analysis is completed. The same bucket will be reused for all the repositories that are analyzed in the same account and region (unless specified otherwise by the user). Note that CodeGuru Reviewer expects the S3 bucket name to begin with codeguru-reviewer-. At this time, you can’t use a different naming pattern. However, if you want to use a different bucket name, then you can use the –bucket-name option.

Select the Permissions tab of your S3 bucket. Update the Block public access and add the following S3 bucket policy.

Figure 7: S3 bucket settings

Figure 7: S3 bucket settings

S3 bucket policy:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"PublicRead",
         "Effect":"Allow",
         "Principal":"*",
         "Action":"s3:GetObject",
         "Resource":"[Change to ARN for your S3 bucket]/*"
      }
   ]
}

Note that if you must change the bucket’s name, then you can remove the associated S3 bucket in the AWS console under CodeGuru → CI workflows and select Disassociate Workflow.

b.  Analyzing a single commit

The CLI also lets us specify a specific commit range to analyze. This can lead to faster and more cost-effective scans for the incremental code changes, instead of a full repository scan. For example, if we just want to analyze the last commit, we can run:

aws-codeguru-cli -r ./ -s src/main/java -b build/libs -c HEAD^:HEAD --no-prompt

Here, we use the -c option to specify that we only want to analyze the commits between HEAD^ (the previous commit) and HEAD (the current commit). Moreover, we add the –no-prompt option to automatically answer questions by the CLI with yes. This option is useful if we plan to use the CLI in an automated way, such as in our CI/CD workflow.

c.  Encrypting artifacts

CodeGuru Reviewer lets us use a customer managed key to encrypt the content of the S3 bucket that is used to store the source and build artifacts. To achieve this, create a customer owned key in AWS Key Management Service (AWS KMS) (see Figure 8).

Figure 8: KMS settings

Figure 8: KMS settings

We must grant CodeGuru Reviewer the permission to decrypt artifacts with this key by adding the following Statement to your Key policy:

{
   "Sid":"Allow CodeGuru to use the key to decrypt artifact",
   "Effect":"Allow",
   "Principal":{
      "AWS":"*"
   },
   "Action":[
      "kms:Decrypt",
      "kms:DescribeKey"
   ],
   "Resource":"*",
   "Condition":{
      "StringEquals":{
         "kms:ViaService":"codeguru-reviewer.amazonaws.com",
         "kms:CallerAccount":[
            "YOUR AWS ACCOUNT ID"
         ]
      }
   }
}

Then, enable server-side encryption for the S3 bucket that we’re using with CodeGuru Reviewer (Figure 9).

S3 bucket settings:

Figure 9: S3 bucket encryption settings

Figure 9: S3 bucket encryption settings

After we enable encryption on the bucket, we must delete all the CodeGuru repository associations that use this bucket, and then recreate them by analyzing the repositories while providing the key (as in the following example, Figure 10):

Figure10: CodeGuru CI Workflow

Figure 10: CodeGuru CI Workflow

Note that the first time you check out your repository, it will always trigger a full repository scan. Consider setting the -c option, as this will allow a commit range.

Cleaning Up

At this stage, you may choose to delete the resources created while following this blog, to avoid incurring any unwanted costs.

  1. Delete Amazon S3 bucket.
  2. Delete AWS KMS key.
  3. Delete the Jenkins installation, if not required further.

Conclusion

In this post, we outlined how you can integrate Amazon CodeGuru Reviewer CLI with the Jenkins open-source build automation tool to perform code analysis as part of your code build pipeline and act as a quality gate. We showed you how to create a Jenkins pipeline job and integrate the CodeGuru Reviewer CLI to detect issues in your Java and Python code, as well as access the recommendations for remediating these issues. We presented an example where you can stop the build upon finding critical violations. Furthermore, we discussed how you can specify a commit range to avoid a full repo scan, and how the S3 bucket used by CodeGuru Reviewer to store artifacts can be encrypted using customer managed keys.

The CodeGuru Reviewer CLI offers you a one-line command to scan any code on your machine and retrieve recommendations. You can run the CLI anywhere where you can run AWS commands. In other words, you can use the CLI to integrate CodeGuru Reviewer into your favourite CI tool, as a pre-commit hook, or anywhere else in your workflow. In turn, you can combine CodeGuru Reviewer with Dynamic Application Security Testing (DAST) and Software Composition Analysis (SCA) tools to achieve a hybrid application security testing method that helps you combine the inside-out and outside-in testing approaches, cross-reference results, and detect vulnerabilities that both exist and are exploitable.

Hopefully, you have found this post informative, and the proposed solution useful. If you need helping hands, then AWS Professional Services can help implement this solution in your enterprise, as well as introduce you to our AWS DevOps services and offerings.

About the Authors

Akash Verma

Akash Verma

Akash is a Software Development Engineer 2 at Amazon India. He is passionate about writing clean code and building maintainable software. He also enjoys learning modern technologies. Outside of work, Akash loves to travel, interact with new people, and try different cuisines. He also relishes gardening and watching Stand-up comedy.

Debashish Chakrabarty

Debashish Chakrabarty

Debashish is a Sr. Engagement Manager at AWS Professional Services, India with over 21+ years of experience in various IT roles. At ProServe he leads engagements on Security, App Modernization and Migrations to help ProServe customers accelerate their cloud journey and achieve their business goals. Off work, Debashish has been a Hindi Blogger & Podcaster. He loves binge-watching OTT shows and spending time with family.

David Ernst

David Ernst

David is a Sr. Specialist Solution Architect – DevOps, with 20+ years of experience in designing and implementing software solutions for various industries. David is an automation enthusiast and works with AWS customers to design, deploy, and manage their AWS workloads/architectures.

Manage application security and compliance with the AWS Cloud Development Kit and cdk-nag

Post Syndicated from Rodney Bozo original https://aws.amazon.com/blogs/devops/manage-application-security-and-compliance-with-the-aws-cloud-development-kit-and-cdk-nag/

Infrastructure as Code (IaC) is an important part of Cloud Applications. Developers rely on various Static Application Security Testing (SAST) tools to identify security/compliance issues and mitigate these issues early on, before releasing their applications to production. Additionally, SAST tools often provide reporting mechanisms that can help developers verify compliance during security reviews.

cdk-nag integrates directly into AWS Cloud Development Kit (AWS CDK) applications to provide identification and reporting mechanisms similar to SAST tooling.

This post demonstrates how to integrate cdk-nag into an AWS CDK application to provide continual feedback and help align your applications with best practices.

Overview of cdk-nag

cdk-nag (inspired by cfn_nag) validates that the state of constructs within a given scope comply with a given set of rules. Additionally, cdk-nag provides a rule suppression and compliance reporting system. cdk-nag validates constructs by extending AWS CDK Aspects. If you’re interested in learning more about the AWS CDK Aspect system, then you should check out this post.

cdk-nag includes several rule sets (NagPacks) to validate your application against. As of this post, cdk-nag includes the AWS Solutions, HIPAA Security, NIST 800-53 rev 4, NIST 800-53 rev 5, and PCI DSS 3.2.1 NagPacks. You can pick and choose different NagPacks and apply as many as you wish to a given scope.

cdk-nag rules can either be warnings or errors. Both warnings and errors will be displayed in the console and compliance reports. Only unsuppressed errors will prevent applications from deploying with the cdk deploy command.

You can see which rules are implemented in each of the NagPacks in the Rules Documentation in the GitHub repository.

Walkthrough

This walkthrough will setup a minimal AWS CDK v2 application, as well as demonstrate how to apply a NagPack to the application, how to suppress rules, and how to view a report of the findings. Although cdk-nag has support for Python, TypeScript, Java, and .NET AWS CDK applications, we’ll use TypeScript for this walkthrough.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • A local installation of and experience using the AWS CDK.

Create a baseline AWS CDK application

In this section you will create and synthesize a small AWS CDK v2 application with an Amazon Simple Storage Service (Amazon S3) bucket. If you are unfamiliar with using the AWS CDK, then learn how to install and setup the AWS CDK by looking at their open source GitHub repository.

  1. Run the following commands to create the AWS CDK application:
mkdir CdkTest
cd CdkTest
cdk init app --language typescript
  1. Replace the contents of the lib/cdk_test-stack.ts with the following:
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    const bucket = new Bucket(this, 'Bucket')
  }
}
  1. Run the following commands to install dependencies and synthesize our sample app:
npm install
npx cdk synth

You should see an AWS CloudFormation template with an S3 bucket both in your terminal and in cdk.out/CdkTestStack.template.json.

Apply a NagPack in your application

In this section, you’ll install cdk-nag, include the AwsSolutions NagPack in your application, and view the results.

  1. Run the following command to install cdk-nag:
npm install cdk-nag
  1. Replace the contents of the bin/cdk_test.ts with the following:
#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { CdkTestStack } from '../lib/cdk_test-stack';
import { AwsSolutionsChecks } from 'cdk-nag'
import { Aspects } from 'aws-cdk-lib';

const app = new cdk.App();
// Add the cdk-nag AwsSolutions Pack with extra verbose logging enabled.
Aspects.of(app).add(new AwsSolutionsChecks({ verbose: true }))
new CdkTestStack(app, 'CdkTestStack', {});
  1. Run the following command to view the output and generate the compliance report:
npx cdk synth

The output should look similar to the following (Note: SSE stands for Server-side encryption):

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S1: The S3 Bucket has server access logs disabled. The bucket should have server access logging enabled to provide detailed records for the requests that are made to the bucket.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S2: The S3 Bucket does not have public access restricted and blocked. The bucket should have public access restricted and blocked to prevent unauthorized access.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S10: The S3 Bucket does not require requests to use SSL. You can use HTTPS (TLS) to help prevent potential attackers from eavesdropping on or manipulating network traffic using person-in-the-middle or similar attacks. You should allow only encrypted connections over HTTPS (TLS) using the aws:SecureTransport condition on Amazon S3 bucket policies.

Found errors

Note that applying the AwsSolutions NagPack to the application rendered several errors in the console (AwsSolutions-S1, AwsSolutions-S2, AwsSolutions-S3, and AwsSolutions-S10). Furthermore, the cdk.out/AwsSolutions-CdkTestStack-NagReport.csv contains the errors as well:

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Remediating and suppressing errors

In this section, you’ll remediate the AwsSolutions-S10 error, suppress the  AwsSolutions-S1 error on a Stack level, suppress the  AwsSolutions-S2 error on a Resource level errors, and not remediate the  AwsSolutions-S3 error and view the results.

  1. Replace the contents of the lib/cdk_test-stack.ts with the following:
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { NagSuppressions } from 'cdk-nag'

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // The local scope 'this' is the Stack. 
    NagSuppressions.addStackSuppressions(this, [
      {
        id: 'AwsSolutions-S1',
        reason: 'Demonstrate a stack level suppression.'
      },
    ])
    // Remediating AwsSolutions-S10 by enforcing SSL on the bucket.
    const bucket = new Bucket(this, 'Bucket', { enforceSSL: true })
    NagSuppressions.addResourceSuppressions(bucket, [
      {
        id: 'AwsSolutions-S2',
        reason: 'Demonstrate a resource level suppression.'
      },
    ])
  }
}
  1. Run the cdk synth command again:
npx cdk synth

The output should look similar to the following:

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

Found errors

The cdk.out/AwsSolutions-CdkTestStack-NagReport.csv contains more details about rule compliance, non-compliance, and suppressions.

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a stack level suppression.","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a resource level suppression.","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Moreover, note that the resultant cdk.out/CdkTestStack.template.json template contains the cdk-nag suppression data. This provides transparency with what rules weren’t applied to an application, as the suppression data is included in the resources.

{
  "Metadata": {
    "cdk_nag": {
      "rules_to_suppress": [
        {
          "id": "AwsSolutions-S1",
          "reason": "Demonstrate a stack level suppression."
        }
      ]
    }
  },
  "Resources": {
    "BucketDEB6E181": {
      "Type": "AWS::S3::Bucket",
      "UpdateReplacePolicy": "Retain",
      "DeletionPolicy": "Retain",
      "Metadata": {
        "aws:cdk:path": "CdkTestStack/Bucket/Resource",
        "cdk_nag": {
          "rules_to_suppress": [
            {
              "id": "AwsSolutions-S2",
              "reason": "Demonstrate a resource level suppression."
            }
          ]
        }
      }
    },
  ...
  },
  ...
}

Reflecting on the Walkthrough

In this section, you learned how to apply a NagPack to your application, remediate/suppress warnings and errors, and review the compliance reports. The reporting and suppression systems provide mechanisms for the development and security teams within organizations to work together to identify and mitigate potential security/compliance issues. Security can choose which NagPacks developers should apply to their applications. Then, developers can use the feedback to quickly remediate issues. Security can use the reports to validate compliances. Furthermore, developers and security can work together to use suppressions to transparently document exceptions to rules that they’ve decided not to follow.

Advanced usage and further reading

This section briefly covers some advanced options for using cdk-nag.

Unit Testing with the AWS CDK Assertions Library

The Annotations submodule of the AWS CDK assertions library lets you check for cdk-nag warnings and errors without AWS credentials by integrating a NagPack into your application unit tests. Read this post for further information about the AWS CDK assertions module. The following is an example of using assertions with a TypeScript AWS CDK application and Jest for unit testing.

import { Annotations, Match } from 'aws-cdk-lib/assertions';
import { App, Aspects, Stack } from 'aws-cdk-lib';
import { AwsSolutionsChecks } from 'cdk-nag';
import { CdkTestStack } from '../lib/cdk_test-stack';

describe('cdk-nag AwsSolutions Pack', () => {
  let stack: Stack;
  let app: App;
  // In this case we can use beforeAll() over beforeEach() since our tests 
  // do not modify the state of the application 
  beforeAll(() => {
    // GIVEN
    app = new App();
    stack = new CdkTestStack(app, 'test');

    // WHEN
    Aspects.of(stack).add(new AwsSolutionsChecks());
  });

  // THEN
  test('No unsuppressed Warnings', () => {
    const warnings = Annotations.fromStack(stack).findWarning(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(warnings).toHaveLength(0);
  });

  test('No unsuppressed Errors', () => {
    const errors = Annotations.fromStack(stack).findError(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(errors).toHaveLength(0);
  });
});

Additionally, many testing frameworks include watch functionality. This is a background process that reruns all of the tests when files in your project have changed for fast feedback. For example, when using the AWS CDK in JavaScript/Typescript, you can use the Jest CLI watch commands. When Jest watch detects a file change, it attempts to run unit tests related to the changed file. This can be used to automatically run cdk-nag-related tests when making changes to your AWS CDK application.

CDK Watch

When developing in non-production environments, consider using AWS CDK Watch with a NagPack for fast feedback. AWS CDK Watch attempts to synthesize and then deploy changes whenever you save changes to your files. Aspects are run during synthesis. Therefore, any NagPacks applied to your application will also run on save. As in the walkthrough, all of the unsuppressed errors will prevent deployments, all of the messages will be output to the console, and all of the compliance reports will be generated. Read this post for further information about AWS CDK Watch.

Conclusion

In this post, you learned how to use cdk-nag in your AWS CDK applications. To learn more about using cdk-nag in your applications, check out the README in the GitHub Repository. If you would like to learn how to create your own rules and NagPacks, then check out the developer documentation. The repository is open source and welcomes community contributions and feedback.

Author:

Arun Donti

Arun Donti is a Senior Software Engineer with Twitch. He loves working on building automated processes and tools that enable builders and organizations to focus on and deliver their mission critical needs. You can find him on GitHub.

Use the AWS Toolkit for Azure DevOps to automate your deployments to AWS

Post Syndicated from Mahmoud Abid original https://aws.amazon.com/blogs/devops/use-the-aws-toolkit-for-azure-devops-to-automate-your-deployments-to-aws/

Many developers today seek to improve productivity by finding better ways to collaborate, enhance code quality and automate repetitive tasks. We hear from some of our customers that they would like to leverage services such as AWS CloudFormation, AWS CodeBuild and other AWS Developer Tools to manage their AWS resources while continuing to use their existing CI/CD pipelines which they are familiar with. These services range from popular open-source solutions, such as Jenkins, to paid commercial solutions, such as Azure DevOps Server (formerly Team Foundation Server (TFS)).

In this post, I will walk you through an example to leverage the AWS Toolkit for Azure DevOps to deploy your Infrastructure as Code templates, i.e. AWS CloudFormation stacks, directly from your existing Azure DevOps build pipelines.

The AWS Toolkit for Azure DevOps is a free-to-use extension for hosted and on-premises Microsoft Azure DevOps that makes it easy to manage and deploy applications using AWS. It integrates with many AWS services, including Amazon S3, AWS CodeDeploy, AWS Lambda, AWS CloudFormation, Amazon SQS and others. It can also run commands using the AWS Tools for Windows PowerShell module as well as the AWS CLI.

Solution Overview

The solution described in this post consists of leveraging the AWS Toolkit for Azure DevOps to manage resources on AWS via Infrastructure as Code templates with AWS CloudFormation:

Solution high-level overview

Figure 1. Solution high-level overview

Prerequisites and Assumptions

You will need to go through three main steps in order to set up your environment, which are summarized here and detailed in the toolkit’s user guide:

  • Install the toolkit into your Azure DevOps account or choose Download to install it on an on-premises server (Figure 2).
  • Create an IAM User and download its keys. Keep the principle of least privilege in mind when associating the policy to your user.
  • Create a Service Connection for your project in Azure DevOps. Service connections are how the Azure DevOps tooling manages connecting and providing access to Azure resources. The AWS Toolkit also provides a user interface to configure the AWS credentials used by the service connection (Figure 3).

In addition to the above steps, you will need a sample AWS CloudFormation template to use for testing the deployment such as this sample template creating an EC2 instance. You can find more samples in the Sample Templates page or get started with authoring your own templates.

AWS Toolkit for Azure DevOps in the Visual Studio Marketplace

Figure 2. AWS Toolkit for Azure DevOps in the Visual Studio Marketplace

A new Service Connection of type “AWS” will appear after installing the extension

Figure 3. A new Service Connection of type “AWS” will appear after installing the extension

Model your CI/CD Pipeline to Automate Your Deployments on AWS

One common DevOps model is to have a CI/CD pipeline that deploys an application stack from one environment to another. This model typically includes a Development (or integration) account first, then Staging and finally a Production environment. Let me show you how to make some changes to the service connection configuration to apply this CI/CD model to an Azure DevOps pipeline.

We will create one service connection per AWS account we want to deploy resources to. Figure 4 illustrates the updated solution to showcase multiple AWS Accounts used within the same Azure DevOps pipeline.

Solution overview with multiple target AWS accounts

Figure 4. Solution overview with multiple target AWS accounts

Each service connection will be configured to use a single, target AWS account. This can be done in two ways:

  1. Create an IAM User for every AWS target account and supply the access key ID and secret access key for that user.
  2. Alternatively, create one central IAM User and have it assume an IAM Role for every AWS deployment target. The AWS Toolkit extension enables you to select an IAM Role to assume. This IAM Role can be in the same AWS account as the IAM User or in a different accounts as depicted in Figure 5.
Use a single IAM User to access all other accounts

Figure 5. Use a single IAM User to access all other accounts

Define Your Pipeline Tasks

Once a service connection for your AWS Account is created, you can now add a task to your pipeline that references the service connection created in the previous step. In the example below, I use the CloudFormation Create/Update Stack task to deploy a CloudFormation stack using a template file named my-aws-cloudformation-template.yml:

- task: CloudFormationCreateOrUpdateStack@1
  displayName: 'Create/Update Stack: Development-Deployment'
  inputs:
    awsCredentials: 'development-account'
    regionName:     'eu-central-1'
    stackName:      'my-stack-name'
    useChangeSet:   true
    changeSetName:  'my-stack-name-change-set'
    templateFile:   'my-aws-cloudformation-template.yml'
    templateParametersFile: 'development/parameters.json'
    captureStackOutputs: asVariables
    captureAsSecuredVars: false

I used the service connection that I’ve called development-account and specified the other required information such as the templateFile path for the AWS CloudFormation template. I also specified the optional templateParametersFile path because I used template parameters in my template.

A template parameters file is particularly useful if you need to use custom values in your CloudFormation templates that are different for each stack. This is a common case when deploying the same application stack to different environments (Development, Staging, and Production).

The task below will to deploy the same template to a Staging environment:

- task: CloudFormationCreateOrUpdateStack@1
  displayName: 'Create/Update Stack: Staging-Deployment'
  inputs:
    awsCredentials: 'staging-account'
    regionName:     'eu-central-1'
    stackName:      'my-stack-name'
    useChangeSet:   true
    changeSetName:  'my-stack-name-changeset'
    templateFile:   'my-aws-cloudformation-template.yml'
    templateParametersFile: 'staging/parameters.json'
    captureStackOutputs: asVariables
    captureAsSecuredVars: false

The differences between Development and Staging deployment tasks are the service connection name and template parameters file path used. Remember that each service connection points to a different AWS account and the corresponding parameter values are specific to the target environment.

Use Azure DevOps Parameters to Switch Between Your AWS Accounts

Azure DevOps lets you define reusable contents via pipeline templates and pass different variable values to them when defining the build tasks. You can leverage this functionality so that you easily replicate your deployment steps to your different environments.

In the pipeline template snippet below, I use three template parameters that are passed as input to my task definition:

# File pipeline-templates/my-application.yml

parameters:
  deploymentEnvironment: ''         # development, staging, production, etc
  awsCredentials:        ''         # service connection name
  region:                ''         # the AWS region

steps:

- task: CloudFormationCreateOrUpdateStack@1
  displayName: 'Create/Update Stack: Staging-Deployment'
  inputs:
    awsCredentials: '${{ parameters.awsCredentials }}'
    regionName:     '${{ parameters.region }}'
    stackName:      'my-stack-name'
    useChangeSet:   true
    changeSetName:  'my-stack-name-changeset'
    templateFile:   'my-aws-cloudformation-template.yml'
    templateParametersFile: '${{ parameters.deploymentEnvironment }}/parameters.json'
    captureStackOutputs: asVariables
    captureAsSecuredVars: false

This template can then be used when defining your pipeline with steps to deploy to the Development and Staging environments. The values passed to the parameters will control the target AWS Account the CloudFormation stack will be deployed to :

# File development/pipeline.yml

container: amazon/aws-cli

trigger:
  branches:
    include:
    - master
    
steps:
- template: ../pipeline-templates/my-application.yml  
  parameters:
    deploymentEnvironment: 'development'
    awsCredentials:        'deployment-development'
    region:                'eu-central-1'
    
- template: ../pipeline-templates/my-application.yml  
  parameters:
    deploymentEnvironment: 'staging'
    awsCredentials:        'deployment-staging'
    region:                'eu-central-1'

Putting it All Together

In the snippet examples below, I defined an Azure DevOps pipeline template that builds a Docker image, pushes it to Amazon ECR (using the ECR Push Task) , creates/updates a stack from an AWS CloudFormation template with a template parameter files, and finally runs a AWS CLI command to list all Load Balancers using the AWS CLI Task.

The template below can be reused across different AWS accounts by simply switching the value of the defined parameters as described in the previous section.

Define a template containing your AWS deployment steps:

# File pipeline-templates/my-application.yml

parameters:
  deploymentEnvironment: ''         # development, staging, production, etc
  awsCredentials:        ''         # service connection name
  region:                ''         # the AWS region

steps:

# Build a Docker image
  - task: Docker@1
    displayName: 'Build docker image'
    inputs:
      dockerfile: 'Dockerfile'
      imageName: 'my-application:${{parameters.deploymentEnvironment}}'

# Push Docker Image to Amazon ECR
  - task: ECRPushImage@1
    displayName: 'Push image to ECR'
    inputs:
      awsCredentials: '${{ parameters.awsCredentials }}'
      regionName:     '${{ parameters.region }}'
      sourceImageName: 'my-application'
      repositoryName: 'my-application'
  
# Deploy AWS CloudFormation Stack
- task: CloudFormationCreateOrUpdateStack@1
  displayName: 'Create/Update Stack: My Application Deployment'
  inputs:
    awsCredentials: '${{ parameters.awsCredentials }}'
    regionName:     '${{ parameters.region }}'
    stackName:      'my-application'
    useChangeSet:   true
    changeSetName:  'my-application-changeset'
    templateFile:   'cfn-templates/my-application-template.yml'
    templateParametersFile: '${{ parameters.deploymentEnvironment }}/my-application-parameters.json'
    captureStackOutputs: asVariables
    captureAsSecuredVars: false
         
# Use AWS CLI to perform commands, e.g. list Load Balancers 
 - task: AWSShellScript@1
    displayName: 'AWS CLI: List Elastic Load Balancers'
    inputs:
    awsCredentials: '${{ parameters.awsCredentials }}'
    regionName:     '${{ parameters.region }}'
    scriptType:     'inline'
    inlineScript:   'aws elbv2 describe-load-balancers'

Define a pipeline file for deploying to the Development account:

# File development/azure-pipelines.yml

container: amazon/aws-cli

variables:
- name:  deploymentEnvironment
  value: 'development'
- name:  awsCredentials
  value: 'deployment-development'
- name:  region
  value: 'eu-central-1'  

trigger:
  branches:
    include:
    - master
    - dev
  paths:
    include:
    - "${{ variables.deploymentEnvironment }}/*"  
    
steps:
- template: ../pipeline-templates/my-application.yml  
  parameters:
    deploymentEnvironment: ${{ variables.deploymentEnvironment }}
    awsCredentials:        ${{ variables.awsCredentials }}
    region:                ${{ variables.region }}

(Optionally) Define a pipeline file for deploying to the Staging and Production accounts

<p># File staging/azure-pipelines.yml</p>
container: amazon/aws-cli

variables:
- name:  deploymentEnvironment
  value: 'staging'
- name:  awsCredentials
  value: 'deployment-staging'
- name:  region
  value: 'eu-central-1'  

trigger:
  branches:
    include:
    - master
  paths:
    include:
    - "${{ variables.deploymentEnvironment }}/*"  
    
    
steps:
- template: ../pipeline-templates/my-application.yml  
  parameters:
    deploymentEnvironment: ${{ variables.deploymentEnvironment }}
    awsCredentials:        ${{ variables.awsCredentials }}
    region:                ${{ variables.region }}
	
# File production/azure-pipelines.yml

container: amazon/aws-cli

variables:
- name:  deploymentEnvironment
  value: 'production'
- name:  awsCredentials
  value: 'deployment-production'
- name:  region
  value: 'eu-central-1'  

trigger:
  branches:
    include:
    - master
  paths:
    include:
    - "${{ variables.deploymentEnvironment }}/*"  
    
    
steps:
- template: ../pipeline-templates/my-application.yml  
  parameters:
    deploymentEnvironment: ${{ variables.deploymentEnvironment }}
    awsCredentials:        ${{ variables.awsCredentials }}
    region:                ${{ variables.region }}

Cleanup

After you have tested and verified your pipeline, you should remove any unused resources by deleting the CloudFormation stacks to avoid unintended account charges. You can delete the stack manually from the AWS Console or use your Azure DevOps pipeline by adding a CloudFormationDeleteStack task:

- task: CloudFormationDeleteStack@1
  displayName: 'Delete Stack: My Application Deployment'
  inputs:
    awsCredentials: '${{ parameters.awsCredentials }}'
    regionName:     '${{ parameters.region }}'
    stackName:      'my-application'       

Conclusion

In this post, I showed you how you can easily leverage the AWS Toolkit for AzureDevOps extension to deploy resources to your AWS account from Azure DevOps and Azure DevOps Server. The story does not end here. This extension integrates directly with others services as well, making it easy to build your pipelines around them:

  • AWSCLI – Interact with the AWSCLI (Windows hosts only)
  • AWS Powershell Module – Interact with AWS through powershell (Windows hosts only)
  • Beanstalk – Deploy ElasticBeanstalk applications
  • CodeDeploy – Deploy with CodeDeploy
  • CloudFormation – Create/Delete/Update CloudFormation stacks
  • ECR – Push an image to an ECR repository
  • Lambda – Deploy from S3, .net core applications, or any other language that builds on Azure DevOps
  • S3 – Upload/Download to/from S3 buckets
  • Secrets Manager – Create and retrieve secrets
  • SQS – Send SQS messages
  • SNS – Send SNS messages
  • Systems manager – Get/set parameters and run commands

The toolkit is an open-source project available in GitHub. We’d love to see your issues, feature requests, code reviews, pull requests, or any positive contribution coming up.

Author:

Mahmoud Abid

Mahmoud Abid is a Senior Customer Delivery Architect at Amazon Web Services. He focuses on designing technical solutions that solve complex business challenges for customers across EMEA. A builder at heart, Mahmoud has been designing large scale applications on AWS since 2011 and, in his spare time, enjoys every DIY opportunity to build something at home or outdoors.

Govern CI/CD best practices via AWS Service Catalog

Post Syndicated from César Prieto Ballester original https://aws.amazon.com/blogs/devops/govern-ci-cd-best-practices-via-aws-service-catalog/

Introduction

AWS Service Catalog enables organizations to create and manage Information Technology (IT) services catalogs that are approved for use on AWS. These IT services can include resources such as virtual machine images, servers, software, and databases to complete multi-tier application architectures. AWS Service Catalog lets you centrally manage deployed IT services and your applications, resources, and metadata , which helps you achieve consistent governance and meet your compliance requirements. In addition,  this configuration enables users to quickly deploy only approved IT services.

In large organizations, as more products are created, Service Catalog management can become exponentially complicated when different teams work on various products. The following solution simplifies Service Catalog products provisioning by considering elements such as shared accounts, roles, or users who can run portfolios or tags in the form of best practices via Continuous Integrations and Continuous Deployment (CI/CD) patterns.

This post demonstrates how Service Catalog Products can be delivered by taking advantage of the main benefits of CI/CD principles along with reducing complexity required to sync services. In this scenario, we have built a CI/CD Pipeline exclusively using AWS Services and the AWS Cloud Development Kit (CDK) Framework to provision the necessary Infrastructure.

Customers need the capability to consume services in a self-service manner, with services built on patterns that follow best practices, including focus areas such as compliance and security. The key tenants for these customers are: the use of infrastructure as code (IaC), and CI/CD. For these reasons, we built a scalable and automated deployment solution covered in this post.Furthermore, this post is also inspired from another post from the AWS community, Building a Continuous Delivery Pipeline for AWS Service Catalog.

Solution Overview

The solution is built using a unified AWS CodeCommit repository with CDK v1 code, which manages and deploys the Service Catalog Product estate. The solution supports the following scenarios: 1) making Products available to accounts and 2) provisioning these Products directly into accounts. The configuration provides flexibility regarding which components must be deployed in accounts as opposed to making a collection of these components available to account owners/users who can in turn build upon and provision them via sharing.

Figure shows the pipeline created comprised of stages

The pipeline created is comprised of the following stages:

  1. Retrieving the code from the repository
  2. Synthesize the CDK code to transform it into a CloudFormation template
  3. Ensure the pipeline is defined correctly
  4. Deploy and/or share the defined Portfolios and Products to a hub account or multiple accounts

Deploying and using the solution

Deploy the pipeline

We have created a Python AWS Cloud Development Kit (AWS CDK) v1 application hosted in a Git Repository. Deploying this application will create the required components described in this post. For a list of the deployment prerequisites, see the project README.

Clone the repository to your local machine. Then, bootstrap and deploy the CDK stack following the next steps.

git clone https://github.com/aws-samples/aws-cdk-service-catalog-pipeline
cd aws-cdk-service-catalog
pip install -r requirements.txt
cdk bootstrap aws://account_id/eu-west-1
cdk deploy

The infrastructure creation takes around 3-5 minutes to complete deploying the AWS CodePipelines and repository creation. Once CDK has deployed the components, you will have a new empty repository where we will define the target Service Catalog estate. To do so, clone the new repository and push our sample code into it:

git clone https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/service-catalog-repo
git checkout -b main
cd service-catalog-repo
cp -aR ../cdk-service-catalog-pipeline/* .
git add .
git commit -am "First commit"
git push origin main

Review and update configuration

Our cdk.json file is used to manage context settings such as shared accounts, permissions, region to deploy, etc.

shared_accounts_ecs: AWS account IDs where the ECS portfolio will be shared
shared_accounts_storage: AWS account IDs where the Storage portfolio will be shared
roles: ARN for the roles who will have permissions to access to the Portfolio
users: ARN for the users who will have permissions to access to the Portfolio
groups: ARN for the groups who will have permissions to access to the Portfolio
hub_account: AWS account ID where the Portfolio will be created
pipeline_account: AWS account ID where the main Infrastructure Pipeline will be created
region: the AWS region to be used for the deployment of the account
"shared_accounts_ecs":["012345678901","012345678902"],
    "shared_accounts_storage":["012345678901","012345678902"],
    "roles":[],
    "users":[],
    "groups":[],
    "hub_account":"012345678901",
    "pipeline_account":"012345678901",
    "region":"eu-west-1"

There are two mechanisms that can be used to create Service Catalog Products in this solution: 1) providing a CloudFormation template or 2) declaring a CDK stack (that will be transformed as part of the pipeline). Our sample contains two Products, each demonstrating one of these options: an Amazon Elastic Container Services (ECS) deployment and an Amazon Simple Storage Service (S3) product.

These Products are automatically shared with accounts specified in the shared_accounts_storage variable. Each product is managed by a CDK Python file in the cdk_service_catalog folder.

Figure shows Pipeline stages that AWS CodePipeline runs through

Figure shows Pipeline stages that AWS CodePipeline runs through

Figure shows Pipeline stages that AWS CodePipeline runs through

The Pipeline stages that AWS CodePipeline runs through are as follows:

  1. Download the AWS CodeCommit code
  2. Synthesize the CDK code to transform it into a CloudFormation template
  3. Auto-modify the Pipeline in case you have made manual changes to it
  4. Display the different Portfolios and Products associated in a Hub account in a Region or in multiple accounts

Adding new Portfolios and Products

To add a new Portfolio to the Pipeline, we recommend creating a new class under cdk_service_catalog similar to cdk_service_catalog_ecs_stack.py from our sample. Once the new class is created with the products you wish to associate, we instantiate the new class inside cdk_pipelines.py, and then add it inside the wave in the stage. There are two ways to create portfolio products. The first one is by creating a CloudFormation template, as can be seen in the Amazon Elastic Container Service (ECS) example.  The second way is by creating a CDK stack that will be transformed into a template, as can be seen in the Storage example.

Product and Portfolio definition:

class ECSCluster(servicecatalog.ProductStack):
    def __init__(self, scope, id):
        super().__init__(scope, id)
        # Parameters for the Product Template
        cluster_name = cdk.CfnParameter(self, "clusterName", type="String", description="The name of the ECS cluster")
        container_insights_enable = cdk.CfnParameter(self, "container_insights", type="String",default="False",allowed_values=["False","True"],description="Enable Container Insights")
        vpc = cdk.CfnParameter(self, "vpc", type="AWS::EC2::VPC::Id", description="VPC")
        ecs.Cluster(self,"ECSCluster_template", enable_fargate_capacity_providers=True,cluster_name=cluster_name.value_as_string,container_insights=bool(container_insights_enable.value_as_string),vpc=vpc)
              cdk.Tags.of(self).add("key", "value")

Clean up

The following will help you clean up all necessary parts of this post: After completing your demo, feel free to delete your stack using the CDK CLI:

cdk destroy --all

Conclusion

In this post, we demonstrated how Service Catalog deployments can be accelerated by building a CI/CD pipeline using self-managed services. The Portfolio & Product estate is defined in its entirety by using Infrastructure-as-Code and automatically deployed based on your configuration. To learn more about AWS CDK Pipelines or AWS Service Catalog, visit the appropriate product documentation.

Authors:

 

César Prieto Ballester

César Prieto Ballester is a Senior DevOps Consultant at AWS. He enjoys automating everything and building infrastructure using code. Apart from work, he plays electric guitar and loves riding his mountain bike.

Daniel Mutale

Daniel Mutale is a Cloud Infrastructure Architect at AWS Professional Services. He enjoys creating cloud based architectures and building out the underlying infrastructure to support the architectures using code. Apart from work, he is an avid animal photographer and has a passion for interior design.

Raphael Sack

Raphael is a technical business development manager for Service Catalog & Control Tower. He enjoys tinkering with automation and code and active member of the management tools community.

Orchestrate big data jobs on on-premises clusters with AWS Step Functions

Post Syndicated from Göksel SARIKAYA original https://aws.amazon.com/blogs/big-data/orchestrate-big-data-jobs-on-on-premises-clusters-with-aws-step-functions/

Customers with specific needs to run big data compute jobs on an on-premises infrastructure often require a scalable orchestration solution. For large-scale distributed compute clusters, the orchestration of jobs must be scalable to maximize their utilization, while at the same time remain resilient to any failures to prevent blocking the ever-growing influx of data and jobs. Moreover, on-premises compute resources can’t be extended on demand, therefore, the jobs may be competing for the same resources with different priorities.

This post showcases serverless building blocks for orchestrating big data jobs using AWS Step Functions, AWS Lambda, and Amazon DynamoDB with a focus on reliability, maintainability, and monitoring. In this solution, Step Functions enables thousands of workflows to run parallel. Additionally, Lambda provides flexibility implementing arbitrary interfaces to the on-premises infrastructure and its compute resources. With additional steps in the orchestration, the solution also allows operations to monitor thousands of parallel jobs in a visual interface for better debugging.

Architecture

The proposed serverless solution consists of the following main components:

  • Job trigger – Requests new compute jobs to run on the on-premises cluster. For simplicity, in this architecture we assume that the trigger is a client calling Step Functions directly. However, you could extend this to include Amazon API Gateway to create a job API to interface with the orchestration solution or a rule engine to trigger jobs when relevant data becomes available.
  • Job manager – This Step Functions workflow runs once per compute job, with multiple workflows running in parallel. It tracks the status of a job from queueing, scheduling, running, retrying, all the way to its completion. Ideally, a job can be scheduled immediately, but workflows can run for days if a job is very low priority and compute resources are sparse. The job manager delegates the decision when or where to run the job to the job queue manager. Communication to the on-premises cluster is abstracted through a Lambda adapter.
  • Job queue manager – Maintains a queue of all jobs. With the given job properties (for example based on priority), the job queue manager decides the running time of jobs, and the cluster on which they run. To illustrate the concept, the architecture considers real-time information on the resource utilization of the compute clusters (memory, CPU) for scheduling. However, you could apply different scheduling algorithms as required given the flexibility of Lambda.
  • On-premises compute cluster – Provides the computing resources, data nodes, and tools to run compute jobs.

The following diagram illustrates the solution architecture.

Solution Architecture

The main process of the solution consists of seven steps:

  1. The job trigger runs a new Step Functions workflow to run a compute job on premises and provides the necessary information (such as priority and required resources).
  2. The job manager creates a new record in DynamoDB to add the job to the queue of the job queue manager, and the workflow waits for the job queue manager to call back.
  3. Amazon EventBridge triggers a scheduled Lambda function in the job queue manager periodically (for example, every 5 minutes), decoupled from the job requests.
  4. The job scheduler Lambda function retrieves real-time information from cluster metrics to see whether jobs can be scheduled at this point in time.
  5. The job scheduler function fetches all queued jobs from DynamoDB and tries to schedule as many of those jobs as possible to available compute resources based on priority, as well as memory and CPU demands.
  6. For each job that can be scheduled to the compute cluster, the job scheduler function communicates back to the job manager to signal the workflow to continue and that the job can be run.
  7. The job manager communicates with the on-premises cluster through the compute cluster adapter Lambda function to run the job, track its status periodically, and retry in case of errors.

On-premises compute cluster

In this post, we assume the on-premises compute cluster offers interfaces to interact with the compute resources. For example, customers could run a Spark compute cluster on premises that allows the following basic interactions through an API:

  • Upload and trigger a compute job on a cluster (for example, upload a Spark JAR file and submit)
  • Get the status of a compute job (such as running, stopped, or error)
  • Get error output in case of failures in the compute job (for example, the job failed due to access denied)

In addition, we assume the cluster can provide metrics on its current utilization. For example, the cluster could provide Prometheus metrics as aggregates over all resources within a compute cluster:

  • Memory utilization (for example, 2 TB with 80% utilization)
  • CPU utilization (for example, 5,000 cores with 50% utilization)

We use the terminology introduced here for the example in this post. Depending on the capabilities of the on-premises cluster, you can adjust these concepts. For example, the compute cluster could use Kubernetes or SLURM instead of Spark.

Job manager

The job manager is responsible for communicating with on-premises clusters to trigger big data jobs and query their status. It’s a Step Functions state machine that consists of three steps, as illustrated in the following figure.

The first step is JobQueueRequest, which makes a request to the job queue manager component and waits for the callback. When the job queue manager sends OK to the waiting step with a callback pattern, the second step StartJobRun runs.

The StartJobRun step communicates with the on-premises environment (for example, via HTTP post to a REST API endpoint) to trigger an on-premises job.

The third step GetJobStatus queries the job status from the on-premises cluster. If the job status is InProgress, the state machine waits for a configured time. When the Wait state is over, it returns to the GetJobStatus step to query the job status again in a loop. When the job returns a successful state from the on-premises cluster, the state machine completes its cycle with a Success state. If the job fails with a timeout or with an error, the state machine completes its cycle with a Fail state.

The following screenshot shows the details of the state machine on the Step Functions console.

Jpob Manager Step Function Inputs

Job queue manager

The job queue manager is responsible for managing job queues based on job priorities and cluster utilization. It consists of DynamoDB, Lambda, and EventBridge.

The JobQueue table keeps data of waiting jobs, including jobId as the primary key, priority as the sort key, needed memory and CPU consumptions, callbackId, and timestamp information. You can add further information to the table dynamically if required by the scheduling algorithm.

The following screenshot shows the attribute details of the JobQueue table.

EventBridge triggers the job scheduler Lambda function on a regular bases in a configured interval. First, the job scheduler function gets waiting jobs data from the JobQueue table in DynamoDB. Then it establishes a connection with the on-premises cluster to fetch cluster metrics such as memory and CPU utilization. Based on this information, the function decides which jobs are ready to be triggered on the on-premises cluster.

The scheduling algorithm proposed here follows a simple concept to maximize resource utilization, while respecting the job priority. Essentially, for an on-premises cluster (we could potentially have multiple in different geographies), the job scheduler Lambda function builds a queue of jobs according to their priority, while allocating the first job in the queue to compute resources on the cluster. If enough resources are available, the scheduler moves to the next job in the queue and repeats.

Due to the flexibility of Lambda functions, you can tailor the scheduling algorithm for a specific use case. Cluster scheduling algorithms are still an open research topic with different optimization goals, such as throughput, data location, fairness, deadlines, and more.

Get started

In this section, we provide a starting point for the solution described in this post. The steps walk you through creating a Step Functions state machine with the appropriate template, and the necessary Lambda and DynamoDB interactions to create the job manager and job queue manager building blocks. Example code for the Lambda functions is excluded from this post, because the communication with the on-premises cluster to trigger jobs can vary depending on your on-premises interface.

  1. On the Step Functions console, choose State machines.
  2. Choose Create state machine.
  3. Select Run a sample project.
  4. Select Job Poller.
    Job Poller State Machine Template
  5. Scroll down to see the sample projects, which are defined using Amazon States Language (ASL).
  6. Review the example definition, then choose Next.Job Manager Step Functions Template
  7. Choose Deploy resources.
    Deployment can take up to 10 minutes.Step Functions Deploy Resources
    The deployment creates the state machine that is responsible for job management. After you deploy the resources, you need to edit the sample ASL code to add the extra JobQueueRequest step in the state machine.
  8. Select the created state machine.
  9. Choose Edit to add ARNs of the three Lambda functions to make a request in the job queue manager (Job Queue Request), to submit a job to the on-premises cluster (Submit Job), and to poll the status of the jobs (Get Job Status).Job Manager Step Functions Definition
    Now you’re ready to create the job queue manager.
  10. On the DynamoDB console, create a table for storing job metadata.
  11. On the EventBridge console, create a scheduled rule that triggers the Lambda function at a configured interval.
  12. On the Lambda console, create the function that communicates with the on-premises cluster to fetch cluster metrics. It also gets jobs from the DynamoDB table to retrieve information including job priorities, required memory, and CPU to run the job on the on-premises cluster.

Limitations

This solution uses Step Functions to track all jobs until completion, and therefore the Step Functions quotas must be considered for potential use cases. Mainly, a workflow can run for a maximum of 1 year (cannot be increased) and by default 1 million parallel runs can run in a single account (can be increased to millions). See Quotas for further details.

Conclusion

This post described how to orchestrate big data jobs running in parallel on on-premises clusters with a Step Functions workflow. To learn more about how to use Step Functions workflows for serverless orchestration, visit Serverless Land.


About the Authors

Göksel Sarikaya is a Senior Cloud Application Architect at AWS Professional Services. He enables customers to design scalable, high-performance, and cost effective applications using the AWS Cloud. He helps them to be more flexible and competitive during their digital transformation journey.

Nicolas Jacob Baer is a Senior Cloud Application Architect with a strong focus on data engineering and machine learning, based in Switzerland. He works closely with enterprise customers to design data platforms and build advanced analytics/ml use-cases.

Shukhrat Khodjaev is a Senior Engagement Manager at AWS ProServe, based out of Berlin. He focuses on delivering engagements in the field of Big Data and AI/ML that enable AWS customers to uncover and to maximize their value through efficient use of data.

Leverage DevOps Guru for RDS to detect anomalies and resolve operational issues

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/devops/leverage-devops-guru-for-rds-to-detect-anomalies-and-resolve-operational-issues/

The Relational Database Management System (RDBMS) is a popular choice among organizations running critical applications that supports online transaction processing (OLTP) use-cases. But managing the RDBMS database comes with its own challenges. AWS has made it easier for organizations to operate these databases in the cloud, thereby addressing the undifferentiated heavy lifting with managed databases (Amazon Aurora, Amazon RDS). Although using managed services has freed up engineering from provisioning hardware, database setup, patching, and backups, they still face the challenges that come with running a highly performant database. As applications scale in size and sophistication, it becomes increasingly challenging for customers to detect and resolve relational database performance bottlenecks and other operational issues quickly.

Amazon RDS Performance Insights is a database performance tuning and monitoring feature, that lets you quickly assess your database load and determine when and where to take action. Performance Insights lets non-experts in database administration diagnose performance problems with an easy-to-understand dashboard that visualizes database load. Furthermore, Performance Insights expands on the existing Amazon RDS monitoring features to illustrate database performance and help analyze any issues that affect it. The Performance Insights dashboard also lets you visualize the database load and filter the load by waits, SQL statements, hosts, or users.

On Dec 1st, 2021, we announced Amazon DevOps Guru for RDS, a new capability for Amazon DevOps Guru. It’s a fully-managed machine learning (ML)-powered service that detects operational and performance related issues for Amazon Aurora engines. It uses the data that it collects from Performance Insights, and then automatically detects and alerts customers of application issues, including database problems. When DevOps Guru detects an issue in an RDS database, it publishes an insight in the DevOps Guru dashboard. The insight contains an anomaly for the resource AWS/RDS. If DevOps Guru for RDS is turned on for your instances, then the anomaly contains a detailed analysis of the problem. DevOps Guru for RDS also recommends that you perform an investigation, or it provides a specific corrective action. For example, the recommendation might be to investigate a specific high-load SQL statement or to scale database resources.

In this post, we’ll deep-dive into some of the common issues that you may encounter while running your workloads against Amazon Aurora MySQL-Compatible Edition databases, with simulated performance issues. We’ll also look at how DevOps Guru for RDS can help identify and resolve these issues. Simulating a performance issue is resource intensive, and it will cost you money to run these tests. If you choose the default options that are provided, and clean up your resources using the following clean-up instructions, then it will cost you approximately $15 to run the first test only. If you wish to run all of the tests, then you can choose “all” in the Tests parameter choice. This will cost you approximately $28 to run all three tests.

Prerequisites

To follow along with this walkthrough, you must have the following prerequisites:

  • An AWS account with a role that has sufficient access to provision the required infrastructure. The account should also not have exceeded its quota for the resources being deployed (VPCs, Amazon Aurora, etc.).
  • Credentials that enable you to interact with your AWS account.
  • If you already have Amazon DevOps Guru turned on, then make sure that it’s tagged properly to detect issues for the resource being deployed.

Solution overview

You will clone the project from GitHub and deploy an AWS CloudFormation template, which will set up the infrastructure required to run the tests. If you choose to use the defaults, then you can run only the first test. If you would like to run all of the tests, then choose the “all” option under Tests parameter.

We simulate some common scenarios that your database might encounter when running enterprise applications. The first test simulates locking issues. The second test simulates the behavior when the AUTOCOMMIT property of the database driver is set to: True. This could result in statement latency. The third test simulates performance issues when an index is missing on a large table.

Solution walk through

Clone the repo and deploy resources

  1. Utilize the following command to clone the GitHub repository that contains the CloudFormation template and the scripts necessary to simulate the database load. Note that by default, we’ve provided the command to run only the first test.
    git clone https://github.com/aws-samples/amazon-devops-guru-rds.git
    cd amazon-devops-guru-rds
    
    aws cloudformation create-stack --stack-name DevOpsGuru-Stack \
        --template-body file://DevOpsGuruMySQL.yaml \
        --capabilities CAPABILITY_IAM \
        --parameters ParameterKey=Tests,ParameterValue=one \
    ParameterKey=EnableDevOpsGuru,ParameterValue=y

    If you wish to run all four of the tests, then flip the ParameterValue of the Tests ParameterKey to “all”.

    If Amazon DevOps Guru is already enabled in your account, then change the ParameterValue of the EnableDevOpsGuru ParameterKey to “n”.

    It may take up to 30 minutes for CloudFormation to provision the necessary resources. Visit the CloudFormation console (make sure to choose the region where you have deployed your resources), and make sure that DevOpsGuru-Stack is in the CREATE_COMPLETE state before proceeding to the next step.

  2. Navigate to AWS Cloud9, then choose Your environments. Next, choose DevOpsGuruMySQLInstance followed by Open IDE. This opens a cloud-based IDE environment where you will be running your tests. Note that in this setup, AWS Cloud9 inherits the credentials that you used to deploy the CloudFormation template.
  3. Open a new terminal window which you will be using to clone the repository where the scripts are located.

  1. Clone the repo into your Cloud9 environment, then navigate to the directory where the scripts are located, and run initial setup.
git clone https://github.com/aws-samples/amazon-devops-guru-rds.git
cd amazon-devops-guru-rds/scripts
sh setup.sh 
# NOTE: If you are running all test cases, use sh setup.sh all command instead. 
source ~/.bashrc
  1. Initialize databases for all of the test cases, and add random data into them. The script to insert random data takes approximately five hours to complete. Your AWS Cloud9 instance is set up to run for up to 24 hours before shutting down. You can exit the browser and return between 5–24 hours to validate that the script ran successfully, then continue to the next step.
source ./connect.sh test 1
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
exit;
python3 ct.py

If you chose to run all test cases, and you ran the sh setup.sh all command in Step 4, open two new terminal windows and run the following commands to insert random data for test cases 2 and 3.

# Test case 2 – Open a new terminal window to run the commands
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 2
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
exit;
python3 ct.py
# Test case 3 - Open a new terminal window to run the commands
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 3
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
exit;
python3 ct.py
  1. Return between 5-24 hours to run the next set of commands.
  1. Add an index to the first database.
source ./connect.sh test 1
CREATE UNIQUE INDEX test1_pk ON test1(id);
INSERT INTO test1 VALUES (-1, 'locker', current_timestamp);
exit;
  1. If you chose to run all test cases, and you ran the sh setup.sh all command in Step 4, add an index to the second database. NOTE: Do no add an index to the third database.
source ./connect.sh test 2
CREATE UNIQUE INDEX test1_pk ON test1(id);
INSERT INTO test1 VALUES (-1, 'locker', current_timestamp);
exit;

DevOps Guru for RDS uses Performance Insights, and it establishes a baseline for the database metrics. Baselining involves analyzing the database performance metrics over a period of time to establish a “normal” behavior. DevOps Guru for RDS then uses ML to detect anomalies against the established baseline. If your workload pattern changes, then DevOps Guru for RDS establishes a new baseline that it uses to detect anomalies against the new “normal”. For new database instances, DevOps Guru for RDS takes up to two days to establish an initial baseline, as it requires an analysis of the database usage patterns and establishing what is considered a normal behavior.

  1. Allow two days before you start running the following tests.

Scenario 1: Locking Issues

In this scenario, multiple sessions compete for the same (“locked”) record, and they must wait for each other.
In real life, this often happens when:

  • A database session gets disconnected due to a (i.e., temporary network) malfunction, while still holding a critical lock.
  • Other sessions become stuck while waiting for the lock to be released.
  • The problem is often exacerbated by the application connection manager that keeps spawning additional sessions (because the existing sessions don’t complete the work on time), thus creating a distinct “inclined slope” pattern that you’ll see in this scenario.

Here’s how you can reproduce it:

  1. Connect to the database.
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 1
  1. In your MySQL, enter the following SQL, and don’t exit the shell.
START TRANSACTION;
UPDATE test1 SET timer=current_timestamp WHERE id=-1;
-- Do NOT exit!
  1. Open a new terminal, and run the command to simulate competing transactions. Give it approximately five minutes before you run the commands in this step.
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 1
exit;
python3 locking_scenario.py 1 1200 2
  1. After the program completes its execution, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. You’ll notice a summary of the insight under Description.

Shows navigation to Amazon DevOps Guru Insights and RDS DB Load Anomalous screen to find the summary description of the anomaly.

  1. Choose the View Recommendations link on the top right, and observe the databases for which it’s showing the recommendations.
  2. Next, choose View detailed analysis for database performance anomaly for the following resources.
  3. Under To view a detailed analysis, choose a resource name, choose the database associated with the first test.

 Shows the detailed analysis of the database performance anomaly. The database experiencing load is chosen, and a graphical representation of how the Average active sessions (AAS) spikes, which Amazon DevOps Guru is able to identify.

  1. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

In this example, DevOps Guru for RDS has detected a high and unusual spike of database load, and then marked it as “performance anomaly”.

Note that the relative size of the anomaly is significant: 490 times higher than the “typical” database load, which is why it’s deemed: “HIGH severity”.

In the analysis section, note that a single “wait event”, wait/synch/mutex/innodb/aurora_lock_thread_slot_futex, is dominating the entire spike. Moreover, a single SQL is “responsible” (or more precisely: “suffering”) from this wait event at the time of the problem. Select the wait event name and see a simple explanation of what’s happening in the database. For example, it’s “record locking”, where multiple sessions are competing for the same database records. Additionally, you can select the SQL hash and see the exact text of the SQL that’s responsible for the issue.

If you’re interested in why DevOps Guru for RDS detected this problem, and why these particular wait events and an SQL were selected, the Why is this a problem? and Why do we recommend this? links will provide the answer.

Finally, the most relevant part of this analysis is a View troubleshooting doc link. It references a document that contains a detailed explanation of the likely causes for this problem, as well as the actions that you can take to troubleshoot and address it.

Scenario 2: Autocommit: ON

In this scenario, we must run multiple batch updates, and we’re using a fairly popular driver setting: AUTOCOMMIT: ON.

This setting can sometimes lead to performance issues as it causes each UPDATE statement in a batch to be “encased” in its own “transaction”. This leads to data changes being frequently synchronized to disk, thus dramatically increasing batch latency.

Here’s how you can reproduce the scenario:

  1. On your Cloud9 terminal, run the following commands:
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 2
exit;
python3 batch_autocommit.py 50 1200 1000 10000000
  1. Once the program completes its execution, or after an hour, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. Then choose Recommendations and choose View detailed analysis for database performance anomaly for the following resources. Under To view a detailed analysis, choose a resource name, choose the database associated with the second test.

  1. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

Note that DevOps Guru for RDS detected a significant (and unusual) spike of database load and marked it as a HIGH severity anomaly.

The spike looks similar to the previous example (albeit, “smaller”), but it describes a different database problem (“COMMIT slowdowns”). This is because of a different database wait event that dominates the spike: wait/io/aurora_redo_log_flush.

As in the previous example, you can select the wait event name to see a simple description of what’s going on, and you can select the SQL hash to see the actual statement that is slow. Furthermore, just as before, the View troubleshooting doc link references the document that describes what you can do to troubleshoot the problem further and address it.

Scenario 3: Missing index

Have you ever wondered what would happen if you drop a frequently accessed index on a large table?

In this relatively simple scenario, we’re testing exactly that – an index gets dropped causing queries to switch from fast index lookups to slow full table scans, thus dramatically increasing latency and resource use.

Here’s how you can reproduce this problem and see it for yourself:

  1. On your Cloud9 terminal, run the following commands:
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 3
exit;
python3 no_index.py 50 1200 1000 10000000
  1. Once the program completes its execution, or after an hour, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. Then choose Recommendations and choose View detailed analysis for database performance anomaly for the following resources. Under To view a detailed analysis, choose a resource name, choose the database associated with the third test.

Shows the detailed analysis of the database performance anomaly. The database experiencing load is chosen and a graphical representation of how the Average active sessions (AAS) spikes which Amazon DevOps Guru is able to identify.

  1. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

As with the previous examples, DevOps Guru for RDS detected a high and unusual spike of database load (in this case, ~ 50 times larger than the “typical” database load). It also identified that a single wait event, wait/io/table/sql/handler, and a single SQL, are responsible for this issue.

The analysis highlights the SQL that you must pay attention to, and it links a detailed troubleshooting document that lists the likely causes and recommended actions for the problems that you see. While it doesn’t tell you that the “missing index” is the real root cause of the issue (this is planned in future versions), it does offer many relevant details that can help you come to that conclusion yourself.

Cleanup

On your terminal where you originally ran the AWS Command Line Interface (AWS CLI) command to create the CloudFormation resources, run the following command:

aws cloudformation delete-stack --stack-name DevOpsGuru-Stack

Conclusion

In this post, you learned how to leverage DevOps Guru for RDS to alert you of any operational issues with recommendations. You simulated some of the commonly encountered, real-world production issues, such as locking contentions, AUTOCOMMIT, and missing indexes. Moreover, you saw how DevOps Guru for RDS helped you detect and resolve these issues. Try this out, and let us know how DevOps Guru for RDS was able to address your use-case.

Authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Simsek Mert

Simsek Mert is a Cloud Application Architect with AWS Professional Services.
Simsek helps customers with their application architecture, containers, serverless applications, leveraging his over 20 years of experience.

Maxym Kharchenko

Maxym Kharchenko is a Principal Database Engineer at AWS. He builds automated monitoring tools that use machine learning to discover and explain performance problems in relational databases.

Jared Keating

Jared Keating is a Senior Cloud Consultant with Amazon Web Services Professional Services. Jared assists customers with their cloud infrastructure, compliance, and automation requirements drawing from his over 20 years of experience in IT.

How to let builders create IAM resources while improving security and agility for your organization

Post Syndicated from Jeb Benson original https://aws.amazon.com/blogs/security/how-to-let-builders-create-iam-resources-while-improving-security-and-agility-for-your-organization/

Many organizations restrict permissions to create and manage AWS Identity and Access Management (IAM) resources to a group of privileged users or a central team. This post explains how you can safely grant these permissions to builders – the people who are developing, testing, launching, and managing cloud infrastructure – to speed up your development, increase your agility, and improve your application security. In addition, you will use an example application stack to see how IAM permissions boundaries can help establish a secure, yet agile work environment for builders.

An example application stack

Defining and creating IAM resources within the application stack allows your builders to craft policies and roles that grant least privilege to application resources. When builders are entitled to create IAM resources, it is straightforward for them to scope policies by referencing the application resources directly in the IAM policies in the same template.

To illustrate this point you will build a simple “hello world” serverless application. The application includes an AWS Step Functions state machine that, once executed, will invoke an AWS Lambda function. You will use this example application along with some IAM policies and IAM roles to illustrate how you can use permissions boundaries to safely grant IAM privileges to builders.

In this example AWS CloudFormation template, the Resource element in MyStateMachineExecutionRole, which is specified as the role for MyStateMachine, includes a reference to the Amazon Resource Name (ARN) of MyLambdaFunction. This is a great example of the principle of least privilege as MyStateMachine will only have permissions to invoke MyLambdaFunction. Making this association is straightforward because the IAM, Step Functions, and Lambda resources are defined together in the same template.

Example application template

AWSTemplateFormatVersion: 2010-09-09
Description: builder-application

Resources:

MyLambdaFunctionExecutionRole:
	Type: AWS::IAM::Role
    Properties:
		AssumeRolePolicyDocument:
			Version: 2012-10-17
			Statement:
				- Effect: Allow
				Principal:
				Service:
					- lambda.amazonaws.com
				Action: sts:AssumeRole
		Policies:
			- PolicyName: MyLambdaFunctionBasicExecutionPolicy
			PolicyDocument:
				Version: 2012-10-17
				Statement:
					- Effect: Allow
					Action:
						- logs:CreateLogGroup
						- logs:CreateLogStream
						- logs:PutLogEvents
					Resource: !Sub arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:*


MyLambdaFunction:
  	Type: AWS::Lambda::Function
	Properties:
		Runtime: python3.8
		Role: !GetAtt MyLambdaFunctionExecutionRole.Arn
		Handler: index.handler
		Code:
			ZipFile: |
				def handler(event, context):
					return("Hello Builder!")
		Timeout: 30

MyStateMachineExecutionRole:
	Type: AWS::IAM::Role
	Properties:
		AssumeRolePolicyDocument:
			Version: 2012-10-17
			Statement:
				- Effect: Allow
				Principal:
				Service:
					- states.amazonaws.com
				Action: sts:AssumeRole
		Policies:
			- PolicyName: StateMachineExecutionPolicy
			PolicyDocument:
				Version: 2012-10-17
				Statement:
					- Effect: Allow
					Action:
						- lambda:InvokeFunction
					Resource: !GetAtt MyLambdaFunction.Arn
      
MyStateMachine:
	Type: AWS::StepFunctions::StateMachine
	Properties:
		DefinitionString: !Sub |
          {
            "StartAt": "State1",
            "States": {
              "State1": {
                "Type": "Task",
                "Resource": "${MyLambdaFunction.Arn}",
                "End": true
              }
            }
          }
		RoleArn: !GetAtt MyStateMachineExecutionRole.Arn

MyLambdaFunctionPermission:
	Type: AWS::Lambda::Permission
	Properties:
	  Action: lambda:InvokeFunction
	  FunctionName: !Ref MyLambdaFunction
	  Principal: states.amazonaws.com
	  SourceArn: !Ref MyStateMachine

In an organization that uses a centralized approach to IAM management, a builder would not be able to deploy this example application because the roles the builders are granted prohibit IAM actions related to creating and managing roles and policies. This creates three key challenges for the organization:

  1. Builders often rely on a security or cloud team to create IAM resources. This approach adds an additional burden to that team which slows down development while builders wait for the roles and policies to be created, and encourages the team to grant overly-broad permissions so that they don’t have to be involved in the precise details of changes on a daily basis.
  2. IAM resources must be created before the rest of the stack, which makes it much more difficult to create least-privilege policies and roles.
  3. The code for IAM and application infrastructure is maintained separately, which requires extra coordination when creating and updating workloads.

Removing these challenges can simplify your cloud development, lower the amount of overhead and coordination required across your organization, and improve your security posture. Further, reducing the burden on your security team can free up time for them to focus on enforcing additional data perimeters and implementing best practices from the security pillar of the AWS Well-Architected Framework.

Use permissions boundaries on application roles created by builders

So how can your organization shift to a decentralized approach to IAM management and allow builders to safely create application IAM policies and roles, as well as prevent builders from being able to escalate their own privileges? This is a scenario where permissions boundaries should be used.

Permissions boundaries allow an IAM policy to be attached to an IAM role to enforce limits on the permissions that role can be granted. The permissions boundary itself does not grant any permissions – it’s just a guardrail that defines the maximum entitlements. You can create a builder policy that requires a specified permissions boundary be attached to any application roles that a builder creates, effectively setting the maximum permissions on any role that a builder can generate. The permissions for any application roles created will be the intersection of the application policies and the permissions boundary associated with the application role. Said another way, a permission granted in the application policy must also be granted in the permissions boundary, or else it will be implicitly denied according to the policy evaluation logic.

The set of permissions required for builder roles (assumed by actual people or CI/CD pipelines) to deploy infrastructure and develop applications are different than those required for application roles (assumed by workloads/applications) to execute within applications and workflows. If you think of permissions in terms of control plane actions involved in creating, deleting, and modifying resources, versus data plane actions needed to execute the daily business of those resources, builder and application roles typically operate more on the control plane and data plane, respectively. While permissions policies attached to application roles should be tightly scoped to include only those actions needed to perform a specific task, policies for permissions boundaries should be highly reusable and include a broad set of permissions across a suite of services that a variety of applications might need. In addition, it is a best practice to not share roles between humans and services. To summarize, policies for builder and application roles should be designed with the following criteria in mind:

Policy Role Permissions Scope Reusability User
builder builder control plane broad high human
application application data plane specific low service
application-boundary application data plane broad high application role

Following the steps in this section, you will create:

  1. A builder policy that grants builders permissions for services needed to deploy and manage applications, and to create and manage IAM policies and roles for those applications.
  2. An application-boundary policy that defines the extent of the permissions any application role created by a builder can have.
  3. A builder role with the builder policy attached as the permissions policy, the application-boundary policy attached as a permissions boundary, and a trust policy that allows anyone in the account to assume the builder role.

Create the builder IAM resources

In this first procedure, you will use CloudFormation to create the builder and application-boundary IAM policies, and the builder IAM role.

To launch the builder IAM stack

  1. Open a Linux terminal with the AWS Command Line Interface (CLI) installed and AWS credentials configured for a user or role that has permissions to create IAM resources.
  2. Launch the builder IAM stack by using the following command:
    aws cloudformation create-stack \
    --stack-name builder-iam \
    --template-url https://awsiammedia.s3.amazonaws.com/public/sample/993-grant-IAM-permissions-for-builders+/builder-iam.json \
    --capabilities CAPABILITY_NAMED_IAM

The builder policy you created uses paths to organize IAM policies and roles into isolated “spaces” that can be specified as resource constraints. It also requires roles to have a permissions boundary attached. This approach helps manage role delegation and prevents builders from escalating their privileges or modifying roles that may be used by other teams.

Note that paths can only be added to IAM resources using the AWS CLI or APIs – not via the console. If this is an issue, another option is to specify that policies and roles start with a specific phrase, for example “application-roles-*”. Just be sure to use different phrases for the builder, application, and permissions boundaries resources to maintain isolation and prevent builders from being able to escalate their privileges.

The builder policy also includes some basic control plane permissions, while the application-boundary policy you created includes permissions used across a suite of services that a typical serverless application might need. However, both policies are only meant to demonstrate the concepts in this blog post. In practice, you will need to create policies that more accurately reflect the permissions needed by builder and application roles in your organization. See the example-permissions-boundaries repository on the AWS Samples GitHub site for more ideas.

Use the builder role to launch an application stack

In this section you will assume the builder role and verify that you can launch the application stack, including the application roles, but only if the required permissions boundary and path are specified.

To test launching a stack that creates application roles

  1. Assume the builder role by using the following set of commands (copy and paste into the terminal as a single block). This step uses the jq program, which is available in most operating system package repositories.
    export AWS_DEFAULT_OUTPUT="json"; \
    role=builder; \
    aws_role=$(aws iam get-role --role-name $role); \
    role_arn=$(echo $aws_role|jq '.Role.Arn'|tr -d '"'); \
    aws_credentials=$(aws sts assume-role --role-arn $role_arn --role-session-name builder-test); \
    export AWS_ACCESS_KEY_ID=$(echo $aws_credentials|jq '.Credentials.AccessKeyId'|tr -d '"'); \
    export AWS_SECRET_ACCESS_KEY=$(echo $aws_credentials|jq '.Credentials.SecretAccessKey'|tr -d '"'); \
    export AWS_SESSION_TOKEN=$(echo $aws_credentials|jq '.Credentials.SessionToken'|tr -d '"')

    You will use this role for all of the following procedures up until the “Clean up” section. If, during the course of this exercise, you get an error that “The security token included in the request is expired”, create a new terminal and repeat this step to get a fresh set of credentials.

    The trust policy for the builder role allows any principal in the account to assume it. You can use a combination of the Principal and Condition attributes to further reduce its scope. Normally builder roles are assumed directly via federation or AWS SSO.

  2. Check that the you have now assumed the builder role by using the following command:
    aws sts get-caller-identity

  3. Launch the example application stack by using the following command:
    aws cloudformation create-stack \
    --stack-name builder-application \
    --template-url https://awsiammedia.s3.amazonaws.com/public/sample/993-grant-IAM-permissions-for-builders+/builder-application-1.yml \
    --capabilities CAPABILITY_NAMED_IAM

  4. If things are set up correctly, the stack will fail. To confirm, check the StackStatus by using the following command – it should show ROLLBACK_IN_PROGRESS or ROLLBACK_COMPLETE.
    aws cloudformation describe-stacks \
    --stack-name builder-application | jq '.Stacks[].StackStatus'

  5. The stack failed to create because the builder role you assumed does not have permissions to create the two application roles without specifying the /application_roles/ path and attaching a permissions boundary with /permissions_boundaries/ in the path. To see the details, use the following command:
    aws cloudformation describe-stack-events \
    --stack-name builder-application | jq '.StackEvents[] | select(.ResourceStatus=="CREATE_FAILED") | .ResourceStatusReason'

  6. If the original stack did not create successfully, you will not be able to update it, so you will need to delete it instead by using the following command:
    aws cloudformation delete-stack \
    --stack-name builder-application

  7. Launch a new stack with an updated template that includes the required path and permissions boundary by using the following command:
    aws cloudformation create-stack \
    --stack-name builder-application \
    --template-url https://awsiammedia.s3.amazonaws.com/public/sample/993-grant-IAM-permissions-for-builders+/builder-application-2.yml \
    --capabilities CAPABILITY_NAMED_IAM

  8. After a few minutes, confirm the stack was successfully created by using the following command and verifying StackStatus is CREATE_COMPLETE:
    aws cloudformation describe-stacks \
    --stack-name builder-application | jq '.Stacks[].StackStatus'

To verify the application is working

  1. Start an execution of the state machine and verify the application is working by using the following commands. Run them one at a time to allow the execution to finish.
    state_machine_arn=$(aws cloudformation describe-stack-resources --stack-name builder-application | jq '.StackResources[] | select (.LogicalResourceId=="MyStateMachine") | .PhysicalResourceId' | tr -d '"')
    
    execution_arn=$(aws stepfunctions start-execution --state-machine-arn $state_machine_arn | jq '.executionArn' | tr -d '"')
    
    aws stepfunctions describe-execution --execution-arn $execution_arn | jq '.output'

    If the output is “\”Hello Builder!\””, then the application is working.

Test that a builder can’t escalate their privileges

In this section, you will test scenarios where a builder attempts, intentionally or not, to escalate their privileges by first modifying the policies attached to the builder role and then extending the permissions of an application role beyond the permissions boundary.

To test updating a builder policy attached to the builder role

  1. Create an environment variable for your AWS account number, which will be used in several of the steps below, using the following command:
    export AWS_ACCOUNT=$(aws sts get-caller-identity | jq '.Account' | tr -d '"')

  2. Retrieve the builder policy and save it to a file by using the following command:
    aws iam get-policy-version \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT:policy/builder_policies/builder \
    --version-id v1 | jq '.PolicyVersion.Document' > builder-policy.json

  3. Add the following JSON block to the “Statement” array in the builder-policy.json file and save the changes.
    {
    	"Sid": "SneakyBuilder",
    	"Effect": "Allow",
    	"Action": "*",
    	"Resource": "*"
    }

  4. Try to update the builder policy and set it as the default version by using the following command:
    aws iam create-policy-version \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT:policy/builder_policies/builder \
    --policy-document file://builder-policy.json \
    --set-as-default

    You should get an AccessDenied error because the builder role can only modify policies in the /application_policies/ space.

To test adding an application policy to the builder role

  1. Save a copy of builder-policy.json as a new file called sneaky-policy.json using the following command:
    cp builder-policy.json sneaky-policy.json

  2. Create the new policy using the following command:
    aws iam create-policy \
    --policy-name sneaky-policy \
    --path /application_policies/ \
    --policy-document file://sneaky-policy.json

    You should not get an error in this step because you are creating a new policy that complies with the resource constraint for the statement that includes the iam:CreatePolicy permission in the builder policy. But, it’s just a policy for now – it can’t have any effect unless attached to a role.

  3. Now try to attach this new policy to the builder role by using the following command:
    aws iam attach-role-policy \
    --role-name builder \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT:policy/application_policies/sneaky-policy

    You should get an error because you’re attempting to attach a policy to a role that’s not in the /application-roles/ space, in this case the builder role.

To test modifying an application role with actions outside the permissions boundary

In this procedure, you will attempt to escalate the privileges of the MyLambdaFunctionExecutionRole by adding an action (s3:CreateBucket) that is outside of the permissions boundary attached to the role and then attempting to execute that action when MyLambdaFunction is invoked.

  1. Update the builder-application stack by using the following command:
    aws cloudformation update-stack \
    --stack-name builder-application \
    --template-url https://awsiammedia.s3.amazonaws.com/public/sample/993-grant-IAM-permissions-for-builders+/builder-application-3.yml \
    --capabilities CAPABILITY_NAMED_IAM

  2. After a few minutes, confirm the stack was successfully updated by using the following command and verifying StackStatus is UPDATE_COMPLETE:
    aws cloudformation describe-stacks \
    --stack-name builder-application | jq '.Stacks[].StackStatus'

  3. Start an execution of the state machine and verify the result by using the following commands. Run them one at a time to allow the execution to finish.
    state_machine_arn=$(aws cloudformation describe-stack-resources --stack-name builder-application | jq '.StackResources[] | select (.LogicalResourceId=="MyStateMachine") | .PhysicalResourceId' | tr -d '"')
    
    execution_arn=$(aws stepfunctions start-execution --state-machine-arn $state_machine_arn | jq '.executionArn' | tr -d '"')
    
    aws stepfunctions describe-execution --execution-arn $execution_arn | jq '.status'

    The status should be FAILED. Remember – the effective permissions for any application roles will be the intersection of the attached permissions policies and the permissions boundary. Thus, this execution failed because even though you were able to modify the inline policy of the Lambda function to add s3:CreateBucket, since that action is not allowed in the application-boundary policy attached to the Lambda as a permissions boundary, the request to create an S3 bucket was denied.

  4. Get the name of the latest log stream by using the following commands:
    lambda_function_name=$(aws cloudformation describe-stack-resources --stack-name builder-application | jq '.StackResources[] | select (.LogicalResourceId=="MyLambdaFunction") | .PhysicalResourceId' | tr -d '"')
    
    aws logs describe-log-streams \
    --log-group-name /aws/lambda/$lambda_function_name \
    --order-by LastEventTime \
    --descending | jq -r '.logStreams[0].logStreamName'

  5. To verify the actual error was due to a lack of permissions, get the event message by using the following command, replacing <value> with the value of the log stream copied in the step above. If using a bash terminal, you will need to escape any dollar signs in <value> with a backslash character:
    aws logs get-log-events \
    --log-group-name /aws/lambda/$lambda_function_name \
    --log-stream-name <value> | jq '.events[] | select (.message | contains("[ERROR]"))'

    The error should read [ERROR] ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied, which confirms that the permissions boundary prevented you from escalating your privileges as a builder via an application role.

You have now verified that you can safely allow builders to create IAM policies and roles!

Clean up

After you have finished testing, clean up the resources created in this example. Because the builder role does not have permissions to delete builder policies and roles, you will need to assume a different role that can manage IAM resources to complete step 3 below. If you create a new terminal session, make sure the AWS_ACCOUNT environment variable is set.

To clean up

  1. Delete the builder-application stack by using the following command:
    aws cloudformation delete-stack --stack-name builder-application

  2. Delete the sneaky-policy by using the following command:
    aws iam delete-policy \
    --policy-arn arn:aws:iam::$AWS_ACCOUNT:policy/application_policies/sneaky-policy

  3. Delete the builder-iam stack by using the following command:
    aws cloudformation delete-stack --stack-name builder-iam

Service control policies

Permissions boundaries are applied to individual IAM users or roles within an account. If your organization has multiple accounts, you must create and maintain these boundaries in each account for each individual user or role. But what if you’d like to apply a subset of these rules or others across some or all of your accounts? In this case, you could use service control policies (SCPs), which are a feature of AWS Organizations, to provide central control over the maximum available permissions for multiple accounts in your organization. By organizing accounts into organizational units (OUs), which are groups of accounts that serve an application or service, you can apply service control policies (SCPs) to create targeted governance boundaries for your OUs. To learn more about creating SCPs, see Get more out of service control policies in a multi-account environment in the AWS Security Blog.

Additional tools

Creating and managing tightly scoped policies and roles is an ongoing process that requires a lot of thought and attention to detail. AWS IAM enables fine-grained access control to AWS services, and permissions boundaries are an advanced feature. There is no substitute for actual testing like you performed in the “Test that a builder can’t escalate their privileges” section above, however you can also use the IAM Policy Simulator as a tool to test policies and determine whether or not specific actions are allowed for a given user, group, or role. Additional tools you can use to create, audit, and update IAM policies include:

  • Access Advisor – to review when services and actions were last accessed.
  • Access Analyzer – to help identify resources in your organization and accounts that are shared with an external identity, validate IAM policies against policy grammar and best practices, and generate IAM policies based on access activity in your AWS CloudTrail logs.
  • AWS Cloud Development Kit (CDK) – has built-in convenience methods to help you follow best practices, including the ability to generate least-privilege policies for cloud applications with a single line of code.
  • Open Source tools like cfn-nag and cdk-nag – to inspect CloudFormation templates and CDK applications for patterns that may indicate insecure infrastructure, for example IAM policies that are too permissive.

Conclusion

In this post, you learned how to put policies and guardrails in place that will allow your organization to grant IAM permissions to builders. These changes will enable your builders to develop and deploy cloud infrastructure and applications more rapidly, and will help strengthen your organization’s security culture by extending the responsibility to a broader group. To learn more about creating, testing, and refining IAM policies and permissions boundaries, see Creating IAM policies, Testing IAM policies, and Refining permissions using access information in the IAM User Guide, and IAM policy types: How and when to use them in the AWS Security Blog.

Amazon EC2 DL1 instances Deep Dive

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/amazon-ec2-dl1-instances-deep-dive/

This post is written by Amr Ragab, Principal Solutions Architect, Amazon EC2.

AWS is excited to announce that the new Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances are now generally available in US-East (N. Virginia) and US-West (Oregon). DL1 provides up to 40% better price performance for training deep learning models as compared to current generation GPU-based EC2 instances. The dl1.24xlarge instance type features eight Intel-Habana Gaudi accelerators, which are custom-built to train deep learning models. Each Gaudi accelerator has 32 GB of high bandwidth memory (HBM2) and a peer-to-peer bidirectional bandwidth of 100 Gbps RoCE, for a total bidirectional interconnect bandwidth of 700 Gbps per card. Further instance specifications are as follows:

Instance Size vCPU Instance Memory (GiB) Gaudi Accelerators Network Bandwidth (Gbps) Total Accelerator Interconnect (Gbs) Local Instance Storage EBS Bandwidth (Gbps)
d1.24xlarge 96 768 8 4×100 Gbps 700 4x1TB NVMe 19

Instance Architecture

System architecture of the amazon ec2 dl1 instances.

As the preceding instance architecture indicates, pairs of Gaudi accelerators (e.g., Gaudi0 and Gaudi1) are attached directly through a PCIe Gen3x16 link. Additionally, peer-to-peer networking via 100 Gbps RoCEv2 links – with seven active links per card – provides a torus configuration with a total of 700 Gbps of interconnect bandwidth per card. This topology is a separate interconnect outside of the two NUMA domains. Furthermore, the instance supports four EFA ENIs and 4x1TB of local NVMe SSD storage. We will provide a peer-direct driver over EFA, which will let you utilize high throughput, low latency peer-direct networking between accelerators across multiple instances to efficiently scale multi-node distributed training workloads.

Quick Start

Quickly get started with DL1 and SynapseAI SDK through with the following options:

1) Habana Deep Learning AMIs provided by AWS.

2) AWS Marketplace AMIs provided by Habana.

3) Using Packer to build a custom Amazon Machine Images (AMI) provided by this GitHub repo. This repo also provides build scripts to create Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS) AMIs.

After selecting an AMI, launch a dl1.24xlarge instance in either us-east-1 or us-west-2. To help identify in which availability zone(s) dl1.24xlarge is available, run the following command:

aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters Name=instance-type,Values=dl1.24xlarge \
--region us-west-2 \
--output table

Once launched, you can connect to the instance over SSH (with the correct security group attached).

Habana Collectives Communication Library (HCL/HCCL)

As part of the Habana SynapseAI SDK, Habana Gaudi’s use the HCCL library for handling the collectives between HPUs. Get more information on HCCL here. On DL1 through the HCL-tests, we can confirm close to 700 Gbps (689 Gbps) per card for the collectives tested as follows.

You can confirm these tests by cloning the github repo here.

Habana DL1 HCCL tests.

Amazon EKS Quick Start

Support for DL1 on Amazon EKS is available today with Amazon EKS versions > 1.19. The following is a quick start to get up and running quickly with DL1.

The following dependencies will be needed:

eksctl – You need version 0.70.0+ of eksctl.
kubectl – You use Kubernetes version 1.20 in this post.

Create EKS cluster:

eksctl create cluster --region us-east-1 --without-nodegroup \
--vpc-public-subnets subnet-037d8e430963c2d3e,subnet-0abe898359a7d43e9

Nodegroup configuration – save the following codeblock to a file called dl1-managed-ng.yaml. Replace the AMI ID in the code block with the AMI created earlier.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: fabulous-rainbow-1635807811
  region: us-west-2

vpc:
  id: vpc-34f1894c
  subnets:
    public:
      endpoint-one:
        id: subnet-4532e73d
      endpoint-two:
        id: subnet-8f8b7dc5

managedNodeGroups:
  - name: dl1-ng-1d
    instanceType: dl1.24xlarge
    volumeSize: 200
    instancePrefix: dl1-ng-1d-worker
    ami: ami-072c632cbbc2255b3
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        cloudWatch: true
    ssh:
      allow: true
      publicKeyName: amrragab-aws
    subnets:
    - endpoint-one
    minSize: 1
    desiredCapacity: 1
    maxSize: 4
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh fabulous-rainbow-1635807811

Create the managed nodegroup with the following command:

eksctl create nodegroup -f dl1-managed-ng.yaml

Once the nodegroup has been completed, you must apply the habana-k8s-device-plugin

kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

Once completed, you should see the Gaudi devices as an allocatable resource in your EKS
cluster, presenting 8 Gaudi accelerators per DL1 node in the cluster.

Allocatable:

attachable-volumes-aws-ebs: 39
cpu:                        95690m
ephemeral-storage:          192188443124
habana.ai/gaudi:            8
hugepages-1Gi:              0
hugepages-2Mi:              30000Mi
memory:                     753055132Ki
pods:                       15

Example Distributed Machine Learning (ML) Workloads

The following tables are examples of Mixed Precision/FP32 training results comparing DL1 to the common GPU instances used for ML training.

Model: ResNet50
Framework: TensorFlow 2
Dataset: Imagenet2012
GitHub: https://github.com/HabanaAI/Model-
References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras

Instance Type Batch Size
Mixed Precision Training Throughput (images/sec)
8x Gaudi – 32 GB (dl1.24xlarge) 256 13036
8x A100 – 40 GB (p4d.24xlarge) 256 17921
8x V100 – 32 GB (p3dn.24xlarge) 256 9685
8x V100 – 16GB (p3.16xlarge) 256 8945

Model: Bert Large – Pretraining
Framework: Pytorch 1.9
Dataset: Wikipedia/BooksCorpus
GitHub: https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/bert

Instance Type Batch Size
@128 Sequence
Length
Mixed Precision Training Throughput (seq/sec)
8x Gaudi – 32 GB (dl1.24xlarge) 256 1318
8x A100 – 40 GB (p4d.24xlarge) 8192 2979
8x V100 – 32 GB (p3dn.24xlarge) 8192 1458
8x V100 – 16GB (p3.16xlarge) 8192 1013

You can find a more comprehensive list of ML models supported with performance data here. Support for containers with TensorFlow and Pytorch are also available. Furthermore, you can stay up-to-date with the operator support for TensorFlow and Pytorch.

CONCLUSION

We are excited to innovate on behalf of our customers and provide a diverse choice in ML accelerators with DL1 instances. The DL1 instances powered by Gaudi accelerators can provide up to 40% better price performance for training deep learning models as compared to current generation GPU-based EC2 instances. DL1 instances use the Habana SynapseAI SDK with framework support in Pytorch and TensorFlow. Additional future support for EFA with peer direct HPUs across nodes will also be supported. Now it’s time to go power up your ML workloads with Amazon EC2 DL1 instances.