Tag Archives: devops

Leverage DevOps Guru for RDS to detect anomalies and resolve operational issues

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/devops/leverage-devops-guru-for-rds-to-detect-anomalies-and-resolve-operational-issues/

A relational database management system (RDBMS) is a popular choice among organizations running critical applications that support online transaction processing (OLTP) use cases. But managing an RDBMS comes with its own challenges. AWS has made it easier for organizations to operate these databases in the cloud, addressing the undifferentiated heavy lifting with managed databases (Amazon Aurora, Amazon RDS). Although managed services free engineering teams from provisioning hardware, database setup, patching, and backups, those teams still face the challenges that come with running a highly performant database. As applications scale in size and sophistication, it becomes increasingly challenging for customers to detect and resolve relational database performance bottlenecks and other operational issues quickly.

Amazon RDS Performance Insights is a database performance tuning and monitoring feature that lets you quickly assess your database load and determine when and where to take action. Performance Insights lets non-experts in database administration diagnose performance problems with an easy-to-understand dashboard that visualizes database load. Furthermore, Performance Insights expands on the existing Amazon RDS monitoring features to illustrate database performance and help analyze any issues that affect it. The Performance Insights dashboard also lets you visualize the database load and filter the load by waits, SQL statements, hosts, or users.

On Dec 1st, 2021, we announced Amazon DevOps Guru for RDS, a new capability for Amazon DevOps Guru. It’s a fully-managed machine learning (ML)-powered service that detects operational and performance related issues for Amazon Aurora engines. It uses the data that it collects from Performance Insights, and then automatically detects and alerts customers of application issues, including database problems. When DevOps Guru detects an issue in an RDS database, it publishes an insight in the DevOps Guru dashboard. The insight contains an anomaly for the resource AWS/RDS. If DevOps Guru for RDS is turned on for your instances, then the anomaly contains a detailed analysis of the problem. DevOps Guru for RDS also recommends that you perform an investigation, or it provides a specific corrective action. For example, the recommendation might be to investigate a specific high-load SQL statement or to scale database resources.

In this post, we’ll deep-dive into some of the common issues that you may encounter while running your workloads against Amazon Aurora MySQL-Compatible Edition databases, with simulated performance issues. We’ll also look at how DevOps Guru for RDS can help identify and resolve these issues. Simulating a performance issue is resource intensive, and it will cost you money to run these tests. If you choose the default options that are provided, and clean up your resources using the following clean-up instructions, then it will cost you approximately $15 to run the first test only. If you wish to run all of the tests, then you can choose “all” in the Tests parameter choice. This will cost you approximately $28 to run all three tests.


Prerequisites

To follow along with this walkthrough, you must have the following prerequisites:

  • An AWS account with a role that has sufficient access to provision the required infrastructure. The account should also not have exceeded its quota for the resources being deployed (VPCs, Amazon Aurora, etc.).
  • Credentials that enable you to interact with your AWS account.
  • If you already have Amazon DevOps Guru turned on, then make sure that it’s tagged properly to detect issues for the resource being deployed.

Solution overview

You will clone the project from GitHub and deploy an AWS CloudFormation template, which will set up the infrastructure required to run the tests. If you choose to use the defaults, then you can run only the first test. If you would like to run all of the tests, then choose the “all” option under Tests parameter.

We simulate some common scenarios that your database might encounter when running enterprise applications. The first test simulates locking issues. The second test simulates the behavior when the AUTOCOMMIT property of the database driver is set to: True. This could result in statement latency. The third test simulates performance issues when an index is missing on a large table.

Solution walk through

Clone the repo and deploy resources

  1. Utilize the following command to clone the GitHub repository that contains the CloudFormation template and the scripts necessary to simulate the database load. Note that by default, we’ve provided the command to run only the first test.
    git clone https://github.com/aws-samples/amazon-devops-guru-rds.git
    cd amazon-devops-guru-rds
    aws cloudformation create-stack --stack-name DevOpsGuru-Stack \
        --template-body file://DevOpsGuruMySQL.yaml \
        --capabilities CAPABILITY_IAM \
        --parameters ParameterKey=Tests,ParameterValue=one

    If you wish to run all three of the tests, then change the ParameterValue of the Tests ParameterKey to “all”.

    If Amazon DevOps Guru is already enabled in your account, then change the ParameterValue of the EnableDevOpsGuru ParameterKey to “n”.

    It may take up to 30 minutes for CloudFormation to provision the necessary resources. Visit the CloudFormation console (make sure to choose the region where you have deployed your resources), and make sure that DevOpsGuru-Stack is in the CREATE_COMPLETE state before proceeding to the next step.

  2. Navigate to AWS Cloud9, then choose Your environments. Next, choose DevOpsGuruMySQLInstance followed by Open IDE. This opens a cloud-based IDE environment where you will be running your tests. Note that in this setup, AWS Cloud9 inherits the credentials that you used to deploy the CloudFormation template.
  3. Open a new terminal window which you will be using to clone the repository where the scripts are located.

  4. Clone the repo into your Cloud9 environment, then navigate to the directory where the scripts are located, and run initial setup.
git clone https://github.com/aws-samples/amazon-devops-guru-rds.git
cd amazon-devops-guru-rds/scripts
sh setup.sh 
# NOTE: If you are running all test cases, use sh setup.sh all command instead. 
source ~/.bashrc
  5. Initialize databases for all of the test cases, and add random data into them. The script to insert random data takes approximately five hours to complete. Your AWS Cloud9 instance is set up to run for up to 24 hours before shutting down. You can exit the browser and return after 5–24 hours to validate that the script ran successfully, then continue to the next step.
source ./connect.sh test 1
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
python3 ct.py

If you chose to run all test cases, and you ran the sh setup.sh all command in Step 4, open two new terminal windows and run the following commands to insert random data for test cases 2 and 3.

# Test case 2 – Open a new terminal window to run the commands
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 2
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
python3 ct.py
# Test case 3 - Open a new terminal window to run the commands
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 3
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
python3 ct.py
  6. Return after 5–24 hours to run the next set of commands.
  7. Add an index to the first database.
source ./connect.sh test 1
CREATE UNIQUE INDEX test1_pk ON test1(id);
INSERT INTO test1 VALUES (-1, 'locker', current_timestamp);
  8. If you chose to run all test cases, and you ran the sh setup.sh all command in Step 4, add an index to the second database. NOTE: Do not add an index to the third database.
source ./connect.sh test 2
CREATE UNIQUE INDEX test1_pk ON test1(id);
INSERT INTO test1 VALUES (-1, 'locker', current_timestamp);

DevOps Guru for RDS uses Performance Insights, and it establishes a baseline for the database metrics. Baselining involves analyzing the database performance metrics over a period of time to establish a “normal” behavior. DevOps Guru for RDS then uses ML to detect anomalies against the established baseline. If your workload pattern changes, then DevOps Guru for RDS establishes a new baseline that it uses to detect anomalies against the new “normal”. For new database instances, DevOps Guru for RDS takes up to two days to establish an initial baseline, as it requires an analysis of the database usage patterns and establishing what is considered a normal behavior.
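
To make the idea of baselining more concrete, here is a deliberately simplified Python sketch. It is not how DevOps Guru for RDS works internally (the service uses ML models that this post does not describe). It only illustrates the general pattern of learning a “normal” range from historical load and flagging values that fall far outside it. The function names, the three-sigma threshold, and the sample values are all assumptions made for illustration.

```python
import statistics

def build_baseline(history):
    """Summarize historical DB load samples as (mean, standard deviation)."""
    return statistics.mean(history), statistics.pstdev(history)

def is_anomalous(value, baseline, sigmas=3.0):
    """Flag a sample more than `sigmas` standard deviations above the mean."""
    mean, stdev = baseline
    return value > mean + sigmas * stdev

# Hypothetical average-active-session samples from a quiet period.
history = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0]
baseline = build_baseline(history)

print(is_anomalous(1.3, baseline))   # False: within the learned range
print(is_anomalous(50.0, baseline))  # True: far above the learned range
```

A fixed statistical threshold like this one is only a stand-in. It does show why a freshly created database needs some history before anomalies can be judged at all.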

  9. Allow two days before you start running the following tests.

Scenario 1: Locking Issues

In this scenario, multiple sessions compete for the same (“locked”) record, and they must wait for each other.
In real life, this often happens when:

  • A database session gets disconnected due to a malfunction (e.g., a temporary network issue), while still holding a critical lock.
  • Other sessions become stuck while waiting for the lock to be released.
  • The problem is often exacerbated by the application connection manager that keeps spawning additional sessions (because the existing sessions don’t complete the work on time), thus creating a distinct “inclined slope” pattern that you’ll see in this scenario.
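
The “inclined slope” pattern described above can be sketched with a small, self-contained Python simulation. This is a toy model under stated assumptions, not the logic of locking_scenario.py: the function name, the tick-based timing, and the fixed number of new sessions per tick are all hypothetical.

```python
def simulate_lock_pileup(ticks, new_sessions_per_tick, lock_released_at=None):
    """Return the number of sessions waiting on the lock after each tick.

    Every tick, the (hypothetical) connection manager opens more sessions.
    While the lock is held, none of them can finish, so they accumulate.
    """
    waiting = 0
    load = []
    for tick in range(ticks):
        if lock_released_at is not None and tick >= lock_released_at:
            waiting = 0  # lock released: blocked sessions drain
        else:
            waiting += new_sessions_per_tick
        load.append(waiting)
    return load

print(simulate_lock_pileup(ticks=5, new_sessions_per_tick=2))
# [2, 4, 6, 8, 10]
print(simulate_lock_pileup(ticks=5, new_sessions_per_tick=2, lock_released_at=3))
# [2, 4, 6, 0, 0]
```

The steadily growing list in the first call corresponds to the rising slope of waiting sessions. The second call shows the load collapsing once the blocking session lets go of the lock.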

Here’s how you can reproduce it:

  1. Connect to the database.
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 1
  2. In your MySQL shell, enter the following SQL statement, and don’t exit the shell.
UPDATE test1 SET timer=current_timestamp WHERE id=-1;
-- Do NOT exit!
  3. Wait approximately five minutes, then open a new terminal and run the following commands to simulate competing transactions.
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 1
python3 locking_scenario.py 1 1200 2
  4. After the program completes its execution, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. You’ll notice a summary of the insight under Description.

Shows navigation to Amazon DevOps Guru Insights and RDS DB Load Anomalous screen to find the summary description of the anomaly.

  5. Choose the View Recommendations link on the top right, and observe the databases for which it’s showing the recommendations.
  6. Next, choose View detailed analysis for database performance anomaly for the following resources.
  7. Under “To view a detailed analysis, choose a resource name”, choose the database associated with the first test.

 Shows the detailed analysis of the database performance anomaly. The database experiencing load is chosen, and a graphical representation of how the Average active sessions (AAS) spikes, which Amazon DevOps Guru is able to identify.

  8. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

In this example, DevOps Guru for RDS has detected a high and unusual spike of database load, and then marked it as “performance anomaly”.

Note that the relative size of the anomaly is significant: 490 times higher than the “typical” database load, which is why it’s deemed: “HIGH severity”.

In the analysis section, note that a single “wait event”, wait/synch/mutex/innodb/aurora_lock_thread_slot_futex, is dominating the entire spike. Moreover, a single SQL statement is “responsible” for (or more precisely: “suffering” from) this wait event at the time of the problem. Select the wait event name to see a simple explanation of what’s happening in the database. In this case, it’s “record locking”, where multiple sessions are competing for the same database records. Additionally, you can select the SQL hash and see the exact text of the SQL that’s responsible for the issue.
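
As a rough illustration of what it means for one wait event to “dominate” the load, the following Python sketch computes average active sessions per wait event from sampled session states and picks the largest contributor. It is a simplified model, not the Performance Insights or DevOps Guru implementation. The function names are hypothetical, and the sample data and short event labels are invented for this example.

```python
from collections import Counter

def load_by_wait_event(samples):
    """Average active sessions per wait event.

    `samples` holds one list per sampling interval; each list names the wait
    event of every session that was active in that interval.
    """
    counts = Counter(event for interval in samples for event in interval)
    return {event: count / len(samples) for event, count in counts.items()}

def dominant_wait_event(samples):
    """Return the wait event that contributes the most load."""
    load = load_by_wait_event(samples)
    return max(load, key=load.get)

# Invented samples: three intervals, mostly sessions stuck on a lock wait.
samples = [
    ["lock_wait", "lock_wait", "cpu"],
    ["lock_wait", "lock_wait", "lock_wait"],
    ["lock_wait", "io"],
]
print(load_by_wait_event(samples))
print(dominant_wait_event(samples))  # lock_wait
```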

If you’re interested in why DevOps Guru for RDS detected this problem, and why these particular wait events and an SQL were selected, the Why is this a problem? and Why do we recommend this? links will provide the answer.

Finally, the most relevant part of this analysis is a View troubleshooting doc link. It references a document that contains a detailed explanation of the likely causes for this problem, as well as the actions that you can take to troubleshoot and address it.

Scenario 2: Autocommit: ON

In this scenario, we must run multiple batch updates, and we’re using a fairly popular driver setting: AUTOCOMMIT: ON.

This setting can sometimes lead to performance issues as it causes each UPDATE statement in a batch to be “encased” in its own “transaction”. This leads to data changes being frequently synchronized to disk, thus dramatically increasing batch latency.
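
The difference between committing every statement and committing once per batch can be sketched in a few lines of Python. This example is an assumption-laden stand-in: it uses an in-memory SQLite database rather than Aurora MySQL, so it does not reproduce the disk synchronization cost described above. It only counts how many commits each approach issues for the same work. The helper names are hypothetical and are not taken from batch_autocommit.py.

```python
import sqlite3

def make_table(n_rows):
    """Create an in-memory table with n_rows rows."""
    # isolation_level=None leaves transaction control to the explicit
    # BEGIN/COMMIT statements issued in run_updates.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE test1 (id INTEGER, filler TEXT)")
    conn.executemany(
        "INSERT INTO test1 VALUES (?, 'x')", [(i,) for i in range(n_rows)]
    )
    return conn

def run_updates(conn, n_rows, batch_size):
    """Update n_rows rows, committing after every batch_size statements.

    Returns the number of commits issued. batch_size=1 mimics autocommit.
    """
    commits = 0
    for start in range(0, n_rows, batch_size):
        conn.execute("BEGIN")
        for row_id in range(start, min(start + batch_size, n_rows)):
            conn.execute(
                "UPDATE test1 SET filler = 'updated' WHERE id = ?", (row_id,)
            )
        conn.execute("COMMIT")
        commits += 1
    return commits

print(run_updates(make_table(1000), 1000, batch_size=1))    # 1000
print(run_updates(make_table(1000), 1000, batch_size=100))  # 10
```

Both runs change the same rows. The autocommit-style run simply asks the database to make its changes durable 100 times more often.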

Here’s how you can reproduce the scenario:

  1. On your Cloud9 terminal, run the following commands:
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 2
python3 batch_autocommit.py 50 1200 1000 10000000
  2. Once the program completes its execution, or after an hour, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. Then choose Recommendations and choose View detailed analysis for database performance anomaly for the following resources. Under “To view a detailed analysis, choose a resource name”, choose the database associated with the second test.

  3. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

Note that DevOps Guru for RDS detected a significant (and unusual) spike of database load and marked it as a HIGH severity anomaly.

The spike looks similar to the previous example (albeit, “smaller”), but it describes a different database problem (“COMMIT slowdowns”). This is because of a different database wait event that dominates the spike: wait/io/aurora_redo_log_flush.

As in the previous example, you can select the wait event name to see a simple description of what’s going on, and you can select the SQL hash to see the actual statement that is slow. Furthermore, just as before, the View troubleshooting doc link references the document that describes what you can do to troubleshoot the problem further and address it.

Scenario 3: Missing index

Have you ever wondered what would happen if you drop a frequently accessed index on a large table?

In this relatively simple scenario, we’re testing exactly that – an index gets dropped causing queries to switch from fast index lookups to slow full table scans, thus dramatically increasing latency and resource use.
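
The switch from an index lookup to a full table scan can be observed directly in a query plan. The sketch below uses an in-memory SQLite database as a stand-in for Aurora MySQL (an assumption made so that the example is self-contained). The table and index names mirror the ones used earlier in this post, while the helper function is hypothetical. The exact plan wording varies between SQLite versions, so the comments describe the plans rather than quote them.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (id INTEGER, filler TEXT, timer TEXT)")
conn.executemany(
    "INSERT INTO test1 VALUES (?, 'x', '')", [(i,) for i in range(1000)]
)

lookup = "SELECT filler FROM test1 WHERE id = 42"

plan_without_index = query_plan(conn, lookup)
conn.execute("CREATE UNIQUE INDEX test1_pk ON test1(id)")
plan_with_index = query_plan(conn, lookup)

print(plan_without_index)  # a full scan of test1
print(plan_with_index)     # a search that uses test1_pk
```

On MySQL-compatible databases, the EXPLAIN statement serves the same purpose for checking whether a query uses an index.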

Here’s how you can reproduce this problem and see it for yourself:

  1. On your Cloud9 terminal, run the following commands:
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 3
python3 no_index.py 50 1200 1000 10000000
  2. Once the program completes its execution, or after an hour, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. Then choose Recommendations and choose View detailed analysis for database performance anomaly for the following resources. Under “To view a detailed analysis, choose a resource name”, choose the database associated with the third test.

Shows the detailed analysis of the database performance anomaly. The database experiencing load is chosen and a graphical representation of how the Average active sessions (AAS) spikes which Amazon DevOps Guru is able to identify.

  3. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

As with the previous examples, DevOps Guru for RDS detected a high and unusual spike of database load (in this case, ~50 times larger than the “typical” database load). It also identified that a single wait event, wait/io/table/sql/handler, and a single SQL statement are responsible for this issue.

The analysis highlights the SQL that you must pay attention to, and it links a detailed troubleshooting document that lists the likely causes and recommended actions for the problems that you see. While it doesn’t tell you that the “missing index” is the real root cause of the issue (this is planned in future versions), it does offer many relevant details that can help you come to that conclusion yourself.


Clean up

In the terminal where you originally ran the AWS Command Line Interface (AWS CLI) command to create the CloudFormation resources, run the following command:

aws cloudformation delete-stack --stack-name DevOpsGuru-Stack


Conclusion

In this post, you learned how to leverage DevOps Guru for RDS to alert you to operational issues and provide recommendations. You simulated some commonly encountered, real-world production issues, such as lock contention, AUTOCOMMIT, and missing indexes. Moreover, you saw how DevOps Guru for RDS helped you detect and resolve these issues. Try this out, and let us know how DevOps Guru for RDS was able to address your use case.


About the authors

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Simsek Mert

Simsek Mert is a Cloud Application Architect with AWS Professional Services. Simsek helps customers with their application architecture, containers, and serverless applications, leveraging more than 20 years of experience.

Maxym Kharchenko

Maxym Kharchenko is a Principal Database Engineer at AWS. He builds automated monitoring tools that use machine learning to discover and explain performance problems in relational databases.

Jared Keating

Jared Keating is a Senior Cloud Consultant with Amazon Web Services Professional Services. Jared assists customers with their cloud infrastructure, compliance, and automation requirements, drawing on more than 20 years of experience in IT.

Integrating with GitHub Actions – CI/CD pipeline to deploy a Web App to Amazon EC2

Post Syndicated from Mahesh Biradar original https://aws.amazon.com/blogs/devops/integrating-with-github-actions-ci-cd-pipeline-to-deploy-a-web-app-to-amazon-ec2/

Many organizations adopt DevOps practices to innovate faster by automating and streamlining their software development and infrastructure management processes. Beyond cultural adoption, DevOps also suggests following certain best practices, and Continuous Integration and Continuous Delivery (CI/CD) is among the most important ones to start with. CI/CD reduces the time it takes to release new software updates by automating deployment activities. Many tools are available to implement this practice. Although AWS has a set of native tools to help achieve your CI/CD goals, it also offers flexibility and extensibility for integrating with numerous third-party tools.

In this post, you will use GitHub Actions to create a CI/CD workflow and AWS CodeDeploy to deploy a sample Java SpringBoot application to Amazon Elastic Compute Cloud (Amazon EC2) instances in an Autoscaling group.

GitHub Actions is a feature on GitHub’s popular development platform that helps you automate your software development workflows in the same place that you store code and collaborate on pull requests and issues. You can write individual tasks called actions, and then combine them to create a custom workflow. Workflows are custom automated processes that you can set up in your repository to build, test, package, release, or deploy any code project on GitHub.

AWS CodeDeploy is a deployment service that automates application deployments to Amazon EC2 instances, on-premises instances, serverless AWS Lambda functions, or Amazon Elastic Container Service (Amazon ECS) services.

Solution Overview

The solution utilizes the following services:

  1. GitHub Actions – Workflow Orchestration tool that will host the Pipeline.
  2. AWS CodeDeploy – AWS service to manage deployment on Amazon EC2 Autoscaling Group.
  3. AWS Auto Scaling – AWS Service to help maintain application availability and elasticity by automatically adding or removing Amazon EC2 instances.
  4. Amazon EC2 – Destination Compute server for the application deployment.
  5. AWS CloudFormation – AWS infrastructure as code (IaC) service used to spin up the initial infrastructure on AWS side.
  6. IAM OIDC identity provider – Federated authentication service to establish trust between GitHub and AWS to allow GitHub Actions to deploy on AWS without maintaining AWS Secrets and credentials.
  7. Amazon Simple Storage Service (Amazon S3) – Object storage for the deployment artifacts.

The following diagram illustrates the architecture for the solution:

Architecture Diagram

  1. Developer commits code changes from their local repo to the GitHub repository. In this post, the GitHub action is triggered manually, but this can be automated.
  2. GitHub action triggers the build stage.
  3. GitHub’s OpenID Connect (OIDC) provider issues a token that the workflow uses to authenticate to AWS and access resources.
  4. GitHub action uploads the deployment artifacts to Amazon S3.
  5. GitHub action invokes CodeDeploy.
  6. CodeDeploy triggers the deployment to Amazon EC2 instances in an Autoscaling group.
  7. CodeDeploy downloads the artifacts from Amazon S3 and deploys to Amazon EC2 instances.


Prerequisites

Before you begin, you must complete the following prerequisites:

  • An AWS account with permissions to create the necessary resources.
  • A GitHub account with permissions to configure GitHub repositories, create workflows, and configure GitHub secrets.
  • A Git client to clone the provided source code.


Walkthrough

The following steps provide a high-level overview of the walkthrough:

  1. Clone the project from the AWS code samples repository.
  2. Deploy the AWS CloudFormation template to create the required services.
  3. Update the source code.
  4. Set up GitHub secrets.
  5. Integrate CodeDeploy with GitHub.
  6. Trigger the GitHub Action to build and deploy the code.
  7. Verify the deployment.

Download the source code

  1. Clone the source code repository aws-codedeploy-github-actions-deployment.

git clone https://github.com/aws-samples/aws-codedeploy-github-actions-deployment.git

  2. Create an empty repository in your personal GitHub account. To create a GitHub repository, see Create a repo. Clone this repo to your computer. You can ignore the warning about cloning an empty repository.

git clone https://github.com/<username>/<repoName>.git

Figure2: Github Clone

  3. Copy the code. We need the contents of the hidden .github folder for the GitHub Actions workflow to work.

cp -r aws-codedeploy-github-actions-deployment/. <new repository>

e.g. GitActionsDeploytoAWS

  4. Now you should have the following folder structure in your local repository.

Figure3: Directory Structure

Repository folder structure

  • The .github folder contains actions defined in the YAML file.
  • The aws/scripts folder contains code to run at the different deployment lifecycle events.
  • The cloudformation folder contains the template.yaml file to create the required AWS resources.
  • Spring-boot-hello-world-example is a sample application used by GitHub actions to build and deploy.
  • The root of the repo contains appspec.yml. CodeDeploy requires this file to perform deployments on Amazon EC2. For more details, see the CodeDeploy AppSpec file documentation.

The following commands will help make sure that your remote repository points to your personal GitHub repository.

git remote remove origin

git remote add origin <your repository url>

git branch -M main

git push -u origin main

Deploy the CloudFormation template

To deploy the CloudFormation template, complete the following steps:

  1. Open AWS CloudFormation console. Enter your account ID, user name, and Password.
  2. Check your region, as this solution uses us-east-1.
  3. If this is a new AWS CloudFormation account, select Create New Stack. Otherwise, select Create Stack.
  4. Select Template is Ready
  5. Select Upload a template file
  6. Select Choose File. Navigate to the template.yaml file in your cloned repository at “aws-codedeploy-github-actions-deployment/cloudformation/template.yaml”.
  7. Select the template.yaml file, and select Next.
  8. In Specify Stack Details, add or modify the values as needed.
    • Stack name = CodeDeployStack.
    • VPC and Subnets = these are pre-populated for you; you can change these values if you prefer to use your own subnets.
    • GitHubThumbprintList = 6938fd4d98bab03faadb97b34396831e3780aea1
    • GitHubRepoName – Name of your GitHub personal repository which you created.

Figure4: CloudFormation Parameters

  9. On the Options page, select Next.
  10. Select the acknowledgement box to allow for the creation of IAM resources, and then select Create. It will take CloudFormation approximately 10 minutes to create all of the resources. The stack creates the following resources:
    • Two Amazon EC2 Linux instances with Tomcat server and the CodeDeploy agent installed
    • Auto Scaling group with an internet-facing Application Load Balancer
    • CodeDeploy application name and deployment group
    • Amazon S3 bucket to store build artifacts
    • Identity and Access Management (IAM) OIDC identity provider
    • Instance profile for Amazon EC2
    • Service role for CodeDeploy
    • Security groups for ALB and Amazon EC2

Update the source code

  1. On the AWS CloudFormation console, select the Outputs tab. Note the Amazon S3 bucket name and the ARN of the GitHub IAM role. You will use these in the next steps.

Figure5: CloudFormation Output

  2. Update the Amazon S3 bucket in the workflow file deploy.yml. Navigate to .github/workflows/deploy.yml from your project root directory.

Replace ##s3-bucket## with the name of the Amazon S3 bucket created previously.

Replace ##region## with your AWS Region.

Figure6: Actions YML
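
If you prefer to script the placeholder replacement described above instead of editing the file by hand, a minimal Python sketch could look like the following. Only the ##s3-bucket## and ##region## markers come from this post. The function name, the sample text, and the key names in it are hypothetical, and the real deploy.yml contains more than this excerpt.

```python
def fill_placeholders(workflow_text, bucket, region):
    """Replace the ##s3-bucket## and ##region## markers in workflow text."""
    return (
        workflow_text
        .replace("##s3-bucket##", bucket)
        .replace("##region##", region)
    )

# Hypothetical excerpt; the key names are placeholders for illustration.
sample = "S3BUCKET: ##s3-bucket##\nAWS_REGION: ##region##\n"
print(fill_placeholders(sample, "my-artifact-bucket", "us-east-1"))
```

To apply it to the real file, you would read .github/workflows/deploy.yml, pass its contents through the function with your own bucket name and Region, and write the result back.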

  3. Update the Amazon S3 bucket name in after-install.sh. Navigate to aws/scripts/after-install.sh. This script copies the deployment artifact from the Amazon S3 bucket to the Tomcat webapps folder.

Figure7: CodeDeploy Instruction

Remember to save all of the files and push the code to your GitHub repo.

  4. Verify that you’re in your git repository folder by running the following command:

git remote -v

You should see your remote branch address, which is similar to the following:

% git remote -v

origin git@github.com:<username>/GitActionsDeploytoAWS.git (fetch)

origin git@github.com:<username>/GitActionsDeploytoAWS.git (push)

  5. Now run the following commands to push your changes:

git add .

git commit -m "Initial commit"

git push

Set up GitHub Secrets

The GitHub Actions workflows must access resources in your AWS account. Here we are using IAM OpenID Connect identity provider and IAM role with IAM policies to access CodeDeploy and Amazon S3 bucket. OIDC lets your GitHub Actions workflows access resources in AWS without needing to store the AWS credentials as long-lived GitHub secrets.

The ARN of the IAM role is stored as a GitHub secret within your GitHub repository, under Settings > Secrets. For more information, see “GitHub Actions secrets”.
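
For readers curious about what the trust between GitHub and AWS looks like, the following Python sketch builds an IAM trust policy of the general shape used for GitHub OIDC federation. The CloudFormation template in this walkthrough creates the role for you, and its exact policy is not shown in this post, so treat this as an illustrative assumption rather than a copy of the template. The function name, account ID, and repository owner are placeholders.

```python
import json

def github_oidc_trust_policy(account_id, owner, repo):
    """Build an IAM trust policy that lets one GitHub repo assume a role."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringLike": {f"{provider}:sub": f"repo:{owner}/{repo}:*"}
                },
            }
        ],
    }

# Placeholder account ID and repository names, for illustration only.
policy = github_oidc_trust_policy("111122223333", "example-user", "GitActionsDeploytoAWS")
print(json.dumps(policy, indent=2))
```

The condition on the token’s subject claim is what restricts the role to workflows from one repository, which is why no long-lived AWS keys need to be stored in GitHub.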

  • Navigate to your GitHub repository. Select the Settings tab.
  • Select Secrets on the left menu bar.
  • Select Actions under Secrets.
  • Select New repository secret.
    • Enter the secret name as ‘IAMROLE_GITHUB’.
    • Enter the value as the ARN of GitHubIAMRole, which you copied from the CloudFormation output section.

Figure8: Adding Github Secrets

Figure9: Adding New Secret

Integrate CodeDeploy with GitHub

For CodeDeploy to be able to perform deployment steps using scripts in your repository, it must be integrated with GitHub.

CodeDeploy application and deployment group are already created for you. Please use these applications in the next step:

CodeDeploy Application = CodeDeployAppNameWithASG

Deployment group = CodeDeployGroupName

To link a GitHub account to an application in CodeDeploy, follow until step 10 from the instructions on this page.

You can cancel the process after completing step 10. You don’t need to create a deployment.

Trigger the GitHub Actions Workflow

You now have the required AWS resources and have configured GitHub to build and deploy the code to Amazon EC2 instances.

The workflow defined in GITHUBREPO/.github/workflows/deploy.yml is currently set up to be run manually.

Complete the following steps to run it manually.

Go to your GitHub repo and select the Actions tab.

Figure10: See Actions Tab

Select the Build and Deploy link, and select Run workflow as shown in the following image.

Figure11: Running Workflow Manually

After a few seconds, the workflow will be displayed. Then, select Build and Deploy.

Figure12: Observing Workflow

You will see two stages:

  1. Build and Package.
  2. Deploy.

Build and Package

The Build and Package stage builds the sample SpringBoot application, generates the war file, and then uploads it to the Amazon S3 bucket.

Figure13: Completed Workflow

You should be able to see the war file in the Amazon S3 bucket.

Figure14: Artifacts saved in S3


Deploy

In this stage, the workflow invokes the CodeDeploy service and triggers the deployment.

Figure15: Deploy With Actions

Verify the deployment

Log in to the AWS Console and navigate to the CodeDeploy console.

Select the Application name and deployment group. You will see the status as Succeeded if the deployment is successful.

Figure16: Verifying Deployment

Point your browser to the URL of the Application Load Balancer.

Note: You can get the URL from the output section of the CloudFormation stack or Amazon EC2 console Load Balancers.

Figure17: Verifying Application

Optional – Automate the deployment on Git Push

Workflow can be automated by changing the following line of code in your .github/workflows/deploy.yml file.

From:

workflow_dispatch: {}

To:

  #workflow_dispatch: {}
  push:
    branches: [ main ]

GitHub Actions interprets this as an instruction to run the workflow automatically on every push to the main branch, which includes merged pull requests.

After testing end-to-end flow manually, you can enable the automated deployment.

Clean up

To avoid incurring future charges, you should clean up the resources that you created.

  1. Empty the Amazon S3 bucket:
  2. Delete the CloudFormation stack (CodeDeployStack) from the AWS console.
  3. Delete the GitHub Secret (‘IAMROLE_GITHUB’)
    1. Go to the repository settings on GitHub Page.
    2. Select Actions under Secrets.
    3. Select IAMROLE_GITHUB, and delete it.


Conclusion

In this post, you saw how to leverage GitHub Actions and CodeDeploy to securely deploy a Java SpringBoot application to Amazon EC2 instances in an Auto Scaling group. You can add further stages to your pipeline, such as testing and security scanning.

Additionally, this solution can be used for other programming languages.

About the Authors

Mahesh Biradar is a Solutions Architect at AWS. He is a DevOps enthusiast and enjoys helping customers implement cost-effective architectures that scale.
Suresh Moolya is a Cloud Application Architect with Amazon Web Services. He works with customers to architect, design, and automate business software at scale on AWS cloud.

Save Cost and Improve Lambda Application Performance with Proactive Insights from Amazon DevOps Guru

Post Syndicated from Venkata Moparthi original https://aws.amazon.com/blogs/devops/save-cost-and-improve-lambda-application-performance-with-proactive-insights-from-amazon-devops-guru/

AWS customers, regardless of size and market segment, constantly seek to improve application performance while reducing operational costs. Today, Amazon DevOps Guru generates proactive insights that enable you to reduce the cost and improve the performance of your AWS Lambda application. By proactively analyzing your application and making these cost-saving and/or performance-improving recommendations, DevOps Guru frees up your operations team to focus on other value-adding activities.

DevOps Guru is a machine learning (ML)-powered service that helps you effectively monitor your application by ingesting application metrics, learning your application’s behavior over time, and then detecting operational anomalies. Once an anomaly is detected, DevOps Guru generates insights that include specific recommendations of how to fix the underlying problem.

To make sure that AWS customers remain ahead of potential issues, DevOps Guru detects some application issues proactively and provides recommendations that let customers correct them before customer-impacting events actually occur. These Proactive Insights are created by analyzing operational data and application metrics with ML algorithms that can identify early signals that are linked with future operational issues.

In this post, we’ll review a scenario in which the provisioned concurrency capacity for a Lambda function was set too low. This put the customer at risk of dropped requests (throttling), which degrade application performance and deliver poor user experience during traffic spikes.


In the scenario under review, we have an account with DevOps Guru set up to monitor a Lambda-based application stack. Enabling DevOps Guru and setting it up to monitor a Lambda function is straightforward, and you can refer to this post to see how this is done. For the Lambda function in this account, we have set the provisioned concurrency too low. This Lambda documentation page covers how to estimate the appropriate concurrency levels for your function.

Architecture Overview

The reference architecture for our scenario can be seen in the following image.

In this simple serverless architecture, the Lambda-based application vends the metrics to Amazon CloudWatch. Then, DevOps Guru ingests the metrics from CloudWatch for analysis.

Architecture diagram explained in post.

By default, DevOps Guru ingests vended metrics via CloudWatch at no cost to customers.


The first time that you enable and configure DevOps Guru to monitor resources, it starts baselining your resources to determine your application’s normal behavior. Unlike rule-based alarming systems, DevOps Guru utilizes dynamic thresholds that are controlled by ML algorithms and calibrated to the specifics of your application to reduce noise. For a simple serverless stack, baselining can be completed in two hours. However, in a production environment, baselining can take up to 24 hours depending on the number of resources being monitored. After initial baselining, analysis becomes continuous and baselining is no longer required.
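To illustrate the difference between a fixed alarm rule and a threshold derived from a metric's own history, here is a minimal Python sketch. It is not DevOps Guru's algorithm, which the post does not describe. The mean-plus-k-standard-deviations rule, the function names, and the sample values are all assumptions chosen only to show the idea.

```python
from statistics import mean, pstdev

def dynamic_threshold(history, k=3.0):
    """Upper bound derived from the metric's own recent behavior.

    Illustrative rule only (mean + k * standard deviation); a real
    ML-driven service would use something far more sophisticated.
    """
    return mean(history) + k * pstdev(history)

def is_anomalous(value, history, k=3.0):
    """True when the value exceeds the bound learned from the history."""
    return value > dynamic_threshold(history, k)

# Hypothetical baseline samples for one metric.
baseline = [10, 12, 11, 9, 10, 11, 10, 12]

# A fixed rule written without knowledge of this application.
STATIC_LIMIT = 100

print("dynamic rule flags 30:", is_anomalous(30, baseline))
print("static rule flags 30:", 30 > STATIC_LIMIT)
```

With these sample values, a reading of 30 stands out against the baseline even though it is far below the fixed limit, which is the kind of noise-versus-signal trade-off that calibration to the application is meant to address.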

Proactive Insight Generation

Once baselining is complete, DevOps Guru analyzes the baselined operational data and generates insights where it detects issues. These insights can be found on the Insights page of the DevOps Guru console. To view the available insights, navigate to Insights, and select Proactive Insights or Reactive Insights. In this scenario, we’re reviewing a Proactive Insight.

DevOps Guru Insights page. Four proactive insights with a status of ongoing.

On this tab, note that the LambdaAuthorizer -1HQG1OD function has a concurrency spillover invocation. For a given Lambda function, concurrency spillover is invoked when the number of concurrent requests reaches the provisioned concurrency limit. When this occurs, Lambda either begins to run on unreserved concurrency (leading to cold starts) or rejects additional incoming requests, depending on your function scaling configuration.

By selecting the relevant insight from the list, we open the insight detail page. The insight overview card provides an overview of the insight, with high-level information such as insight description, severity, status, and the number of affected applications as shown in the following screenshot.

Insight detail page. Shows insight overview, previously explained in post.

The metrics card presents a graph plotted against time. In this case, provisioned concurrency invocation, which toggles from 0 to 1 when concurrency spillover occurs, was triggered because the Lambda function received more concurrent requests than were provisioned for.

Metric card with graph plotted against time.
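The relationship between concurrent requests, the provisioned limit, and the 0/1 indicator described above can be sketched in a few lines of Python. The function name and the sample numbers are hypothetical, and the comparison uses "more than provisioned" as the trigger, following the description of the metrics card.

```python
def spillover(concurrent_requests, provisioned_concurrency):
    """Return (excess_requests, indicator) for one sample.

    excess_requests: how many concurrent requests exceed the provisioned capacity.
    indicator: 1 when any spillover occurred, otherwise 0.
    """
    excess = max(0, concurrent_requests - provisioned_concurrency)
    return excess, 1 if excess > 0 else 0

# Hypothetical samples against a provisioned concurrency of 100.
for sample in (80, 100, 130):
    print(sample, spillover(sample, 100))
```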

The relevant events card is useful in situations where more than one application is affected, or when the initial event triggers additional events. This card plots all of the events from different related applications on a time axis. Therefore, we can pinpoint which event triggered the chain of events.

Relevant events card, previously explained in post.


The recommendation section of the insight page provides specific and actionable guidance on what actions customers should take to fix the underlying cause of the issue. In this case, DevOps Guru recommends that the customer set the provisioned concurrency to 264 to keep the utilization balanced at 65%. Providing such specific guidance takes away any ambiguity and significantly reduces troubleshooting time.

Recommendations section previously explained in post.
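The arithmetic behind a recommendation of this shape can be sketched as follows. The post does not say how DevOps Guru derives its number, so this is only an assumption: divide the observed peak concurrency by the target utilization and round up. The peak value of 171 is hypothetical and was picked because it reproduces the figure of 264 at 65% utilization.

```python
import math

def recommended_provisioned_concurrency(peak_concurrency, target_utilization=0.65):
    """Smallest provisioned concurrency that keeps peak load at or below the target utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(peak_concurrency / target_utilization)

# Hypothetical observed peak of 171 concurrent executions.
print(recommended_provisioned_concurrency(171))
```

Rounding up rather than to the nearest integer keeps the resulting utilization at or under the target instead of slightly above it.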

Other Lambda-related Proactive Insights

While this scenario alerts customers to an issue that impacts application performance, DevOps Guru also provides alerts for cost-optimization issues. Some additional cost and performance-related issues that DevOps Guru identifies include:

  • Lambda Provisioned with No Autoscaling, which is triggered when autoscaling isn’t enabled, thereby putting the application at risk of degraded performance when requests are throttled during a traffic spike.
  • Low Lambda Provision Concurrency Utilization, which is triggered when provisioned concurrency is consistently higher than required, driving unnecessary cloud spend.
  • Over-provisioned Amazon DynamoDB Stream Shards, which is triggered when the number of provisioned Amazon DynamoDB stream shards is consistently higher than required, driving unnecessary cloud spend.

DevOps Guru continues to expand its library of proactive insight use cases to deliver cost and performance improvements continuously to AWS customers.


As seen in the example above, DevOps Guru can proactively detect issues with your Lambda applications, tie these issues to related events, and provide precise remedial actions using its pre-trained ML models. As a customer, you can start leveraging these capabilities to improve the performance of your Lambda applications by simply enabling DevOps Guru—a process that requires minimal configuration and no previous ML expertise.

Start using DevOps Guru to monitor your Lambda Applications today!

About the authors

Mohit Gadkari

Mohit Gadkari is a Solutions Architect at Amazon Web Services (AWS) supporting SMB customers. He has been professionally using AWS since 2015 specializing in DevOps and Cloud Security and currently he is using this experience to help customers navigate the cloud.

Venkata Moparthi

Venkata Moparthi is a Cloud Infrastructure Architect at Amazon Web Services. He helps customers on their cloud adoption journey. He is passionate about technology and enjoys collaborating with customers architecting and implementing highly scalable and secure solutions.

Detecting security issues in logging with Amazon CodeGuru Reviewer

Post Syndicated from Brian Farnhill original https://aws.amazon.com/blogs/devops/detecting-security-issues-in-logging-with-amazon-codeguru-reviewer/

Amazon CodeGuru is a developer tool that provides intelligent recommendations for identifying security risks in code and improving code quality. To help you find potential issues related to logging of inputs that haven’t been sanitized, Amazon CodeGuru Reviewer now includes additional checks for both Python and Java. In this post, we discuss these updates and show examples of code that relate to these new detectors.

In December 2021, an issue was discovered relating to Apache’s popular Log4j Java-based logging utility (CVE-2021-44228). There are several resources available to help mitigate this issue (some of which are highlighted in a post on the AWS Public Sector blog). This issue has drawn attention to the importance of logging inputs in a way that is safe. To help developers understand where un-sanitized values are being logged, CodeGuru Reviewer can now generate findings that highlight these and make it easier to remediate them.

The new detectors and recommendations in CodeGuru Reviewer can detect findings in Java where Log4j is used, and in Python where the standard logging module is used. The following examples demonstrate how this works and what the recommendations look like.

Findings in Java

Consider the following Java sample that responds to a web request.

public ModelAndView handleRequest(HttpServletRequest request, HttpServletResponse response) {
    ModelAndView result = new ModelAndView("success");
    String userId = request.getParameter("userId");
    result.addObject("userId", userId);

    // More logic to populate `result`.
    // Unsafe: userId comes straight from the request and is logged unsanitized.
    log.info("Successfully processed {} with user ID: {}.", request.getRequestURL(), userId);
    return result;
}

This simple example generates a result to the initial request, and it extracts the userId field from the initial request to do this. Before returning the result, the userId field is passed to the log.info statement. This presents a potential security issue, because the value of userId is not sanitized or changed in any way before it is logged. CodeGuru Reviewer is able to identify that the variable userId points to a value that needs to be sanitized before it is logged, as it comes from an HTTP request. All user inputs in a request (including query parameters, headers, body and cookie values) should be checked before logging to ensure a malicious user hasn’t passed values that could compromise your logging mechanism.

CodeGuru Reviewer recommends sanitizing user-provided inputs before logging them to ensure log integrity. Let’s take a look at CodeGuru Reviewer’s findings for this issue.

A screenshot of the AWS Console that describes the log injection risk found by CodeGuru Reviewer

An option to remediate this risk would be to add a sanitize() method that checks and modifies the value to remove known risks. The specific process of doing this will vary based on the values you expect and what is safe for your application and its processes. By logging the now sanitized value, you have mitigated those risks that could impact on your logging framework. The modified code sample below shows one example of how this could be addressed.

public ModelAndView handleRequestSafely(HttpServletRequest request, HttpServletResponse response) {
    ModelAndView result = new ModelAndView("success");
    String userId = request.getParameter("userId");
    String sanitizedUserId = sanitize(userId);
    result.addObject("userId", sanitizedUserId);

    // More logic to populate `result`.
    log.info("Successfully processed {} with user ID: {}.", request.getRequestURL(), sanitizedUserId);
    return result;
}

// Keeps only digit characters; a missing parameter becomes an empty string.
private static String sanitize(String userId) {
    if (userId == null) {
        return "";
    }
    return userId.replaceAll("\\D", "");
}

The example now uses the sanitize() method, which uses a replaceAll() call that uses a regular expression to remove all non-digit characters. This example assumes the userId value should only be digit characters, ensuring that any other characters that could be used to expose a vulnerability in the logging framework are removed first.

Findings in Python

Now consider the following Python code from a sample Flask project that handles a web request.

from flask import app, current_app, request

def getUserInput():
    input = request.args.get('input')
    current_app.logger.info("User input: %s", input)

    # More logic to process user input.

In this example, the input variable is assigned the input query string value from a web request. Then, the Flask logger records its value as an info level message. This has the same challenge as the Java example above. However this time rather than changing the value, we can instead inspect it and choose to log it only when it is in a format we expect. A simple example of this could be where we expect only alphanumeric characters in the input variable. The isalnum() function can act as a simple test in this case. Here is an example of what this style of validation could look like.

from flask import app, current_app, request

def safe_getUserInput():
    input = request.args.get('input')
    # Log the value only when it is present and purely alphanumeric.
    if input is not None and input.isalnum():
        current_app.logger.info("User input: %s", input)
    else:
        current_app.logger.warning("Unexpected input detected")
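When rejecting non-alphanumeric input is too strict, another option is to neutralize the characters that make log forging possible, such as carriage returns and line feeds. The following is a minimal sketch using only the Python standard library. The helper name sanitize_for_log, the 200-character limit, and the logger name are assumptions for illustration, and whether this is sufficient depends on where your logs are consumed.

```python
import logging
import re

# ASCII control characters, including CR and LF, which enable forged log lines.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

def sanitize_for_log(value, max_length=200):
    """Return a single-line, length-limited representation of a user-supplied value."""
    if value is None:
        return "<none>"
    text = str(value)[:max_length]
    # Replace each control character with its visible escape sequence, e.g. "\n" -> "\\n".
    return _CONTROL_CHARS.sub(lambda m: repr(m.group())[1:-1], text)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")  # hypothetical logger name

# An attacker tries to inject a second, fake log entry.
malicious = "42\nINFO admin logged in"
logger.info("User input: %s", sanitize_for_log(malicious))
```

The forged entry stays on the same line as the original message, with the newline shown as a literal escape, so it cannot masquerade as a separate log record.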

Getting started

While log sanitization implementation is a long journey for many, it is a guardrail for maintaining your application’s log integrity. With CodeGuru Reviewer detecting log inputs that are neither sanitized nor validated, developers can use these recommendations as a guide to reduce risks related to log injection attacks. Additionally, you can provide feedback on recommendations in the CodeGuru Reviewer console or by commenting on the code in a pull request. This feedback helps improve the precision of CodeGuru Reviewer, so the recommendations you see get better over time.

To get started with CodeGuru Reviewer, you can leverage AWS Free Tier without any cost. For 90 days, you can review up to 100K lines of code in onboarded repositories per AWS account. For more information, please review the pricing page.

About the authors

Brian Farnhill

Brian Farnhill is a Software Development Engineer in the Australian Public Sector team. His background is in building solutions and helping customers improve DevOps tools and processes. When he isn’t working, you’ll find him either coding for fun or playing online games.

Jia Qin

Jia Qin is part of the Solutions Architect team in Malaysia. She loves developing on AWS, trying out new technology, and sharing her knowledge with customers. Outside of work, she enjoys taking walks and petting cats.

Streamlining evidence collection with AWS Audit Manager

Post Syndicated from Nicholas Parks original https://aws.amazon.com/blogs/security/streamlining-evidence-collection-with-aws-audit-manager/

In this post, we will show you how to deploy a solution into your Amazon Web Services (AWS) account that enables you to simply attach manual evidence to controls using AWS Audit Manager. Making evidence-collection as seamless as possible minimizes audit fatigue and helps you maintain a strong compliance posture.

As an AWS customer, you can use APIs to deliver high quality software at a rapid pace. If you have compliance-focused teams that rely on manual, ticket-based processes, you might find it difficult to document audit changes as those changes increase in velocity and volume.

As your organization works to meet audit and regulatory obligations, you can save time by incorporating audit compliance processes into a DevOps model. You can use modern services like Audit Manager to make this easier. Audit Manager automates evidence collection and generates reports, which helps reduce manual auditing efforts and enables you to scale your cloud auditing capabilities along with your business.

AWS Audit Manager uses services such as AWS Security Hub, AWS Config, and AWS CloudTrail to automatically collect and organize evidence, such as resource configuration snapshots, user activity, and compliance check results. However, for controls represented in your software or processes without an AWS service-specific metric to gather, you need to manually create and provide documentation as evidence to demonstrate that you have established organizational processes to maintain compliance. The solution in this blog post streamlines these types of activities.

Solution architecture

This solution creates an HTTPS API endpoint, which allows integration with other software development lifecycle (SDLC) solutions, IT service management (ITSM) products, and clinical trial management systems (CTMS) solutions that capture trial process change amendment documentation (in the case of pharmaceutical companies who use AWS to build robust pharmacovigilance solutions). The endpoint can also be a backend microservice to an application that allows contract research organizations (CRO) investigators to add their compliance supporting documentation.

In this solution’s current form, you can submit an evidence file payload along with the assessment and control details to the API, and this solution will tie all the information together for the audit report. This post and solution are directed towards engineering teams who are looking for a way to accelerate evidence collection. To maximize the effectiveness of this solution, your engineering team will also need to collaborate with cross-functional groups, such as audit and business stakeholders, to design a process and service that constructs and sends the message(s) to the API and to scale out usage across the organization.

To download the code for this solution, and the configuration that enables you to set up auto-ingestion of manual evidence, see the aws-audit-manager-manual-evidence-automation GitHub repository.

Architecture overview

In this solution, you use AWS Serverless Application Model (AWS SAM) templates to build the solution and deploy to your AWS account. See Figure 1 for an illustration of the high-level architecture.

Figure 1. The architecture of the AWS Audit Manager automation solution

The SAM template creates resources that support the following workflow:

  1. A client can call an Amazon API Gateway endpoint by sending a payload that includes assessment details and the evidence payload.
  2. An AWS Lambda function implements the API to handle the request.
  3. The Lambda function uploads the evidence to an Amazon Simple Storage Service (Amazon S3) bucket (3a) and uses AWS Key Management Service (AWS KMS) to encrypt the data (3b).
  4. The Lambda function also initializes the AWS Step Functions workflow.
  5. Within the Step Functions workflow, a Standard Workflow calls two Lambda functions. The first looks for a matching control within an assessment, and the second updates the control within the assessment with the evidence.
  6. When the Step Functions workflow concludes, it sends a notification for success or failure to subscribers of an Amazon Simple Notification Service (Amazon SNS) topic.
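Steps 3 and 4 of the workflow above can be sketched in Python as follows. This is not the code from the GitHub repository. The function name, the S3 key layout, the shape of the state machine input, and the 202 response are assumptions. The two client calls, put_object with ServerSideEncryption and start_execution, follow the boto3 S3 and Step Functions client APIs, and the clients are passed in as arguments so the logic can be exercised without AWS access.

```python
import json
import uuid

def handle_evidence_request(form, file_name, file_bytes, s3, sfn, bucket, state_machine_arn):
    """Store the evidence file, then start the Step Functions workflow.

    `s3` and `sfn` are expected to behave like boto3 clients for Amazon S3
    and AWS Step Functions. All other names here are hypothetical.
    """
    key = f"evidence/{uuid.uuid4()}/{file_name}"

    # Steps 3a and 3b: upload the evidence, encrypted at rest with AWS KMS.
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=file_bytes,
        ServerSideEncryption="aws:kms",
    )

    # Step 4: hand the assessment details and evidence location to the workflow.
    execution_input = {
        "AssessmentName": form["AssessmentName"],
        "ControlSetName": form["ControlSetName"],
        "ControlIdName": form["ControlIdName"],
        "EvidenceS3Uri": f"s3://{bucket}/{key}",
    }
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(execution_input),
    )
    return {
        "statusCode": 202,
        "body": json.dumps({"executionArn": response["executionArn"]}),
    }
```

In a deployed function, you would pass boto3.client("s3") and boto3.client("stepfunctions") for the two client arguments. Injecting them keeps the handler easy to test with stand-in objects.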

Deploy the solution

The project available in the aws-audit-manager-manual-evidence-automation GitHub repository contains source code and supporting files for a serverless application you can deploy with the AWS SAM command line interface (CLI). It includes the following files and folders:

src | Code for the application’s Lambda implementation of the Step Functions workflow. It also includes a Step Functions definition file.
template.yml | A template that defines the application’s AWS resources.

Resources for this project are defined in the template.yml file. You can update the template to add AWS resources through the same deployment process that updates your application code.


This solution assumes the following:

  1. AWS Audit Manager is enabled.
  2. You have already created an assessment in AWS Audit Manager.
  3. You have the necessary tools to use the AWS SAM CLI (see details in the table that follows).

For more information about setting up Audit Manager and selecting a framework, see Getting started with Audit Manager in the blog post AWS Audit Manager Simplifies Audit Preparation.

The AWS SAM CLI is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. The AWS SAM CLI uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application’s build environment and API.

To use the AWS SAM CLI, you need the following tools:

Node.js | Install Node.js 14, including the npm package management tool
Docker | Install Docker community edition

To deploy the solution

  1. Open your terminal and use the following command to create a folder to clone the project into, then navigate to that folder. Be sure to replace <FolderName> with your own value.

    mkdir Desktop/<FolderName> && cd $_

  2. Clone the project into the folder you just created by using the following command.

    git clone https://github.com/aws-samples/aws-audit-manager-manual-evidence-automation.git

  3. Navigate into the newly created project folder by using the following command.

    cd aws-audit-manager-manual-evidence-automation

  4. In the AWS SAM shell, use the following command to build the source of your application.

    sam build

  5. In the AWS SAM shell, use the following command to package and deploy your application to AWS. Be sure to replace <DOC-EXAMPLE-BUCKET> with your own unique S3 bucket name.

    sam deploy --guided --parameter-overrides paramBucketName=<DOC-EXAMPLE-BUCKET>

  6. When prompted, enter the AWS Region where AWS Audit Manager was configured. For the rest of the prompts, leave the default values.
  7. To activate the IAM authentication feature for API gateway, override the default value by using the following command.


To test the deployed solution

After you deploy the solution, run an invocation like the one below for an assessment (using curl). Be sure to replace <YOURAPIENDPOINT> and <AWS REGION> with your own values.

curl --location --request POST \
'https://<YOURAPIENDPOINT>.execute-api.<AWS REGION>.amazonaws.com/Prod' \
--header 'x-api-key: ' \
--form 'payload=@"<PATH TO FILE>"' \
--form 'AssessmentName="GxP21cfr11"' \
--form 'ControlSetName="General requirements"' \
--form 'ControlIdName="11.100(a)"'

Check to see that your file is correctly attached to the control for your assessment.
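If you prefer to call the endpoint from code instead of curl, the same multipart form request can be sketched with the Python standard library. The helper names are hypothetical, the file field name payload is an assumption (check the repository for the exact name the API expects), and the endpoint URL and API key remain placeholders that you supply.

```python
import urllib.request
import uuid

def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields and one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"

def post_evidence(url, api_key, fields, filename, file_bytes):
    """Send the evidence to the API endpoint and return the HTTP status code."""
    body, content_type = build_multipart(fields, "payload", filename, file_bytes)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call would look like post_evidence(url, api_key, {"AssessmentName": "GxP21cfr11", "ControlSetName": "General requirements", "ControlIdName": "11.100(a)"}, "evidence.pdf", data), where url, api_key, and data are your own values.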

Form-data interface parameters

The API implements a form-data interface that expects four parameters:

  1. AssessmentName: The name for the assessment in Audit Manager. In this example, the AssessmentName is GxP21cfr11.
  2. ControlSetName: The display name for a control set within an assessment. In this example, the ControlSetName is General requirements.
  3. ControlIdName: A particular control within a control set. In this example, the ControlIdName is 11.100(a).
  4. Payload: The file representing the evidence to be uploaded.

As a refresher of Audit Manager concepts, evidence is collected for a particular control. Controls are grouped into control sets. Control sets can be grouped into a particular framework. The assessment is considered an implementation, or an instance, of the framework. For more information, see AWS Audit Manager concepts and terminology.
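The hierarchy described above, and the lookup that the workflow's first Lambda function performs, can be modeled with a few Python dataclasses. This is a simplified, hypothetical model for illustration. It does not mirror the Audit Manager API response format, and the exact-match lookup rule and the framework name used in the example are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Control:
    name: str
    evidence: List[str] = field(default_factory=list)  # e.g. S3 URIs of evidence files

@dataclass
class ControlSet:
    name: str
    controls: List[Control] = field(default_factory=list)

@dataclass
class Assessment:
    """An instance of a framework: control sets, each grouping controls."""
    name: str
    framework: str
    control_sets: List[ControlSet] = field(default_factory=list)

def find_control(assessment: Assessment, control_set_name: str, control_id_name: str) -> Optional[Control]:
    """Return the matching control, or None when the set or control is not found."""
    for control_set in assessment.control_sets:
        if control_set.name != control_set_name:
            continue
        for control in control_set.controls:
            if control.name == control_id_name:
                return control
    return None

# Values taken from the example request; the framework name is a placeholder.
assessment = Assessment(
    name="GxP21cfr11",
    framework="example-framework",
    control_sets=[ControlSet("General requirements", [Control("11.100(a)")])],
)
control = find_control(assessment, "General requirements", "11.100(a)")
if control is not None:
    control.evidence.append("s3://<DOC-EXAMPLE-BUCKET>/evidence.pdf")
```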

To clean up the deployed solution

To clean up the solution, use the following commands to delete the AWS CloudFormation stack and your S3 bucket. Be sure to replace <YourStackId> and <DOC-EXAMPLE-BUCKET> with your own values.

aws cloudformation delete-stack --stack-name <YourStackId>
aws s3 rb s3://<DOC-EXAMPLE-BUCKET> --force


This solution provides a way to allow for better coordination between your software delivery organization and compliance professionals. This allows your organization to continuously deliver new updates without overwhelming your security professionals with manual audit review tasks.

Next steps

There are various ways to extend this solution.

  1. Update the API Lambda implementation to be a webhook for your favorite software development lifecycle (SDLC) or IT service management (ITSM) solution.
  2. Modify the steps within the Step Functions state machine to more closely match your unique compliance processes.
  3. Use AWS CodePipeline to start Step Functions state machines natively, or integrate a variation of this solution with any continuous compliance workflow that you have.

Learn more about AWS Audit Manager, DevOps, and AWS for Health and start building!

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Nicholas Parks

Nicholas has been using AWS since 2010 across various enterprise verticals including healthcare, life sciences, financial, retail, and telecommunications. Nicholas focuses on modernizations in pursuit of new revenue as well as application migrations. He specializes in Lean, DevOps cultural change, and Continuous Delivery.

Brian Tang

Brian Tang is an AWS Solutions Architect based out of Boston, MA. He has 10 years of experience helping enterprise customers across a wide range of industries complete digital transformations by migrating business-critical workloads to the cloud. His core interests include DevOps and serverless-based solutions. Outside of work, he loves rock climbing and playing guitar.

Building Blue/Green application deployment to Micro Focus Enterprise Server

Post Syndicated from Kevin Yung original https://aws.amazon.com/blogs/devops/building-blue-green-application-deployment-to-micro-focus-enterprise-server/

Organizations running mainframe production workloads often follow the traditional approach of application deployment. To release new features of existing applications into production, the application is redeployed using the new version of software on the existing infrastructure. This poses the following challenges:

  • The cutover of the application deployment from testing to production usually takes place during a planned outage window with associated downtime.
  • Rollback is difficult, since the earlier version of the software must be redeployed from scratch on the existing infrastructure. This may result in applications being unavailable for longer durations owing to the rollback.
  • Due to differences in testing and production environments, some defects may leak into production, affecting the application code quality and thus increasing the number of production outages.

Automated, robust application deployment is recognized as a prime driver for moving from a Mainframe to AWS, as service stability, security, and quality can be better managed. In this post, you will learn how to build Blue/Green (zero-downtime) deployments for mainframe applications rehosted to Micro Focus Enterprise Server with AWS Developer Tools (AWS CodeBuild, CodePipeline, and CodeDeploy).

This is a continuation of our previous post “Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite”. In our last post, we explained how you can implement a pattern for continuous integration and testing of mainframe applications with AWS Developer tools and Micro Focus Enterprise Suite. If you haven’t already checked it out, then we strongly recommend that you read through it before proceeding to the rest of this post.

Overview of solution

In this section, we explain the three important design “ingredients” to be implemented in the overall solution:

  1. Implementation of Enterprise Server Performance and Availability Cluster (PAC)
  2. End-to-end design of a CI/CD pipeline for multi-team development
  3. Blue/green deployment process for a rehosted mainframe application

First, let’s look at the solution design for the Micro Focus Enterprise Server PAC cluster.

Overview of Micro Focus Enterprise Server Performance and Availability Cluster (PAC)

In the Blue/Green deployment solution, Micro Focus Enterprise Server is the hosting environment for mainframe applications with the software installed into Amazon EC2 instances. Application deployment in Amazon EC2 Auto Scaling is one of the critical requirements to build a Blue/Green deployment. Micro Focus Enterprise Server PAC technology is the feature that allows for the Auto Scaling of Enterprise Server instances. For details on how to build Micro Focus Enterprise PAC Cluster with Amazon EC2 Auto Scaling and Systems Manager, see our AWS Prescriptive Guidance document. An overview of the infrastructure architecture is shown in the following figure, and the following table explains the components in the architecture.

Infrastructure architecture overview for blue/green application deployment to Micro Focus Enterprise Server

Components | Description
Micro Focus Enterprise Servers | Deploy applications to the Micro Focus Enterprise Server PAC in an Amazon EC2 Auto Scaling group.
Micro Focus Enterprise Server Common Web Administration (ESCWA) | Manage the Micro Focus Enterprise Server PAC with the ESCWA server, e.g., adding or removing an Enterprise Server to/from a PAC.
Relational Database for both user and system data files | Set up an Amazon Aurora RDS instance in Multi-AZ to host both user and system data files to be shared across the Enterprise Server instances.
Micro Focus Enterprise Server Scale-Out Repository (SOR) | Set up an Amazon ElastiCache Redis instance and replicas in Multi-AZ to host user data.
Application endpoint and load balancer | Set up a Network Load Balancer to provide a hostname for end users to connect to the application, e.g., accessing the application through a 3270 emulator.

CI/CD Pipelines design supporting multi-streams of mainframe development

In a previous DevOps post, Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite, we introduced two levels of pipelines. The first level of pipeline is used by mainframe project teams to test project scope changes. The second level of the pipeline is used for system integration tests, where the pipeline will perform tests for all of the promoted changes from the project pipelines and perform extensive systems tests.

In this post, we extend the two-level pipeline by adding a production deployment pipeline. When system testing is complete and successful, the tested application artifacts are promoted to the production pipeline in preparation for live production release. The following figure depicts each stage of the three-level CI/CD pipeline and the purpose of each stage.

Different levels of CI/CD pipeline - Project Team Pipeline, Systems Test Pipeline and Production Deployment Pipeline

Let’s look at the artifact promotion to the production pipeline in greater detail. The Systems Test Pipeline promotes the tested artifacts in binary format into an Amazon S3 bucket, and the S3 event triggers the production pipeline to kick off. This artifact promotion process can be gated using a manual approval action in CodePipeline. For customers who want a fully automated continuous deployment, the manual promotion approval step can be removed.

The following diagram shows the stages of the production deployment pipeline in AWS CodePipeline:

Stages in production deployment pipeline using AWS CodePipeline

After the production pipeline is kicked off, it downloads the new version artifact from the S3 bucket. For details on how to set up the S3 bucket as a CodePipeline source, see the AWS CodePipeline documentation on using S3 as a source.

In the following section, we explain each of these pipeline stages in detail:

  1. Prepare and package a new version of the production configuration artifacts, for example, the Micro Focus Enterprise Server config file and the blue/green deployment scripts.
  2. Use the CodeBuild project to kick off an application blue/green deployment with AWS CodeDeploy.
  3. Use a manual approval gate to wait for an operator to validate the new version of the application and approve the production traffic switch.
  4. Continue the blue/green deployment by allowing traffic to the new version of the application and blocking traffic to the old version.
  5. After a successful blue/green switch and deployment, tag the production version in the code repository.

Now that you’ve seen the pipeline design, we will dive deep into the details of the blue/green deployment with AWS CodeDeploy.

Blue/green deployment with AWS CodeDeploy

In the blue/green deployment, we used the technique of swapping Auto Scaling groups behind an Elastic Load Balancer. Refer to the AWS Blue/Green deployment whitepaper for the details of the technique. Because AWS CodeDeploy is a fully managed service that automates software deployment, it is used to automate the entire blue/green process.

First, the following best practices are applied to set up the Enterprise Server’s infrastructure:

  1. AWS Image Builder is used to install the Micro Focus Enterprise Server software and the AWS CodeDeploy agent into an Amazon Machine Image (AMI). An EC2 launch template is then created with the Enterprise Server AMI ID.
  2. A Network Load Balancer is used to set up a TCP connection health check to validate that Micro Focus Enterprise Server is listening on the required ports, e.g., port 9270, so that connectivity is available for 3270 emulators.
  3. A script was created to confirm application deployment validity in each EC2 instance. This is achieved by using a PowerShell script that triggers a CICS transaction from the Micro Focus Enterprise Server command line interface.

In the CodePipeline, we created a CodeBuild project to create a new deployment with CodeDeploy. We will go into the details of the CodeBuild buildspec.yaml configuration.

In the pre_build section of buildspec.yaml, CodeBuild performs two steps:

  1. Create an initial Amazon EC2 Auto Scaling group using the Micro Focus Enterprise Server AMI and a launch template for the first-time deployment of the application.
  2. Use the AWS CLI to store the initial Auto Scaling group name in Systems Manager Parameter Store. CodeDeploy later uses this group to create a copy during the blue/green deployment.
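
The hand-off of the Auto Scaling group name between the pre-build and build stages can be sketched in Python. This is a minimal illustration and not the post's buildspec: the parameter name `/bankdemo/asg-name` is hypothetical, and `ssm` is assumed to be a boto3 Systems Manager client (for example, `boto3.client("ssm")`) supplied by the caller.

```python
# Hypothetical parameter name; the post does not state the one it uses.
ASG_PARAMETER = "/bankdemo/asg-name"


def save_asg_name(ssm, asg_name, parameter=ASG_PARAMETER):
    """Record the initial Auto Scaling group name in Parameter Store."""
    ssm.put_parameter(
        Name=parameter,
        Value=asg_name,
        Type="String",
        Overwrite=True,  # later runs replace the stored name
    )


def load_asg_name(ssm, parameter=ASG_PARAMETER):
    """Read the Auto Scaling group name back during the build stage."""
    response = ssm.get_parameter(Name=parameter)
    return response["Parameter"]["Value"]
```

Passing the client in as an argument keeps the two helpers easy to exercise with a stub before pointing them at a real account.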

In the build stage, the buildspec will perform the following steps:

  1. Retrieve the Auto Scaling Group name of the Enterprise Servers from the Systems Manager Parameter Store.
  2. Then, a blue/green deployment configuration is created for the deployment group of the application. In the AWS CLI command, we use the WITH_TRAFFIC_CONTROL option to let us manually verify and approve before switching the traffic to the new version of the application. The command snippet is shown here.
        ",deploymentReadyOption={actionOnTimeout=STOP_DEPLOYMENT,waitTimeInMinutes=600}" \


/usr/local/bin/aws deploy update-deployment-group \
    --application-name "${APPLICATION_NAME}" \
    --current-deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
    --auto-scaling-groups "${AsgName}" \
    --load-balancer-info targetGroupInfoList=[{name="${TARGET_GROUP_NAME}"}] \
    --deployment-style "deploymentType=$DeployType" \
    --blue-green-deployment-configuration "$BlueGreenConf"
  3. Next, the new version of the application binary is released from the CodeBuild source DemoBin into the production S3 bucket.
release="bankdemo-$(date '+%Y-%m-%d-%H-%M').tar.gz"

/usr/local/bin/aws deploy push \
    --application-name ${APPLICATION_NAME} \
    --description "version - $(date '+%Y-%m-%d %H:%M')" \
    --s3-location ${RELEASE_FILE} \
    --source ${CODEBUILD_SRC_DIR_DemoBin}/
  4. Create a new deployment for the application to initiate the blue/green switch.
/usr/local/bin/aws deploy create-deployment \
    --application-name ${APPLICATION_NAME} \
    --s3-location bucket=${PRODUCTION_BUCKET},key=${release},bundleType=zip \
    --deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
    --description "Bankdemo Production Deployment ${release}" \
    --query deploymentId \
    --output text
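
For readers who drive CodeDeploy from Python rather than the AWS CLI, the blue/green settings map onto two dictionaries. This is a sketch under assumptions: the 600-minute wait and STOP_DEPLOYMENT come from the snippet above and the copy of the Auto Scaling group from the pre-build description, while terminating the blue fleet after 60 minutes is an illustrative choice that the post does not confirm.

```python
def blue_green_settings(wait_minutes=600, terminate_after_minutes=60):
    """Build the deployment style and blue/green configuration dictionaries."""
    deployment_style = {
        "deploymentType": "BLUE_GREEN",
        # Traffic is rerouted through the load balancer's target group.
        "deploymentOption": "WITH_TRAFFIC_CONTROL",
    }
    blue_green_configuration = {
        "deploymentReadyOption": {
            # Stop the deployment if nobody approves the switch in time.
            "actionOnTimeout": "STOP_DEPLOYMENT",
            "waitTimeInMinutes": wait_minutes,
        },
        "greenFleetProvisioningOption": {"action": "COPY_AUTO_SCALING_GROUP"},
        "terminateBlueInstancesOnDeploymentSuccess": {
            "action": "TERMINATE",  # assumption, for illustration only
            "terminationWaitTimeInMinutes": terminate_after_minutes,
        },
    }
    return deployment_style, blue_green_configuration
```

The two return values correspond to the `deploymentStyle` and `blueGreenDeploymentConfiguration` arguments of the boto3 CodeDeploy client's `update_deployment_group` call.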

After setting up the deployment options, the following is a snapshot of a deployment configuration from the AWS Management Console.

Snapshot of deployment configuration from AWS Management Console

In the AWS Post “Under the Hood: AWS CodeDeploy and Auto Scaling Integration”, we explain how AWS CodeDeploy sets up Auto Scaling lifecycle hooks to listen for Auto Scaling events. In the event of an EC2 instance launch and termination, AWS CodeDeploy can instruct its agent in the instance to run the prepared scripts.

In the following table, we list each stage in a blue/green deployment and the tasks that ran.

Hooks and their tasks:

BeforeInstall
  • Create application folder structures in the newly launched Amazon EC2 instance and prepare for installation
AfterInstall
  • Enable Windows Firewall Rule for application traffic
  • Activate Micro Focus License using License Server
  • Prepare Production Database Connections
  • Import config to create Region in Micro Focus Enterprise Server
  • Deploy the latest application binaries into each of the Micro Focus Enterprise Servers
ApplicationStart
  • Use AWS CLI to start a Systems Manager Automation “Scale-Out” runbook with the target of ESCWA server
  • The Automation runbook will add the newly launched Micro Focus Enterprise Server instance into a PAC
  • The Automation runbook will start the imported region in the newly launched Micro Focus Enterprise Server
  • Validate that the application is listening on a service port, for example, port 9270
  • Use the Micro Focus command “castran” to run an online transaction in Micro Focus Enterprise Server to validate the service status
AfterBlockTraffic
  • Use AWS CLI to start a Systems Manager Automation “Scale-In” runbook with the target ESCWA server
  • The Automation runbook will try stopping the Region in the terminating EC2 instance
  • The Automation runbook will remove the Enterprise Server instance from the PAC

The tasks in the table are automated using PowerShell, and the scripts are used in appspec.yml config for CodeDeploy to orchestrate the deployment.
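
The port validation task in the ApplicationStart hook can be illustrated with a short Python sketch. The post implements its checks in PowerShell, so this is an analogous example rather than the authors' script; the host and port are whatever the caller passes in, with 9270 being the example port used in this post.

```python
import socket


def is_listening(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A hook script could call `is_listening("localhost", 9270)` and exit with a non-zero status when it returns False, which causes CodeDeploy to mark the lifecycle event as failed.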

In the following appspec.yml, the locations of the binary files to be installed are defined in addition to the Micro Focus Enterprise Server Region XML config file. During the AfterInstall stage, the XML config is imported into the Enterprise Server.

version: 0.0
os: windows
files:
  - source: scripts
    destination: C:\scripts\
  - source: online
    destination: C:\BANKDEMO\online\
  - source: common
    destination: C:\BANKDEMO\common\
  - source: batch
    destination: C:\BANKDEMO\batch\
  - source: scripts\BANKDEMO.xml
    destination: C:\BANKDEMO\
hooks:
  BeforeInstall:
    - location: scripts\BeforeInstall.ps1
      timeout: 300
  AfterInstall:
    - location: scripts\AfterInstall.ps1
  ApplicationStart:
    - location: scripts\ApplicationStart.ps1
      timeout: 300
  ValidateService:
    - location: scripts\ValidateServer.cmd
      timeout: 300
  AfterBlockTraffic:
    - location: scripts\AfterBlockTraffic.ps1

Using the sample Micro Focus Bankdemo application and the steps outlined above, we have set up a blue/green deployment process in Micro Focus Enterprise Server.

There are four important considerations when setting up blue/green deployment:

  1. For batch applications, the blue/green deployment should be invoked only outside of the scheduled “batch window”.
  2. For online applications, AWS CodeDeploy will deregister the Auto Scaling group from the target group of the Network Load Balancer. The deregistration may take a while, as the server has to finish processing the ongoing requests before it can continue deployment of the new application instance. In this case, enabling the Elastic Load Balancing connection draining feature with an appropriate timeout value can minimize the risk of closing unfinished transactions. In addition, consider deploying in low-traffic windows to improve deployment speed.
  3. For application changes that require updates to the database schema, the version roll-forward and rollback can be managed via DB migration tools, e.g., Flyway and Fluent Migrator.
  4. For testing in production environments, adherence to any regulatory compliance, such as full audit trail of events, must be considered.
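
The first consideration, keeping deployments outside the batch window, can be enforced with a small guard before the pipeline starts a deployment. The following Python sketch is illustrative only: the 22:00 to 04:00 window is a made-up default, not a value from this post, and a real schedule would come from the batch scheduler.

```python
from datetime import time


def in_batch_window(now, start, end):
    """Return True if `now` falls inside the window [start, end).

    Windows that cross midnight (start later than end) are handled too.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def deployment_allowed(now, start=time(22, 0), end=time(4, 0)):
    """Allow a blue/green deployment only outside the batch window."""
    return not in_batch_window(now, start, end)
```

For example, `deployment_allowed(datetime.now().time())` could gate the CodeBuild step that creates the deployment.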


In this post, we introduced the solution to use Micro Focus Enterprise Server PAC, Amazon EC2 Auto Scaling, AWS Systems Manager, and AWS CodeDeploy to automate the blue/green deployment of rehosted mainframe applications in AWS.

Through the blue/green deployment methodology, we can shift traffic between two identical clusters running different application versions in parallel. This mitigates the risks commonly associated with mainframe application deployment, namely downtime and rollback capacity, while ensuring higher code quality in production through “Shift Right” testing.

A demo of the solution is available on the AWS Partner Micro Focus website [Solution-Demo]. If you’re interested in modernizing your mainframe applications, please contact Micro Focus and AWS mainframe business development.

About the authors

Kevin Yung

Kevin is a Senior Modernization Architect in AWS Professional Services Global Mainframe and Midrange Modernization (GM3) team. Kevin currently is focusing on leading and delivering mainframe and midrange applications modernization for large enterprise customers.

Krithika Palani Selvam

Krithika is a Senior Modernization Architect in AWS Professional Services Global Mainframe and Midrange Modernization (GM3) team. She is currently working with enterprise customers for migrating and modernizing mainframe and midrange applications to cloud.

Peter Woods

Peter Woods has been with Micro Focus for over 30 years within the Application Modernisation & Connectivity portfolio. His diverse range of roles has included Technical Support, Channel Sales, Product Management, Strategic Alliances Management and Pre-Sales, and he was primarily based in the UK. In 2017 Peter relocated to Melbourne, Australia, and in his current role of AM2C APJ Regional Technical Leader and ANZ Pre-Sales Manager, he is charged with driving and supporting Application Modernisation sales activity across the APJ region.

Abraham Mercado Rondon

Abraham Rondon is a Solutions Architect working on Micro Focus Enterprise Solutions for the Application Modernization team based in Melbourne. After completing a degree in Statistics and before joining Micro Focus, Abraham had a long career supporting mainframe applications in different countries, in roles progressing from Developer to Production Support, Business and Technical Analyst, and Project Team Lead. Now a vital part of the Micro Focus Application Modernization team, he focuses mainly on cloud implementations of mainframe DevOps and production workload rehosting.

Using DevOps Automation to Deploy Lambda APIs across Accounts and Environments

Post Syndicated from Subrahmanyam Madduru original https://aws.amazon.com/blogs/architecture/using-devops-automation-to-deploy-lambda-apis-across-accounts-and-environments/

by Subrahmanyam Madduru – Global Partner Solutions Architect Leader, AWS, Sandipan Chakraborti – Senior AWS Architect, Wipro Limited, Abhishek Gautam – AWS Developer and Solutions Architect, Wipro Limited, Arati Deshmukh – AWS Architect, Infosys

As more and more enterprises adopt serverless technologies to deliver their business capabilities in a more agile manner, it is imperative to automate release processes. Multiple AWS Accounts are needed to separate and isolate workloads in production versus non-production environments. Release automation becomes critical when you have multiple business units within an enterprise, each consisting of a number of AWS accounts that are continuously deploying to production and non-production environments.

As a DevOps best practice, the DevOps engineering team responsible for build-test-deploy in a non-production environment should not release the application and infrastructure code to both non-production and production environments. This risks introducing errors in application and infrastructure deployments in production environments, which in turn results in significant rework and delays in delivering functionality and go-to-market initiatives. Deploying the code in a repeatable fashion while reducing manual error requires automating the entire release process. In this blog, we show how you can build a cross-account code pipeline that automates the releases across different environments using AWS CloudFormation templates and AWS cross-account access.

A cross-account code pipeline enables an AWS Identity and Access Management (IAM) user to assume an IAM production role using AWS Security Token Service (AWS STS; see “Managing AWS STS in an AWS Region” in the IAM documentation) to switch between non-production and production deployments as required. An automated release pipeline goes through all the release stages, from source to build to deploy, in the non-production AWS account and then calls the STS AssumeRole API (cross-account access) to get a temporary token and access to the AWS production account for deployment. This follows the least privilege model for granting role-based access through IAM policies, which ensures the secure automation of the production pipeline release.

Solution Overview

In this blog post, we show how a cross-account IAM assume role can be used to deploy AWS Lambda serverless API code into pre-production and production environments. We build on the process outlined in the blog post “Building a CI/CD pipeline for cross-account deployment of an AWS Lambda API with the Serverless Framework” by programmatically automating the deployment of Amazon API Gateway using CloudFormation templates. For this use case, we assume a single-tenant customer with separate AWS accounts to isolate pre-production and production workloads. Figure 1 represents the code pipeline workflow for our use case diagrammatically.

Figure 1. AWS cross-account CodePipeline for production and non-production workloads

Let us describe the code pipeline workflow in detail for each step noted in the preceding diagram:

  1. An IAM user belonging to the DevOps engineering team logs in to the AWS Command Line Interface (AWS CLI) from a local machine using an IAM access key and secret key.
  2. Next, the IAM user assumes the IAM role for the corresponding activities (AWS CodeCommit, AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline execution) and deploys the code for pre-production.
  3. A typical AWS CodePipeline comprises build, test, and deploy stages. In the build stage, the AWS CodeBuild service generates the CloudFormation template stack (template-export.yaml) into Amazon S3.
  4. In the deploy stage, AWS CodePipeline uses a CloudFormation template (a YAML file) to deploy the code from an S3 bucket containing the application API endpoints via Amazon API Gateway in the pre-production environment.
  5. The final step in the pipeline workflow is to deploy the application code changes to the production environment by assuming the production IAM role through AWS STS.

Since the AWS CodePipeline is fully automated, we can use the same pipeline by switching between pre-production and production accounts. These accounts assume the IAM role appropriate to the target environment and deploy the validated build to that environment using CloudFormation templates.
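
The cross-account step can be sketched in Python as follows. This is a minimal illustration, not the pipeline's own implementation: `sts` is assumed to be a boto3 STS client (`boto3.client("sts")`) passed in by the caller, and the role name, session name, and one-hour duration in the example are hypothetical.

```python
def role_arn(account_id, role_name):
    """Build the ARN of an IAM role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_production_role(sts, account_id, role_name, session_name="release"):
    """Exchange the caller's identity for temporary production credentials."""
    response = sts.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName=session_name,
        DurationSeconds=3600,  # one hour; an illustrative value
    )
    creds = response["Credentials"]
    # Keys are named so the dict can be unpacked into boto3.client(...).
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

The returned dictionary can be unpacked into a new client, for example `boto3.client("cloudformation", **credentials)`, so that subsequent calls run in the production account with the temporary token.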


Here are the prerequisites before you get started with the implementation.

  • A user with appropriate privileges (for example: Project Admin) in a production AWS account
  • A user with appropriate privileges (for example: Developer Lead) in a pre-production AWS account such as development
  • A CloudFormation template for deploying infrastructure in the pre-production account
  • A local machine with the AWS CLI installed and configured

Implementation Steps

In this section, we show how you can use AWS CodePipeline to release a serverless API in a secure manner to pre-production and production environments. Amazon CloudWatch logging will be used to monitor the events on the AWS CodePipeline.

1. Create Resources in a pre-production account

In this step, we create the required resources such as a code repository, an S3 bucket, and a KMS key in a pre-production environment.

  • Clone the code repository into your CodeCommit repository. Make the necessary changes to index.js and ensure that buildspec.yaml is present to build the artifacts.
    • Using the codebase (Lambda APIs) as input, the build outputs a CloudFormation template and environment configuration JSON files (used for configuring production and other non-production environments such as dev and test). The build artifacts are packaged into a zip file using the AWS Serverless Application Model and uploaded to an S3 bucket created for storing artifacts. Make a note of the repository name, as it will be required later.
  • Create an S3 bucket in a Region (for example, us-east-2). The pipeline uses this bucket to get and put artifacts. Make a note of the bucket name.
    • Make sure you edit the bucket policy to include your production account ID and the bucket name. Refer to the Amazon S3 bucket policy documentation to make changes to bucket policies and permissions.
  • Navigate to AWS Key Management Service (AWS KMS) and create a symmetric key.
  • Then create a new secret, configure the KMS key, and provide access to the development and production accounts. Make a note of the ARN for the key.
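
As an illustration of the bucket policy edit described above, the following Python sketch generates a policy document that lets the production account get and put artifacts. It is an assumption-laden example, not the policy from the original post: the statement ID, the choice of `s3:GetObject` and `s3:PutObject`, and the use of the account root principal are illustrative, and your pipeline may need additional actions.

```python
import json


def artifact_bucket_policy(bucket_name, production_account_id):
    """Return a bucket policy (as JSON text) for cross-account artifact access."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProductionAccountArtifacts",  # hypothetical name
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{production_account_id}:root"
                },
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

Generating the document from the two values you noted (bucket name and production account ID) avoids hand-editing ARNs in the console.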

2. Create IAM Roles in the Production Account and required policies

In this step, we create roles and policies required to deploy the code.

    "Version": "2012-10-17",
    "Statement": [
        "Effect": "Allow",
        "Action": [
      "Resource": [
        "Your KMS Key ARN you created in Development Account"

Once you’ve created both policies, attach them to the previously created cross-account role.

3. Create a CloudFormation Deployment role

In this step, you need to create another IAM role, “CloudFormationDeploymentRole” for Application deployment. Then attach the following four policies to it.

Policy 1: For CloudFormation to deploy the application in the Production account

  "Version": "2012-10-17",
  "Statement": [
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
      "Resource": "arn:aws:cloudformation:us-east-2:Your_Production_AccountID:stack/DevOps-Automation-API*/*"

Policy 2: For CloudFormation to perform required IAM actions

  "Version": "2012-10-17",
  "Statement": [
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
      "Resource": "*"

Policy 3: Lambda function service invocation policy

  "Version": "2012-10-17",
  "Statement": [
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
      "Resource": "arn:aws:lambda:us-east-2:Your_Production_AccountID:function:SampleApplication*"

Policy 4: API Gateway service invocation policy

  "Version": "2012-10-17",
  "Statement": [
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
      "Resource": [
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
      "Resource": "arn:aws:apigateway:*::/restapis/*/resources/*/methods/*/responses/*"
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
      "Resource": "arn:aws:apigateway:*::/restapis/*"
      "Sid": "VisualEditor3",
      "Effect": "Allow",
      "Action": [
      "Resource": "arn:aws:apigateway:*::/restapis/*/resources/*/methods/*"

Make sure you also attach the S3 read/write access and KMS policies created in Step 2 to the CloudFormationDeploymentRole.

4. Set up and launch CodePipeline

You can launch the CodePipeline either manually in the AWS console using “Launch Stack” or programmatically from the AWS CLI.

On your local machine, open a terminal or command prompt and run this command:

aws cloudformation deploy --template-file <Path to pipeline.yaml> --region us-east-2 --stack-name <Name_Of_Your_Stack> --capabilities CAPABILITY_IAM --parameter-overrides ArtifactBucketName=<Your_Artifact_Bucket_Name> ArtifactEncryptionKeyArn=<Your_KMS_Key_ARN> ProductionAccountId=<Your_Production_Account_ID> ApplicationRepositoryName=<Your_Repository_Name> RepositoryBranch=master

If you have configured a profile in the AWS CLI, specify that profile when running the command:

--profile <your_profile_name>
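
When the launch is scripted, building the command as an argument list avoids copy-paste problems with dashes and quoting. The following Python sketch assembles the same `aws cloudformation deploy` invocation; the function name and its defaults are hypothetical, and the parameter keys are the ones shown in the command above.

```python
def deploy_command(template, stack_name, parameters, region="us-east-2", profile=None):
    """Build the argv list for `aws cloudformation deploy`."""
    argv = [
        "aws", "cloudformation", "deploy",
        "--template-file", template,
        "--region", region,
        "--stack-name", stack_name,
        "--capabilities", "CAPABILITY_IAM",
    ]
    if parameters:
        argv.append("--parameter-overrides")
        # Each override is passed as a separate Key=Value argument.
        argv += [f"{key}={value}" for key, value in parameters.items()]
    if profile:
        argv += ["--profile", profile]
    return argv
```

The list can be handed to `subprocess.run(argv, check=True)`, which raises an exception if the deployment command exits with an error.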

After launching the pipeline, your serverless API gets deployed in pre-production as well as in the production Accounts. You can check the deployment of your API in production or pre-production Account, by navigating to the API Gateway in the AWS console and looking for your API in the Region where it was deployed.

Figure 2. Check your deployment in pre-production/production environment

Then select your API and navigate to stages, to view the published API with an endpoint. Then validate your API response by selecting the API link.

Figure 3. Check whether your API is being published in pre-production/production environment

Alternatively, you can reach your APIs through the CloudFormation stack of your deployed application by selecting the link for the API in the Resources tab.


If you are trying this out in your AWS accounts, make sure to delete all the resources created during this exercise to avoid incurring any AWS charges.


In this blog, we showed how to build a cross-account code pipeline to automate releases across different environments using AWS CloudFormation templates and AWS cross-account access. You also learned how serverless APIs can be securely deployed across pre-production and production accounts. This helps enterprises automate release deployments in a repeatable and agile manner, reduce manual errors, and deliver business capabilities more quickly.

Automate code reviews with Amazon CodeGuru Reviewer

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/devops/automate-code-reviews-with-amazon-codeguru-reviewer/

A common problem in software development is unintentionally merging code with bugs, defects, or security vulnerabilities into your main branch. Faulty lines of code deployed to the production environment can cause severe outages in running applications, and finding and mitigating them can cost unnecessary time and effort.

Amazon CodeGuru Reviewer tackles this issue using automated code reviews, which allows developers to fix the issue based on automated CodeGuru recommendations before the code moves to production.

This post demonstrates how to use CodeGuru for automated code reviews and uses an AWS CodeCommit approval process to set up a code approval governance model.

Solution overview

In this post, you create an end-to-end code approval workflow and add required approvers to your repository pull requests. This can help you identify and mitigate issues before they’re merged into your main branches.

Let’s discuss the core services highlighted in our solution. CodeGuru Reviewer is a machine learning-based service for automated code reviews and application performance recommendations. CodeCommit is a fully managed and secure source control repository service. It eliminates the need to scale infrastructure to support highly available and critical code repository systems. CodeCommit allows you to configure approval rules on pull requests. Approval rules act as a gatekeeper on your source code changes. Pull requests that fail to satisfy the required approvals can’t be merged into your main branch for production deployment.

The following diagram illustrates the architecture of this solution.

With CodeCommit repository, creating a pull request and approval rule. Then run the workflow to test the code, review CodeGuru recommendations to make appropriate changes, and run the workflow again to confirm that the code is ready to be merged

The solution has three personas:

  • Repository admin – Sets up the code repository in CodeCommit
  • Developer – Develops the code and uses pull requests in the main branch to move the code to production
  • Code approver – Completes the code review based on the recommendations from CodeGuru and either approves the code or asks for fixes for the issue

The solution workflow contains the following steps:

  1. The repository admin sets up the workflow, including a code repository in CodeCommit for the development group, required access to check in their code to the dev branch, integration of the CodeCommit repository with CodeGuru, and approval details.
  2. Developers develop the code and check in their code in the dev branch. This creates a pull request to merge the code in the main branch.
  3. CodeGuru analyzes the code and reports any issues, along with recommendations based on the code quality.
  4. The code approver analyzes the CodeGuru recommendations and provides comments for how to fix the issue in the code.
  5. The developers fix the issue based on the feedback they received from the code approver.
  6. The code approver analyzes the CodeGuru recommendations of the updated code. They approve the code to merge if everything is okay.
  7. The code gets merged in the main branch upon approval from all approvers.
  8. An AWS CodePipeline pipeline is triggered to move the code to the preproduction or production environment based on its configuration.

In the following sections, we walk you through configuring the CodeCommit repository and creating a pull request and approval rule. We then run the workflow to test the code, review recommendations and make appropriate changes, and run the workflow again to confirm that the code is ready to be merged.


Before we get started, we create an AWS Cloud9 development environment, which we use to check in the Python code for this solution. The sample Python code for the exercise is available at the link. Download the .py files to a local folder.

Complete the following steps to set up the prerequisite resources:

  1. Set up your AWS Cloud9 environment and access the bash terminal, preferably in the us-east-1 Region.
  2. Create three AWS Identity and Access Management (IAM) users and their roles for the repository admin, developer, and approver by running the AWS CloudFormation template.

Configuring IAM roles and users

  1. Sign in to the AWS Management Console.
  2. Download ‘Persona_Users.yaml’ from GitHub.
  3. Navigate to AWS CloudFormation, open the Create Stack drop-down menu, and choose With new resources (Standard).
  4. Choose Upload a template file and upload the file from your local machine.
  5. Enter a Stack Name such as ‘Automate-code-reviews-codeguru-blog’.
  6. Enter a temporary password for the IAM users.
  7. Choose Next to accept the remaining default options.
  8. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create Stack.

This template creates three IAM users (repository admin, code approver, and developer) that are required at different steps of this post.

Configure the CodeCommit repository

Let’s start with the CodeCommit repository. The repository works as the source control for the Java and Python code.

  1. Sign in to the AWS Management Console as the repository admin.
  2. On the CodeCommit console, choose Getting started in the navigation pane.
  3. Choose Create repository.

Creating AWS CodeCommit create a new repository using AWS Console

  1. For Repository name, enter transaction_alert_repo.
  2. Select Enable Amazon CodeGuru Reviewer for Java and Python – optional.
  3. Choose Create.

create CodeCommit repository named transaction_alert_repo, check box on Enable Amazon CodeGuru Reviewer for Java and Python

The repository is created.

  1. On the repository details page, choose Clone HTTPS on the Clone URL menu.

clone HTTPS link for CodeCommit repo transaction_alert_repo using clone URL menu

  1. Copy the URL to use in the next step to clone the repository in the development environment.

Clone link for HTTPS for CodeCommit repo transaction_alert_repo is available to copy

  1. On the CodeGuru console, choose Repositories in the navigation pane under Reviewer.

You can see our CodeCommit repository is associated with CodeGuru.

CodeCommit repository is to be associated with CodeGuru

  1. Sign in to the console as the developer.
  2. On the AWS Cloud9 console, clone the repository, using the URL that you copied in the previous step.

This action clones the repository and creates the transaction_alert_repo folder in the environment.

git clone https://git-codecommit.us-east-.amazonaws.com/v1/repos/transaction_alert_repo
cd transaction_alert_repo
echo "This is a test file" > README.md
git add -A
git commit -m "initial setup"
git push

git clone CodeCommit repo to Cloud9 using git clone command, readme.md file is created locally and pushed back to CodeCommit repo

  1. Check the file in CodeCommit to confirm that the README.md file is copied and available in the CodeCommit repository.

CodeCommit repo is now pushed with readme.md file

  1. In the AWS Cloud9 environment, choose the transaction_alert_repo folder.
  2. On the File menu, choose Upload Local Files to upload the Python files from your local folder (which you downloaded earlier).

Upload downloaded python test files that we are going to use for this blog from local system to Cloud9

  1. Choose Select files and upload read_file.py and read_rule.py.

Drag and drop python files on cloud9 upload UI

  1. You can see that both files are copied in the AWS Cloud9 environment under the transaction_alert_repo folder. Create a dev branch and push the files to CodeCommit:
git checkout -b dev
git add -A
git commit -m "initial import of files"
git push --set-upstream origin dev

Local Python files are pushed to the CodeCommit repo using the git push command

  1. Check the CodeCommit console to confirm that the read_file.py and read_rule.py files are copied in the repository.

Check the CodeCommit console to verify these pushed files are available

Create a pull request

Now we create our pull request.

  1. On the CodeCommit console, navigate to your repository and choose Pull requests in the navigation pane.
  2. Choose Create pull request.

Create pull request for the new files added

  1. For Destination, choose master.
  2. For Source, choose dev.
  3. Choose Compare to see any conflict details in merging the request.

Pull request is visible to master branch, ready to merge

  1. If the environments are mergeable, enter a title and description.
  2. Choose Create pull request.

Pull request is merged and CodeGuru recommendation is triggered

Create an approval rule

We now create an approval rule as the repository admin.

  1. Sign in to the console as the repository admin.
  2. On the CodeCommit console, navigate to the pull request you created.
  3. On the Approvals tab, choose Create approval rule.

Creating new Approval rule for any merge action

  1. For Rule name, enter Require an approval before merge.
  2. For Number of approvals needed, enter 1.
  3. Under Approval pool members, provide an IAM ARN value for the code approver.
  4. Choose Create.

Approval Rule mentions, requires an approval before merge
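
The same approval rule can also be created programmatically. The Python sketch below builds the JSON document that CodeCommit uses as approval rule content; the approver ARN in the usage note is a placeholder, and the helper name is hypothetical.

```python
import json


def approval_rule_content(approver_arns, approvals_needed=1):
    """Return the JSON text CodeCommit expects as approval rule content."""
    if approvals_needed < 1:
        raise ValueError("at least one approval must be required")
    content = {
        "Version": "2018-11-08",
        "Statements": [
            {
                "Type": "Approvers",
                "NumberOfApprovalsNeeded": approvals_needed,
                # ARNs of the users or roles allowed to approve.
                "ApprovalPoolMembers": list(approver_arns),
            }
        ],
    }
    return json.dumps(content)
```

With boto3, the result can be passed as `approvalRuleContent` to the CodeCommit client's `create_pull_request_approval_rule` call, together with `pullRequestId` and `approvalRuleName`, for example with a pool of `["arn:aws:iam::123456789012:user/code-approver"]`.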

Review recommendations

We can now view any recommendations regarding our pull request code review.

  1. As the repository admin, on the CodeGuru console, choose Code reviews in the navigation pane.
  2. On the Pull request tab, confirm that the code review is completed, as it might take some time to process.
  3. To review recommendations, choose the completed code review.

Check CodeGuru recommendation to see available recommendation

You can now review the recommendation details, as shown in the following screenshot.

Review CodeGuru review recommendation details
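
Recommendations can also be read outside the console. The following Python sketch groups them by file so that an approver can scan one file at a time. It assumes `reviewer` is a boto3 CodeGuru Reviewer client (`boto3.client("codeguru-reviewer")`) and that the caller already knows the code review ARN; the helper name is hypothetical.

```python
from collections import defaultdict


def recommendations_by_file(reviewer, code_review_arn):
    """Group CodeGuru Reviewer recommendations by the file they refer to."""
    grouped = defaultdict(list)
    kwargs = {"CodeReviewArn": code_review_arn}
    while True:
        page = reviewer.list_recommendations(**kwargs)
        for item in page.get("RecommendationSummaries", []):
            grouped[item["FilePath"]].append(
                (item["StartLine"], item["Description"])
            )
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # request the next page of results
    return dict(grouped)
```

Each value is a list of (start line, description) pairs, which maps directly onto the per-line comments an approver leaves on the pull request.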

  1. Sign in to the console as the code approver.
  2. Navigate to the pull request to view its details.

Check pull request in detail, check status and approval status

  1. On the Changes tab, confirm that the CodeGuru recommendation files are available.

confirm that the CodeGuru recommendation files are available

  1. Check the details of each recommendation and provide any comments in the New comment section.

The developer can see this comment as feedback from the approver to fix the issue.

  1. Choose Save.

In CodeGuru console developer can see this comment as feedback from the approver to fix the issue

  1. Enter any overall comments regarding the changes and choose Save.

Enter any overall comments regarding the changes and choose save

  1. Sign in to the console as the developer.
  2. On the CodeCommit console, navigate to Pull requests, select the pull request, and choose Changes to review the approver feedback.

click on Changes to review the approver feedback in CodeCommit console

Make changes, rerun the code review, and merge the environments

Let’s say the developer makes the required changes in the code to address the issue and uploads the new code in the AWS Cloud9 environment. If CodeGuru doesn’t find additional issues, we can merge the environments.

  1. Run the following commands to push the updated code to CodeCommit:
git add -A
git commit -m "code-fixed"
git push --set-upstream origin dev

git clone CodeCommit repo to Cloud9 using git clone command, readme.md file is created locally and pushed back to CodeCommit repo

  1. Sign in to the console as the approver.
  2. Navigate to the code review.

CodeGuru hasn’t found any issue in the updated code, so there are no recommendations.

CodeGuru hasn’t found any issue in the updated code, so there are no recommendations available this time

  1. On the CodeCommit console, you can verify the code and provide your approval comment.
  2. Choose Save.

Using CodeCommit console, code can be now verified for approval

  1. On the pull request details page, choose Approve.

New code is found with no conflict and can be approved

Now the developer can see on the CodeCommit console that the pull request is approved.

Code pull request is in Approved status

  1. Sign in to the console as the developer. On the pull request details page, choose Merge.

Approved code is ready to be merged

  1. Select your merge strategy. For this post, we select Fast forward merge.
  2. Choose Merge pull request.

Fast forward merge is used to merge the code

You can see a success message.

Success message is generated for successful merge

  1. On the CodeCommit console, choose Code in the navigation pane for your repository.
  2. Choose master from the branch list.

The read_file.py and read_rule.py files are available under the master branch.

The new files are also available in the master branch because of the successful merge

Clean up the resources

To avoid incurring future charges, remove the resources created by this solution by


This post highlighted the benefits of CodeGuru automated code reviews. You created an end-to-end code approval workflow and added required approvers to your repository pull requests. This solution can help you identify and mitigate issues before they’re merged into your main branches.

You can get started from the CodeGuru console by integrating CodeGuru Reviewer with your supported CI/CD pipeline.

For more information about automating code reviews, check out the documentation.

About the Authors

Dhiraj Thakur

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Akshay Goel

Akshay is a Cloud Support Associate with Amazon Web Services, working closely with all AWS deployment services. He loves to play with, test, create, modify, and simplify solutions to make tasks easy and interesting.

Sameer Goel

Sameer is a Sr. Solutions Architect in the Netherlands, who drives customer success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from NEU Boston, with a concentration in data science. He enjoys building and experimenting with AI/ML projects on Raspberry Pi.

Deploy and Manage Gitlab Runners on Amazon EC2

Post Syndicated from Sylvia Qi original https://aws.amazon.com/blogs/devops/deploy-and-manage-gitlab-runners-on-amazon-ec2/

Gitlab CI is a tool utilized by many enterprises to automate their continuous integration, continuous delivery, and deployment (CI/CD) processes. A Gitlab CI/CD pipeline consists of two major components: a .gitlab-ci.yml file describing a pipeline’s jobs, and a Gitlab Runner, an application that executes the pipeline jobs.

Setting up the Gitlab Runner is a time-consuming process. It involves provisioning the necessary infrastructure, installing the necessary software to run pipeline workloads, and configuring the runner. For enterprises running hundreds of pipelines across multiple environments, it is essential to automate the Gitlab Runner deployment process so that runners can be deployed quickly in a repeatable, consistent manner.

This post will guide you through utilizing Infrastructure-as-Code (IaC) to automate Gitlab Runner deployment and administrative tasks on Amazon EC2. With IaC, you can quickly and consistently deploy the entire Gitlab Runner architecture by running a script. You can track and manage changes efficiently. And, you can enforce guardrails and best practices via code. The solution presented here also offers autoscaling so that you save costs by terminating resources when not in use. You will learn:

  • How to deploy Gitlab Runner quickly and consistently across multiple AWS accounts.
  • How to enforce guardrails and best practices on the Gitlab Runner through IaC.
  • How to autoscale Gitlab Runner based on workloads to ensure best performance and save costs.

This post comes from a DevOps engineer perspective, and assumes that the engineer is familiar with the practices and tools of IaC and CI/CD.

Overview of the solution

The following diagram displays the solution architecture. We use AWS CloudFormation to describe the infrastructure that is hosting the Gitlab Runner. The main steps are as follows:

  1. The user runs a deploy script in order to deploy the CloudFormation template. The template is parameterized, and the parameters are defined in a properties file. The properties file specifies the infrastructure configuration, as well as the environment in which to deploy the template.
  2. The deploy script calls CloudFormation CreateStack API to create a Gitlab Runner stack in the specified environment.
  3. During stack creation, an EC2 autoscaling group is created with the desired number of EC2 instances. Each instance is launched via a launch template, which is created with values from the properties file. An IAM role is created and attached to the EC2 instance. The role contains permissions required for the Gitlab Runner to execute pipeline jobs. A lifecycle hook is attached to the autoscaling group on instance termination events. This ensures graceful instance termination.
  4. During instance launch, CloudFormation uses a cfn-init helper script to install and configure the Gitlab Runner:
    1. cfn-init installs the Gitlab Runner software on the EC2 instance.
    2. cfn-init configures the Gitlab Runner as a docker executor using a pre-defined docker image in the Gitlab Container Registry. The docker executor implementation lets the Gitlab Runner run each build in a separate and isolated container. The docker image contains the software required to run the pipeline workloads, thereby eliminating the need to install these packages during each build.
    3. cfn-init registers the Gitlab Runner to Gitlab projects specified in the properties file, so that these projects can utilize the Gitlab Runner to run pipelines.
  5. The user may repeat the same steps to deploy Gitlab Runner into another environment.

Architecture diagram previously explained in post.


This walkthrough will demonstrate how to deploy the Gitlab Runner, and how easy it is to conduct Gitlab Runner administrative tasks via this architecture. We will walk through the following tasks:

  • Build a docker executor image for the Gitlab Runner.
  • Deploy the Gitlab Runner stack.
  • Update the Gitlab Runner.
  • Terminate the Gitlab Runner.
  • Add/Remove Gitlab projects from the Gitlab Runner.
  • Autoscale the Gitlab Runner based on workloads.

The code in this post is available at https://github.com/aws-samples/amazon-ec2-gitlab-runner.git


For this walkthrough, you need the following:

  • A Gitlab account (all tiers including Gitlab Free self-managed, Gitlab Free SaaS, and higher tiers). This demo uses the gitlab.com free tier.
  • A Gitlab Container Registry.
  • Git client to clone the source code provided.
  • An AWS account with local credentials properly configured (typically under ~/.aws/credentials).
  • The latest version of the AWS CLI. For more information, see Installing, updating, and uninstalling the AWS CLI.
  • Docker is installed and running on the localhost/laptop.
  • Nodejs and npm installed on the localhost/laptop.
  • A VPC with 2 private subnets that is connected to the internet via a NAT gateway allowing outbound traffic.
  • The following IAM service-linked role created in the AWS account: AWSServiceRoleForAutoScaling
  • An Amazon S3 bucket for storing Lambda deployment packages.
  • Familiarity with Git, Gitlab CI/CD, Docker, EC2, CloudFormation and Amazon CloudWatch.

Build a docker executor image for the Gitlab Runner

The Gitlab Runner in this solution is implemented as a Docker executor. The Docker executor connects to Docker Engine and runs each build in a separate and isolated container via a predefined docker image. The first step in deploying the Gitlab Runner is building a docker executor image. We provide a simple Dockerfile for building this image. You may customize the Dockerfile to install your own requirements.

To build a docker image using the sample Dockerfile:

  1. Create a directory where we will store our demo code. From your terminal run:
mkdir demo-repos && cd demo-repos
  2. Clone the source code repository found in the following location:
git clone https://github.com/aws-samples/amazon-ec2-gitlab-runner.git
  3. Create a new project on your Gitlab server. Name the project any name you like.
  4. Clone your newly created repo to your laptop. Ignore the warning about cloning an empty repository.
git clone <your-repo-url>
  5. Copy the demo repo files into your newly created repo on your laptop, and push them to your Gitlab repository. You may customize the Dockerfile before pushing it to Gitlab.
cp -r amazon-ec2-gitlab-runner/* <your-repo-dir>
cd <your-repo-dir>
git add .
git commit -m "Initial commit"
git push
  6. On the Gitlab console, go to your repository’s Packages & Registries > Container Registry. Follow the instructions provided on the Container Registry page in order to build and push a docker image to your repository’s container registry.

Deploy the Gitlab Runner stack

Once the docker executor image has been pushed to the Gitlab Container Registry, we can deploy the Gitlab Runner. The Gitlab Runner infrastructure is described in the Cloudformation template gitlab-runner.yaml. Its configuration is stored in a properties file called sample-runner.properties. A launch template is created with the values in the properties file. Then it is used to launch instances. This architecture lets you deploy Gitlab Runner to as many environments as you like by utilizing the configurations provided in the appropriate properties files.

During the provisioning process, we utilize a cfn-init helper script to run a series of commands that install and configure the Gitlab Runner.

              command: sudo yum -y install docker
              command: sudo service docker start
              command: sudo wget -O /usr/bin/gitlab-runner https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64
              command: sudo chmod a+x /usr/bin/gitlab-runner
              command: sudo useradd --comment 'GitLab Runner' --create-home gitlab-runner --shell /bin/bash
              command: sudo gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner
              command: !Sub 'aws configure set default.region ${AWS::Region}'
              command: !Sub 
                - |
                  for GitlabGroupToken in `aws ssm get-parameters --names /${AWS::StackName}/ci-tokens --query 'Parameters[0].Value' | sed -e "s/\"//g" | sed "s/,/ /g"`;do
                      sudo gitlab-runner register \
                      --non-interactive \
                      --url "${GitlabServerURL}" \
                      --registration-token $GitlabGroupToken \
                      --executor "docker" \
                      --docker-image "${DockerImagePath}" \
                      --description "Gitlab Runner with Docker Executor" \
                      --locked="${isLOCKED}" --access-level "${ACCESS}" \
                      --docker-volumes "/var/run/docker.sock:/var/run/docker.sock" \
                      --tag-list "${RunnerEnvironment}-${RunnerVersion}-docker"
                  done
                - isLOCKED: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, isLOCKED]
                  ACCESS: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, ACCESS]                              
              command: sudo gitlab-runner start

The helper script ensures that the Gitlab Runner setup is consistent and repeatable for each deployment. If a configuration change is required, users simply update the configuration steps and redeploy the stack. Furthermore, all changes are tracked in Git, which allows for versioning of the Gitlab Runner.
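
The registration loop above relies on a small piece of text processing: the two sed calls strip the double quotes from the SSM parameter value and turn commas into spaces, so that the shell loop sees one token per word. As a minimal sketch of that same transformation in Python, assuming the parameter holds a single comma-separated string (the token values below are made up):

```python
def split_tokens(raw_value):
    """Mirror the sed pipeline: drop double quotes, turn commas into spaces, split."""
    return raw_value.replace('"', "").replace(",", " ").split()

# The CLI prints the parameter value as a quoted JSON string.
tokens = split_tokens('"tokenA,tokenB"')  # hypothetical token values
for token in tokens:
    # One runner registration would be issued per token.
    print("would register runner with token:", token)
```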

To deploy the Gitlab Runner stack:

  1. Obtain the runner registration tokens of the Gitlab projects that you want registered to the Gitlab Runner. Obtain the token by selecting the project’s Settings > CI/CD and expanding the Runners section.
  2. Update the sample-runner.properties file parameters according to your own environment. Refer to the gitlab-runner.yaml file for a description of these parameters. Rename the file if you like. You may also create an additional properties file for deploying into other environments.
  3. Run the deploy script to deploy the runner:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

<properties-file> is the name of the properties file.

<region> is the region where you want to deploy the stack.

<aws-profile> is the name of the CLI profile you set up in the prerequisites section.

<stack-name> is the name you chose for the CloudFormation stack.

For example:

./deploy-runner.sh sample-runner.properties us-east-1 dev amazon-ec2-gitlab-runner-demo
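
The internals of deploy-runner.sh are not shown in this post. As a rough illustration of what a script like it might do, the sketch below turns a properties file into the parameter list that the CloudFormation CreateStack API expects. It assumes the file uses simple KEY=VALUE lines, which is an assumption rather than something confirmed here; the sample values and the capability shown in the comment are also hypothetical.

```python
def parse_properties(text):
    """Parse KEY=VALUE lines into CloudFormation Parameters entries."""
    params = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        params.append({"ParameterKey": key.strip(),
                       "ParameterValue": value.strip()})
    return params

# Hypothetical file contents: the parameter names appear in this post,
# the values are made up.
sample = """
# runner settings
InstanceType=t3.medium
VolumeSize=100
"""
parameters = parse_properties(sample)

# import boto3
# boto3.client("cloudformation", region_name="us-east-1").create_stack(
#     StackName="amazon-ec2-gitlab-runner-demo",
#     TemplateBody=open("gitlab-runner.yaml").read(),
#     Parameters=parameters,
#     Capabilities=["CAPABILITY_NAMED_IAM"],  # assumed, since the stack creates an IAM role
# )
```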

After the stack is deployed successfully, you will see the Gitlab Runner autoscaling group created in the EC2 console:

Under your Gitlab project’s Settings > CI/CD > Runners > Available specific runners, you will see the fully configured Gitlab Runner. The green circle indicates that the Gitlab Runner is ready for use.

Updating the Gitlab Runner

There are times when you would want to update the Gitlab Runner. For example, updating the instance VolumeSize in order to resolve a disk space issue, or updating the AMI ID when a new AMI becomes available.

Utilizing the properties file and launch template makes it easy to update the Gitlab Runner. Simply update the Gitlab Runner configuration parameters in the properties file. Then, run the deploy script to update the Gitlab Runner stack. To ensure that the changes take effect immediately (i.e., existing instances are replaced by new instances with the new configuration), we utilize an AutoScalingRollingUpdate update policy to automatically update the instances in the autoscaling group.

    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: !Ref MinInstancesInService
        MaxBatchSize: !Ref MaxBatchSize
        PauseTime: "PT5M"
        WaitOnResourceSignals: true
        SuspendProcesses:
          - HealthCheck
          - ReplaceUnhealthy
          - AZRebalance
          - AlarmNotification
          - ScheduledActions

The policy tells CloudFormation that when changes are detected in the launch template, it should update the instances in batches of MaxBatchSize, while keeping a number of instances (specified in MinInstancesInService) in service during the update.

Below is an example of updating the Gitlab Runner instance type.

To update the instance type of the runner instance:

  1. Update the “InstanceType” parameter in the properties file.


  2. Run the deploy-runner.sh script to update the CloudFormation stack:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

In the CloudFormation console, you will see that the launch template is updated first, then a rolling update is initiated. The instance type update requires a replacement of the original instance, so a temporary instance is launched and put in service. The temporary instance is then terminated once the new instance has launched successfully.

After the update is complete, you will see that on the Gitlab project’s console, the old Gitlab Runner, ez_5x8Rv, is replaced by the new Gitlab Runner, N1_UQ7yc.

Terminate the Gitlab Runner

There are times when an autoscaling group instance must be terminated. For example, during an autoscaling scale-in event, or when the instance is being replaced by a new instance during a stack update, as seen previously. When terminating an instance, you must ensure that the Gitlab Runner finishes executing any running jobs before the instance is terminated, otherwise your environment could be left in an inconsistent state. Also, we want to ensure that the terminated Gitlab Runner is removed from the Gitlab project. We utilize an autoscaling lifecycle hook to achieve these goals.

The lifecycle hook works like this: A CloudWatch event rule listens for EC2 Instance-terminate events. When one is detected, the event rule triggers a Lambda function. The Lambda function calls SSM Run Command to run a series of commands on the EC2 instance, via an SSM document. The commands include stopping the Gitlab Runner gracefully when all running jobs are finished, de-registering the runner from Gitlab projects, and signaling the autoscaling group to terminate the instance.
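
To make the flow more concrete, here is a minimal sketch of what such a Lambda function could look like in Python. It is not the function shipped with the sample repository: the SSM document name `GitlabRunnerTerminateDoc` and the document's parameter names are hypothetical, and the client is injectable so the logic can be exercised without AWS access.

```python
def build_run_command(event, document_name="GitlabRunnerTerminateDoc"):
    """Turn an 'EC2 Instance-terminate Lifecycle Action' event into keyword
    arguments for SSM send_command. The document name and the document's
    parameter names are hypothetical."""
    detail = event["detail"]
    return {
        "InstanceIds": [detail["EC2InstanceId"]],
        "DocumentName": document_name,
        "Parameters": {
            # SSM expects every parameter value to be a list of strings.
            "AutoScalingGroupName": [detail["AutoScalingGroupName"]],
            "LifecycleHookName": [detail["LifecycleHookName"]],
            "LifecycleActionToken": [detail["LifecycleActionToken"]],
        },
    }

def handler(event, context, ssm_client=None):
    """Lambda entry point; ssm_client is injectable for local testing."""
    if ssm_client is None:
        import boto3  # imported lazily so the sketch runs without boto3
        ssm_client = boto3.client("ssm")
    return ssm_client.send_command(**build_run_command(event))

# Made-up event carrying the fields that lifecycle action events include.
sample_event = {
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "EC2InstanceId": "i-0123456789abcdef0",
        "AutoScalingGroupName": "demo-asg",
        "LifecycleHookName": "demo-hook",
        "LifecycleActionToken": "00000000-0000-0000-0000-000000000000",
    },
}
command = build_run_command(sample_event)
```

Passing the hook name and action token to the document is what would let the final command on the instance signal the autoscaling group that cleanup is finished.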

There are also times when you want to terminate an instance manually. For example, when an instance is suspected to not be functioning properly. To terminate an instance from the Gitlab Runner autoscaling group, use the following command:

aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id="${InstanceId}" \
    --no-should-decrement-desired-capacity \
    --region="${region}"

The above command terminates the instance. The lifecycle hook ensures that the cleanup steps are conducted properly, and the autoscaling group launches another new instance to replace the old one.
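
The same call is available from code. A small sketch, assuming Python with boto3; the instance ID in the comment is a placeholder. The client is passed in as an argument so the function can be tried against a stub.

```python
def terminate_runner_instance(autoscaling_client, instance_id):
    """Terminate through the Auto Scaling API so that the lifecycle hook
    fires and the group launches a replacement instance."""
    return autoscaling_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,  # keep the group size unchanged
    )

# import boto3
# terminate_runner_instance(
#     boto3.client("autoscaling", region_name="us-east-1"),
#     "i-0123456789abcdef0",  # hypothetical instance ID
# )
```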

Note that if you terminate the instance by using the “aws ec2 terminate-instances” command, then the autoscaling lifecycle hook actions will not be triggered.

Add/Remove Gitlab projects from the Gitlab Runner

As new projects are added to your enterprise, you may want to register them to the Gitlab Runner, so that those projects can utilize the Gitlab Runner to run pipelines. On the other hand, you would want to remove the Gitlab Runner from a project if it no longer wants to utilize the Gitlab Runner, or if it no longer qualifies to utilize the Gitlab Runner. For example, a project may no longer be allowed to deploy to an environment configured by the Gitlab Runner. Our architecture offers a simple way to add and remove projects from the Gitlab Runner. To add new projects to the Gitlab Runner, update the RunnerRegistrationTokens parameter in the properties file, and then rerun the deploy script to update the Gitlab Runner stack.

To add new projects to the Gitlab Runner:

  1. Update the RunnerRegistrationTokens parameter in the properties file. For example:
  2. Update the Gitlab Runner stack. This updates the SSM parameter which stores the tokens.
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name>
  3. Relaunch the instances in the Gitlab Runner autoscaling group. The new instances will use the new RunnerRegistrationTokens value. Run the following command to relaunch the instances:
./cycle-runner.sh <runner-autoscaling-group-name> <region> <optional-aws-profile>

To remove projects from the Gitlab Runner, follow the steps described above, with just one difference. Instead of adding new tokens to the RunnerRegistrationTokens parameter, remove the token(s) of the project that you want to dissociate from the runner.

Autoscale the runner based on custom performance metrics

Each Gitlab Runner can be configured to handle a fixed number of concurrent jobs. Once this capacity is reached for every runner, any new jobs will be in a Queued/Waiting status until the current jobs complete, which would be a poor experience for our team. Setting the number of concurrent jobs too high on our runners would also result in a poor experience, because all jobs leverage the same CPU, memory, and storage in order to conduct the builds.

In this solution, we utilize a scheduled Lambda function that runs every minute to inspect the number of jobs running on every runner, leveraging the Prometheus metrics endpoint that the runners expose. If we approach the concurrent build limit of the group, then we increase the Autoscaling Group size so that it can take on more work. As the number of concurrent jobs decreases, the scheduled Lambda function scales the Autoscaling Group back in to minimize cost. The scale-up operation ignores the Autoscaling Group’s cooldown period, which helps ensure that our team is not waiting on a new instance, whereas the scale-down operation obeys the group’s cooldown period.
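
The scaling decision can be sketched in a few lines. The example below is an illustration rather than the Lambda function from the repository: the metric name `gitlab_runner_jobs`, the sample metrics text, and the 80% headroom rule are all assumptions, and the parser assumes samples carry no trailing timestamp.

```python
import math

def count_running_jobs(metrics_text, metric="gitlab_runner_jobs"):
    """Sum every sample of `metric` in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            total += float(line.rsplit(" ", 1)[1])
    return int(total)

def desired_capacity(running_jobs, jobs_per_runner, min_size, max_size,
                     headroom=0.8):
    """Size the group so that runners stay below `headroom` of their job limit."""
    needed = math.ceil(running_jobs / (jobs_per_runner * headroom))
    return max(min_size, min(max_size, needed))

# Made-up metrics text standing in for the runners' metrics endpoints.
sample = """# TYPE gitlab_runner_jobs gauge
gitlab_runner_jobs{runner="a",state="running"} 3
gitlab_runner_jobs{runner="b",state="running"} 2
gitlab_runner_concurrent 4
"""
jobs = count_running_jobs(sample)
capacity = desired_capacity(jobs, jobs_per_runner=4, min_size=1, max_size=5)
print(jobs, capacity)
```

Clamping the result to the group's minimum and maximum sizes keeps a noisy metric from scaling the group outside its configured bounds.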

Here is the logical sequence diagram for the work:

Sequence diagram

For operational monitoring, the Lambda function also publishes custom CloudWatch Metrics for the count of active jobs, along with the target and actual capacities of the Autoscaling group. We can utilize this information to validate that the system is working properly and determine if we need to modify any of our autoscaling parameters.
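
As a sketch of how such metrics could be published, the snippet below builds the payload for CloudWatch's `put_metric_data` call. The metric names, the dimension, and the namespace are hypothetical choices made for illustration, not the names used by the solution.

```python
def build_metric_data(group_name, active_jobs, desired, in_service):
    """Build the MetricData list for CloudWatch put_metric_data."""
    dimensions = [{"Name": "AutoScalingGroupName", "Value": group_name}]
    samples = {
        "ActiveJobs": active_jobs,        # hypothetical metric names
        "DesiredCapacity": desired,
        "InServiceCapacity": in_service,
    }
    return [
        {"MetricName": name, "Dimensions": dimensions,
         "Value": float(value), "Unit": "Count"}
        for name, value in samples.items()
    ]

metric_data = build_metric_data("demo-asg", active_jobs=5, desired=2, in_service=1)

# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="GitlabRunner/Autoscaling",  # hypothetical namespace
#     MetricData=metric_data,
# )
```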

Congratulations! You have completed the walkthrough. Take some time to review the resources you have deployed, and practice the various runner administrative tasks that we have covered in this post.


Problem: I deployed the CloudFormation template, but no runner is listed in my repository.

Possible Cause: Errors have been encountered during cfn-init, causing runner registration to fail. Connect to your runner EC2 instance, and check /var/log/cfn-*.log files.

Cleaning up

To avoid incurring future charges, delete every resource provisioned in this demo by deleting the CloudFormation stack created in the “Deploy the Gitlab Runner stack” section.


This article demonstrated how to utilize IaC to efficiently conduct various administrative tasks associated with a Gitlab Runner. We deployed Gitlab Runner consistently and quickly across multiple accounts. We utilized IaC to enforce guardrails and best practices, such as tracking Gitlab Runner configuration changes, terminating the Gitlab Runner gracefully, and autoscaling the Gitlab Runner to ensure best performance and minimum cost. We walked through the deploying, updating, autoscaling, and terminating of the Gitlab Runner. We also saw how easy it was to clean up the entire Gitlab Runner architecture by simply deleting a CloudFormation stack.

About the authors

Sylvia Qi

Sylvia is a Senior DevOps Architect focusing on architecting and automating DevOps processes, helping customers through their DevOps transformation journey. In her spare time, she enjoys biking, swimming, yoga, and photography.

Sebastian Carreras

Sebastian is a Senior Cloud Application Architect with AWS Professional Services. He leverages his breadth of experience to deliver bespoke solutions to satisfy the visions of his customer. In his free time, he really enjoys doing laundry. Really.

Using Amazon Aurora Global Database for Low Latency without Application Changes

Post Syndicated from Roneel Kumar original https://aws.amazon.com/blogs/architecture/using-amazon-aurora-global-database-for-low-latency-without-application-changes/

Deploying global applications has many challenges, especially when accessing a database to build custom pages for end users. One example is an application using AWS Lambda@Edge. Two main challenges include performance and availability.

This blog explains how you can optimally deploy a global application with fast response times and without application changes.

The Amazon Aurora Global Database enables a single database cluster to span multiple AWS Regions by asynchronously replicating your data within subsecond timing. This provides fast, low-latency local reads in each Region. It also enables disaster recovery from Region-wide outages using multi-Region writer failover. These capabilities minimize the recovery time objective (RTO) of cluster failure, thus reducing data loss during failure. You will then be able to achieve your recovery point objective (RPO).

However, there are some implementation challenges. Most applications are designed to connect to a single hostname with atomic, consistent, isolated, and durable (ACID) consistency. But Global Aurora clusters provide reader hostname endpoints in each Region. In the primary Region, there are two endpoints, one for writes, and one for reads. To achieve strong data consistency, a global application requires the ability to:

  • Choose the optimal reader endpoints
  • Change writer endpoints on a database failover
  • Intelligently select the reader with the most up-to-date, freshest data

These capabilities typically require additional development.

The Heimdall Proxy coupled with Amazon Route 53 allows edge-based applications to access the Aurora Global Database seamlessly, without application changes. Features include automated Read/Write split with ACID compliance and edge results caching.

Figure 1. Heimdall Proxy architecture

The architecture in Figure 1 shows an Aurora Global Database with its primary Region in AP-SOUTHEAST-2, and secondary Regions in AP-SOUTH-1 and US-WEST-2. The Heimdall Proxy uses latency-based routing to determine the closest Reader Instance for read traffic, and redirects all write traffic to the Writer Instance. The Heimdall Configuration stores the Amazon Resource Name (ARN) of the global cluster. It automatically detects failover and cross-Region changes on the cluster, and directs traffic accordingly.

With an Aurora Global Database, there are two approaches to failover:

  • Managed planned failover. To relocate your primary database cluster to one of the secondary Regions in your Aurora global database, see Managed planned failovers with Amazon Aurora Global Database. With this feature, RPO is 0 (no data loss) and it synchronizes secondary DB clusters with the primary before making any other changes. RTO for this automated process is typically less than that of the manual failover.
  • Manual unplanned failover. To recover from an unplanned outage, you can manually perform a cross-Region failover to one of the secondaries in your Aurora Global Database. The RTO for this manual process depends on how quickly you can manually recover an Aurora global database from an unplanned outage. The RPO is typically measured in seconds, but this is dependent on the Aurora storage replication lag across the network at the time of the failure.

The Heimdall Proxy automatically detects Amazon Relational Database Service (RDS) / Amazon Aurora configuration changes based on the ARN of the Aurora Global cluster. Therefore, both managed planned and manual unplanned failovers are supported.

Solution benefits for global applications

Implementing the Heimdall Proxy has many benefits for global applications:

  1. An Aurora Global Database has a primary DB cluster in one Region and up to five secondary DB clusters in different Regions. But the Heimdall Proxy deployment does not have this limitation. This allows for a larger number of endpoints to be globally deployed. Combined with Amazon Route 53 latency-based routing, new connections have a shorter establishment time. They can use connection pooling to connect to the database, which reduces overall connection latency.
  2. SQL results are cached to the application for faster response times.
  3. The proxy intelligently routes non-cached queries. When safe to do so, the closest (lowest latency) reader will be used. When not safe to access the reader, the query will be routed to the global writer. Proxy nodes globally synchronize their state to ensure that volatile tables are locked to provide ACID compliance.
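
The routing rule in the last item can be illustrated with a toy example. This is not how the Heimdall Proxy is implemented; it is a deliberately naive sketch in which the endpoint names, the latency figures, and the substring-based table check are all assumptions.

```python
def route_query(sql, reader_latencies_ms, writer, dirty_tables=frozenset()):
    """Pick an endpoint: the closest reader for safe reads, the writer otherwise."""
    statement = sql.strip().lower()
    is_read = statement.startswith("select")
    # Naive check: a recently written table may not have replicated yet.
    touches_dirty = any(table.lower() in statement for table in dirty_tables)
    if is_read and not touches_dirty and reader_latencies_ms:
        return min(reader_latencies_ms, key=reader_latencies_ms.get)
    return writer

# Hypothetical endpoints with measured latencies in milliseconds.
readers = {"reader.ap-south-1": 12.0, "reader.us-west-2": 85.0}

print(route_query("SELECT * FROM orders", readers, "writer.ap-southeast-2"))
print(route_query("UPDATE orders SET paid = 1", readers, "writer.ap-southeast-2"))
print(route_query("SELECT * FROM orders", readers, "writer.ap-southeast-2",
                  dirty_tables={"orders"}))
```

A real proxy would parse the SQL properly and track replication lag per table; the sketch only shows why writes and reads of recently changed tables go to the writer.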

For more information on configuring the Heimdall Proxy and Amazon Route 53 for a global database, read the Heimdall Proxy for Aurora Global Database Solution Guide.

Download a free trial from the AWS Marketplace.


Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced ISV partner. They have AWS Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL, improving database scale. Deployment does not require code changes.

Monitor AWS resources created by Terraform in Amazon DevOps Guru using tfdevops

Post Syndicated from Harish Vaswani original https://aws.amazon.com/blogs/devops/monitor-aws-resources-created-by-terraform-in-amazon-devops-guru-using-tfdevops/

This post was written in collaboration with Kapil Thangavelu, CTO at Stacklet

Amazon DevOps Guru is a machine learning (ML) powered service that helps developers and operators automatically detect anomalies and improve application availability. DevOps Guru utilizes machine learning models, informed by years of Amazon.com and AWS operational excellence to identify anomalous application behavior (e.g., increased latency, error rates, resource constraints) and surface critical issues that could cause potential outages or service disruptions. DevOps Guru’s anomaly detectors can also proactively detect anomalous behavior even before it occurs, helping you address issues before they happen; insights provide recommendations to mitigate anomalous behavior.

When you enable DevOps Guru, you can configure its coverage to determine which AWS resources you want to analyze. As an option, you can define the coverage boundary by selecting specific AWS CloudFormation stacks. For each stack you choose, DevOps Guru analyzes operational data from the supported resources to detect anomalous behavior. See Working with AWS CloudFormation stacks in DevOps Guru for more details.

For Terraform users, Stacklet developed an open-source tool called tfdevops. It converts Terraform state to an importable CloudFormation stack, which allows DevOps Guru to start monitoring the encapsulated AWS resources. Note that tfdevops is not a tool to convert Terraform into CloudFormation. Instead, it creates a CloudFormation stack containing the imported resources that are specified in the Terraform module, and enables DevOps Guru to monitor the resources in that stack.

In this blog post, we will explain how you can configure and use tfdevops to easily enable DevOps Guru for your existing AWS resources created by Terraform.

Solution overview

tfdevops performs the following steps to import resources into Amazon DevOps Guru:

  • It translates Terraform state into an AWS CloudFormation template with a retain deletion policy
  • It creates an AWS CloudFormation stack with imported resources
  • It enrolls the stack into Amazon DevOps Guru
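To make the first step more concrete, here is a minimal Python sketch of what a "retain deletion policy" looks like in a CloudFormation template. It is not tfdevops's actual code: the function name `with_retain_policy` and the sample template are illustrative assumptions. `DeletionPolicy: Retain` is the standard CloudFormation attribute that keeps a resource in place when its stack is deleted.

```python
import copy
import json


def with_retain_policy(template: dict) -> dict:
    """Return a copy of a CloudFormation template in which every resource
    carries DeletionPolicy: Retain (hypothetical helper, not part of tfdevops)."""
    result = copy.deepcopy(template)
    for resource in result.get("Resources", {}).values():
        resource["DeletionPolicy"] = "Retain"
    return result


# Hypothetical template describing one already-existing SQS queue.
sample_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "SampleQueue": {
            "Type": "AWS::SQS::Queue",
            "Properties": {"QueueName": "sample_sqs_queue"},
        }
    },
}

retained = with_retain_policy(sample_template)
print(json.dumps(retained, indent=2))
```

With this policy in place, deleting the stack later removes only the CloudFormation bookkeeping; the queue that Terraform created stays where it is.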

For illustration purposes, we will use a sample serverless application that includes some of the components DevOps Guru and tfdevops support. This application consists of an Amazon Simple Queue Service (SQS) queue, and an AWS Lambda function that processes messages in the SQS queue. It also includes an Amazon DynamoDB table that the Lambda function uses to persist or to read data, and an Amazon Simple Notification Service (SNS) topic to which the Lambda function publishes the results of its processing. The following diagram depicts our sample application:

The architecture diagram shows a sample application containing an Amazon SQS queue, an AWS Lambda function, an Amazon SNS topic and an Amazon DynamoDB table.


Before getting started, make sure you have these prerequisites:


Follow these steps to monitor your AWS resources created with Terraform templates by using tfdevops:

  1. Install tfdevops following the instructions on GitHub
  2. Create a Terraform module with the resources supported by tfdevops
  3. Deploy the Terraform module to your AWS account to create the resources

Below is a sample Terraform module to create a sample AWS Lambda function, an Amazon DynamoDB table, an Amazon SNS topic and an Amazon SQS queue.

# IAM role for the lambda function
resource "aws_iam_role" "lambda_role" {
  name               = "iam_role_lambda_function"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

# IAM policy for logging from the lambda function
# NOTE: the action list was lost in this copy of the post; the three standard
# CloudWatch Logs actions below are an assumption.
resource "aws_iam_policy" "lambda_logging" {
  name        = "iam_policy_lambda_logging_function"
  path        = "/"
  description = "IAM policy for logging from a lambda"
  policy      = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    }
  ]
}
EOF
}

# Policy attachment for the role
resource "aws_iam_role_policy_attachment" "policy_attach" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = aws_iam_policy.lambda_logging.arn
}

# Generates an archive from the source
data "archive_file" "default" {
  type        = "zip"
  source_dir  = "${path.module}/src/"
  output_path = "${path.module}/myzip/python.zip"
}

# Create a lambda function
resource "aws_lambda_function" "basic_lambda_function" {
  filename      = "${path.module}/myzip/python.zip"
  function_name = "basic_lambda_function"
  role          = aws_iam_role.lambda_role.arn
  handler       = "index.lambda_handler"
  runtime       = "python3.8"
  depends_on    = [aws_iam_role_policy_attachment.policy_attach]
}

# Create a DynamoDB table
resource "aws_dynamodb_table" "sample_dynamodb_table" {
  name         = "sample_dynamodb_table"
  hash_key     = "sampleHashKey"
  billing_mode = "PAY_PER_REQUEST"

  attribute {
    name = "sampleHashKey"
    type = "S"
  }
}

# Create an SQS queue
resource "aws_sqs_queue" "sample_sqs_queue" {
  name = "sample_sqs_queue"
}

# Create an SNS topic
resource "aws_sns_topic" "sample_sns_topic" {
  name = "sample_sns_topic"
}

  4. Run tfdevops to convert the Terraform state to a CloudFormation template, deploy the stack, and enable DevOps Guru

The following command generates a CloudFormation template locally from a Terraform state file:

tfdevops cfn -d ~/path/to/terraform/module --template mycfn.json --resources importable-ids.json

The following command deploys the CloudFormation template, creates a CloudFormation stack, imports resources, and activates DevOps Guru on the stack:

tfdevops deploy --template mycfn.json --resources importable-ids.json
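If you want to script these two steps, the following Python sketch builds the same two command lines as argument lists. It uses only the subcommands and flags shown above; the helper names `cfn_command` and `deploy_command` and the module path are illustrative assumptions.

```python
import shlex
from typing import List


def cfn_command(module_dir: str, template: str, resources: str) -> List[str]:
    """Arguments for `tfdevops cfn`, which writes the CloudFormation template
    and the importable resource IDs to local files."""
    return [
        "tfdevops", "cfn",
        "-d", module_dir,
        "--template", template,
        "--resources", resources,
    ]


def deploy_command(template: str, resources: str) -> List[str]:
    """Arguments for `tfdevops deploy`, which creates the stack, imports the
    resources, and enables DevOps Guru on it."""
    return ["tfdevops", "deploy", "--template", template, "--resources", resources]


# Hypothetical module path; the file names match the commands above.
steps = [
    cfn_command("path/to/terraform/module", "mycfn.json", "importable-ids.json"),
    deploy_command("mycfn.json", "importable-ids.json"),
]
for step in steps:
    print(shlex.join(step))
```

To actually run a step, pass the list to `subprocess.run(step, check=True)`. Because no shell is involved, a leading `~` in a path is not expanded automatically; call `os.path.expanduser` on it first.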
  5. After tfdevops finishes the deployment, you can see the stack in the CloudFormation dashboard.

CloudFormation dashboard showing the stack, GuruStack, created by tfdevops

tfdevops imports the existing resources in the Terraform module into AWS CloudFormation. Note that these are not new resources, so importing them has no additional cost implications for the resources themselves. See Bringing existing resources into CloudFormation management to learn more about importing resources into CloudFormation.

Resources view for GuruStack listing the imported resources in GuruStack

  6. Your stack also appears in the DevOps Guru dashboard, indicating that DevOps Guru is monitoring your resources and will alert you if it detects anomalous behavior. Insights are correlated sequences of events and trails, grouped together to provide you with prescriptive guidance and recommendations to root-cause and resolve issues more quickly. See Working with insights in DevOps Guru to learn more about DevOps Guru insights.

Amazon DevOps Guru Dashboard displays the system health summary and system health overview of each CloudFormation stack. GuruStack is marked as healthy with 0 reactive insights and 0 proactive insights.

Note that when you use the tfdevops tool, it automatically enables DevOps Guru on the imported stack.

Amazon DevOps Guru Analyze resources displays the analysis coverage option selected. GuruStack is the selected stack for analysis

  7. Clean up – delete the stack

CloudFormation Stacks menu showing GuruStack as selected. The stack can be deleted by pressing the Delete button.


This blog post demonstrated how to enable DevOps Guru to monitor your AWS resources created by Terraform. Using Stacklet’s tfdevops tool, you can create a CloudFormation stack from your Terraform state, and use that to define the coverage boundary for DevOps Guru. With that, if your resources have unexpected or unusual behavior, DevOps Guru will notify you and provide prescriptive recommendations to help you quickly fix the issue.

If you want to experiment with DevOps Guru, AWS offers a free tier for the first three months that includes 7,200 AWS resource hours per month for each of resource groups A and B. You can also Estimate Amazon DevOps Guru resource analysis costs from the AWS Management Console. This feature scans selected resources to automatically generate a monthly cost estimate. Furthermore, refer to Gaining operational insights with AIOps using Amazon DevOps Guru to learn more about how DevOps Guru helps you increase your applications’ availability, and check out this workshop for a hands-on walkthrough of DevOps Guru’s main features and capabilities. To learn more about proactive insights, see Generating DevOps Guru Proactive Insights for Amazon ECS. To learn more about anomaly detection, see Anomaly Detection in AWS Lambda using Amazon DevOps Guru’s ML-powered insights.

About the authors

Harish Vaswani

Harish Vaswani is a Senior Cloud Application Architect at Amazon Web Services. He specializes in architecting and building cloud native applications and enables customers with best practices in their cloud journey. He is a DevOps and Machine Learning enthusiast. Harish lives in New Jersey and enjoys spending time with his family, filmmaking and music production.

Rafael Ramos

Rafael is a Solutions Architect at AWS, where he helps ISVs on their journey to the cloud. He spent over 13 years working as a software developer, and is passionate about DevOps and serverless. Outside of work, he enjoys playing tabletop RPGs, cooking and running marathons.

Define application boundary using AWS resources tags in Amazon DevOps Guru

Post Syndicated from Suneel Joshi original https://aws.amazon.com/blogs/devops/define-application-boundary-using-aws-resources-tags-in-amazon-devops-guru/

Amazon DevOps Guru is an ML powered service that makes it easy to improve an application’s operational performance and availability. By analyzing application metrics, logs, events and traces, DevOps Guru identifies behaviors that deviate from normal operating patterns and creates insights that you can use to improve your application.

At re:Invent 2021, we announced a new tagging feature in DevOps Guru. This feature allows you to organize resources into logical applications, using AWS resources tags so that you can have more control over how applications are defined. Well-defined applications enable DevOps Guru to group related anomalies together to better identify problems and to provide more meaningful recommendations. A tag is a label consisting of a user-defined key and a value. Previously, the coverage boundary consisted of an entire AWS account or specific resources defined by AWS CloudFormation stacks.

Getting Started

Define Resources to analyze using AWS resources tags

An AWS resource tag is a label that consists of a key and a value. A key-value pair can create useful grouping of resources into different applications. For DevOps Guru, you specify one tag key across all your applications. Resources with the same tag value are grouped together into a logical application. The tag key needs to be prefixed with the string “devops-guru-”. Note that the prefix string is not case sensitive. The tag value can be any value you define. The next section describes how you can use tag values to define coverage boundary for your applications.
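As a quick illustration of the naming rule, the following Python sketch checks whether a tag key carries the required prefix, ignoring case for the prefix as described above. The function name `is_devops_guru_tag_key` is an illustrative assumption, not part of any AWS SDK.

```python
REQUIRED_PREFIX = "devops-guru-"


def is_devops_guru_tag_key(key: str) -> bool:
    """True if the tag key starts with the "devops-guru-" prefix.
    The prefix comparison is case-insensitive."""
    return key.lower().startswith(REQUIRED_PREFIX)


for candidate in ["devops-guru-applications", "DevOps-Guru-Applications", "applications"]:
    print(candidate, is_devops_guru_tag_key(candidate))
```

A check like this can run in a deployment pipeline to catch mistyped keys before the resources are created.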

You can add tags to your resources using the AWS service to which each resource belongs, or use the Tag Editor. To manage tags using your resource’s service, you can use the console, AWS CLI or SDK of the service.

Define Application boundary using AWS resources tag values

For DevOps Guru, we define an application as a group of instantiated AWS resources (Amazon EC2, AWS Lambda, Amazon RDS, etc.) that your workload is running on. You assign the same tag value to all resources that make up your application. DevOps Guru will analyze each resource separately, and will also look at metrics and events across all resources in your application to detect anomalies and generate insights. For example, see the diagram below.

You can have one tag key across all your applications. For each application, assign a different tag value. All resources that make up an application should have the same tag value.

App 1 consists of two different resources for a database application: an EC2 instance and a database instance. I assign the same tag value of RDS to both of these resources. I have another serverless application in App 2, which has a Lambda function and a DynamoDB table. I assign a different tag value of serverless-app-1 to both of the App 2 resources.
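The grouping described here can be sketched in a few lines of Python. This only illustrates the idea of one tag key with one value per application; it is not how DevOps Guru is implemented, and the resource names and the `group_into_applications` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

TAG_KEY = "devops-guru-applications"

# Hypothetical inventory mirroring App 1 and App 2 from the example.
resources = [
    {"name": "app-server-ec2", "tags": {TAG_KEY: "RDS"}},
    {"name": "aurora-db-instance", "tags": {TAG_KEY: "RDS"}},
    {"name": "orders-lambda", "tags": {TAG_KEY: "serverless-app-1"}},
    {"name": "orders-dynamodb-table", "tags": {TAG_KEY: "serverless-app-1"}},
    {"name": "untagged-bucket", "tags": {}},
]


def group_into_applications(items: List[dict], tag_key: str) -> Dict[str, List[str]]:
    """Group resource names by the value of the given tag key.
    Resources without the tag are left out of every application."""
    applications: Dict[str, List[str]] = defaultdict(list)
    for item in items:
        value = item["tags"].get(tag_key)
        if value is not None:
            applications[value].append(item["name"])
    return dict(applications)


print(group_into_applications(resources, TAG_KEY))
```

Resources that share a tag value end up in the same logical application, which is the unit that anomalies are grouped by.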

Example Test Scenario

I am going to create a test scenario with an application server running in an EC2 instance. The application server is connected to an Aurora MySQL-Compatible database instance. I will instrument my application to introduce a misbehaving SQL query to create a performance anomaly.

In my example below, I tagged my EC2 instance and database instance with the tag value of RDS. I am interested in detecting performance issues in my database instance, and I want DevOps Guru to provide recommendations to fix those issues.

Manage DevOpsGuru Analysis Coverage

Next, I define the coverage boundary in the DevOps Guru console. Under Settings in the navigation pane, I select Analyzed resources and choose Edit.

To define coverage boundary in the DevOps Guru Console, select the “Settings” option in navigation pane, select “Analyzed resources“ and choose Edit.

Next, I select “devops-guru-applications” as the tag key from the dropdown menu. I select RDS as the tag value, since I am interested in looking at performance issues in my Amazon Aurora database instance.

In the “Edit analyzed resources” screen, choose your tag key in the “Tag key” dropdown menu. Next, press the radio button for choosing specific tag values and then select the tag values to define the coverage boundary of your application.

Filter insights by tags

Next, I create my test scenario. Once DevOps Guru generates an insight, I am able to filter the insights by tag key or tag values. To display insights for my database instance, I select “Affected applications” from the search menu bar on the insights page as shown below:

Insights generated by DevOps Guru can be filtered based on the tag values. Select “Affected applications” in the search menu bar in insights page.

Next, I select “Affected applications” as RDS in the above dropdown menu. Below is the Insight overview screen that gets displayed.

The metrics section of the Insights page provides a summary of the Anomaly detected. It also displays a graphical representation of the time window when the anomaly was detected as well as the detailed metric the anomaly was based upon.

The insights generated by DevOps Guru for my Amazon Aurora instances are enabled by Amazon DevOps Guru for RDS, a new feature we announced at re:Invent 2021. It allows developers to easily detect, diagnose, and resolve performance and operational issues in Amazon Aurora. For more information on Amazon DevOps Guru for RDS, see a related news blog written by my colleague, Marcia Villalba.

The insight summary indicates that there is high DB load, ten times above baseline. DevOps Guru for RDS uses anomaly detection on the database load (DB load) performance metric to detect issues. DB load is measured in units of Average Active Sessions (AAS). DB load measures the level of activity in your database, making it a great metric to understand the health of your database.
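To make the "ten times above baseline" reading concrete, here is a small Python sketch. It assumes AAS is the average of periodically sampled active-session counts, and it compares a recent window against a baseline window with a plain ratio. DevOps Guru for RDS uses ML-based anomaly detection, so this fixed-ratio comparison is only an illustration, not the service's algorithm; the sample numbers are made up.

```python
from typing import Sequence


def average_active_sessions(samples: Sequence[int]) -> float:
    """Average of sampled active-session counts over a time window."""
    return sum(samples) / len(samples)


def load_ratio(current: Sequence[int], baseline: Sequence[int]) -> float:
    """How many times higher the current DB load is than the baseline."""
    return average_active_sessions(current) / average_active_sessions(baseline)


# Hypothetical samples: a quiet baseline and a window with a misbehaving query.
baseline_samples = [2, 1, 3, 2]
current_samples = [18, 22, 20, 20]

print(average_active_sessions(baseline_samples))      # 2.0
print(average_active_sessions(current_samples))       # 20.0
print(load_ratio(current_samples, baseline_samples))  # 10.0
```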

If you continue scrolling on the DevOps Guru for RDS analysis page, you can discover the cause for the problem and some recommendations to fix it. DevOps Guru for RDS detected there was a high load of wait events, and one SQL query was found to require further investigation. You can even see the exact SQL query if you click on the SQL digest IDs. The insight’s analysis and recommendation section is full of information on how to investigate further and fix the issue.

The easy-to-understand recommendations made by DevOps Guru for RDS mean that, as a DevOps engineer, you do not need to rely on a database administrator (DBA) or use any third-party tools.

DevOps Guru for RDS provides specific recommendations to fix the performance issues detected. In this example, specific wait events contributing to high DB load were identified, and a specific SQL query ID was identified as a major contributor.


AWS resources tags give you one more way to specify the resource analysis coverage boundary, in addition to the existing methods of an entire AWS account or specific AWS CloudFormation stacks. AWS tags allow you to better isolate the applications you want DevOps Guru to analyze. In this post, we used AWS tags to define the coverage boundary for a database application. We reduced unrelated and unnecessary resource coverage from our analysis, thereby controlling our resource analysis costs. Visit the DevOps Guru documentation to learn more about how to use tags to identify resources in your DevOps Guru applications.

About the author

Suneel Joshi

Suneel is an Enterprise Support Lead at Amazon Web Services. He provides advocacy and guidance to customers in their cloud journey as they plan and build cloud solutions. He is a DevOps and Machine Learning enthusiast. Among other things, he helps customers build intelligence in their applications using AI services.

Automate Container Anomaly Monitoring of Amazon Elastic Kubernetes Service Clusters with Amazon DevOps Guru

Post Syndicated from Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/automate-container-anomaly-monitoring-of-amazon-elastic-kubernetes-service-clusters-with-amazon-devops-guru/

Observability in a container-centric environment presents new challenges for operators due to the increasing number of abstractions and supporting infrastructure. In many cases, organizations can have hundreds of clusters and thousands of services/tasks/pods running concurrently. This post will demonstrate new features in Amazon DevOps Guru that help simplify and expand the capabilities of the operator. The features include grouping anomalies by metric and container cluster to improve context and simplify access, and support for additional Amazon CloudWatch Container Insights metrics. As an example of these capabilities in action, Amazon DevOps Guru can now identify anomalies in CPU, memory, or networking within Amazon Elastic Kubernetes Service (EKS), notifying the operators and letting them more easily navigate to the affected cluster to examine the collected data.

Amazon DevOps Guru offers a fully managed AIOps platform service that lets developers and operators improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. Its ML models take advantage of the expertise of AWS in operating highly available applications for the world’s largest ecommerce business for over 20 years. DevOps Guru automatically detects operational issues, predicts impending resource exhaustion, details likely causes, and recommends remediation actions.

Solution Overview

In this post, we will demonstrate the new Amazon DevOps Guru features around cluster grouping and additionally supported Amazon EKS metrics. To demonstrate these features, we will show you how to create a Kubernetes cluster, instrument the cluster using AWS Distro for OpenTelemetry, and then configure Amazon DevOps Guru to automate anomaly detection of EKS metrics. A previous blog provides detail on the AWS Distro for OpenTelemetry collector that is employed here.


EKS Cluster Creation

We employ the eksctl CLI tool to create an Amazon EKS cluster. Using eksctl, you can provide details on the command line or specify a manifest file. The following manifest is used to create a single managed node using Amazon Elastic Compute Cloud (EC2), and this will be created and constrained to the specified Region via entry metadata/region and Availability Zones via the managedNodeGroups/availabilityZones entry. By default, this will create a new VPC with eight subnets.

# An example of ClusterConfig object using Managed Nodes
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: devopsguru-eks-cluster
  region: <SPECIFY_REGION_HERE>
  version: "1.21"

availabilityZones: ["<FIRST_AZ>","<SECOND_AZ>"]
managedNodeGroups:
  - name: managed-ng-private
    privateNetworking: true
    instanceType: t3.medium
    minSize: 1
    desiredCapacity: 1
    maxSize: 6
    availabilityZones: ["<SPECIFY_AVAILABILITY_ZONE(S)_HERE>"]
    volumeSize: 20
    labels: {role: worker}
    tags:
      nodegroup-role: worker
cloudWatch:
  clusterLogging:
    enableTypes:
      - "api"
  • To create an Amazon EKS cluster using eksctl and a manifest file, we use eksctl create as shown below. Note that this step will take 10 – 15 minutes to establish the cluster.
$ eksctl create cluster -f devopsguru-managed-node.yaml
2021-10-13 10:44:53 [i] eksctl version 0.69.0
2021-10-13 11:04:42 [✔] all EKS cluster resources for "devopsguru-eks-cluster" have been created
2021-10-13 11:04:44 [i] nodegroup "managed-ng-private" has 1 node(s)
2021-10-13 11:04:44 [i] node "<ip>.<region>.compute.internal" is ready
2021-10-13 11:04:44 [i] waiting for at least 1 node(s) to become ready in "managed-ng-private"
2021-10-13 11:04:44 [i] nodegroup "managed-ng-private" has 1 node(s)
2021-10-13 11:04:44 [i] node "<ip>.<region>.compute.internal" is ready
2021-10-13 11:04:47 [i] kubectl command should work with "/Users/<user>/.kube/config"
  • Once this is complete, you can use kubectl, the Kubernetes CLI, to access the managed nodes that are running.
$ kubectl get nodes
<ip>.<region>.compute.internal Ready <none> 76m v1.21.4-eks-033ce7e

AWS Distro for OpenTelemetry Collector Installation

We will use AWS Distro for OpenTelemetry Collector to extract metrics from a pod running in Amazon EKS. This will collect metrics within the Kubernetes cluster and surface them to Amazon CloudWatch. We start by defining a policy to allow access. The following information comes from the post here.

Attach the CloudWatchAgentServerPolicy IAM Policy to worker node

  • Open the Amazon EC2 console.
  • Select one of the worker node instances, and choose the IAM role in the description.
  • On the IAM role page, choose Attach policies.
  • In the list of policies, select the check box next to CloudWatchAgentServerPolicy. You can use the search box to find this policy.
  • Choose Attach policies.

Deploy AWS OpenTelemetry Collector on Amazon EKS

Next, you will deploy the AWS Distro for OpenTelemetry using a GitHub hosted manifest.

  • Deploy the artifact to the Amazon EKS cluster using the following command:
$ curl https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml | kubectl apply -f -
  • View the resources in the aws-otel-eks namespace.
$ kubectl get pods -l name=aws-otel-eks-ci -n aws-otel-eks
aws-otel-eks-ci-jdf2w 1/1 Running 0 107m

View Container Insight Metrics in Amazon CloudWatch

Access Amazon CloudWatch and select Metrics, All metrics to view the published metrics. Under Custom Namespaces, ContainerInsights is selectable. Under this, one can view metrics at the cluster, node, pod, namespace, and service granularity. The following example shows pod level metrics of CPU:

The AWS Console with Amazon CloudWatch Container Insights Pod Level CPU Utilization.

Amazon Simple Notification Service

Amazon DevOps Guru needs permission to publish to Amazon SNS so that it can send notifications about insights. During the setup process, an Amazon SNS Topic is created, and the following resource policy is applied:

{
    "Sid": "DevOpsGuru-added-SNS-topic-permissions",
    "Effect": "Allow",
    "Principal": {
        "Service": "region-id.devops-guru.amazonaws.com"
    },
    "Action": "sns:Publish",
    "Resource": "arn:aws:sns:region-id:topic-owner-account-id:my-topic-name",
    "Condition" : {
      "StringEquals" : {
        "AWS:SourceArn": "arn:aws:devops-guru:region-id:topic-owner-account-id:channel/devops-guru-channel-id",
        "AWS:SourceAccount": "topic-owner-account-id"
      }
    }
}

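If you manage the topic policy yourself, the statement above can be generated from its four placeholders. The following Python sketch does that with the standard library only; the helper name `devops_guru_sns_statement` and the sample account and channel values are illustrative assumptions.

```python
import json


def devops_guru_sns_statement(region: str, account_id: str,
                              topic_name: str, channel_id: str) -> dict:
    """Build the SNS resource-policy statement shown above by filling in
    the region, account, topic, and channel placeholders."""
    return {
        "Sid": "DevOpsGuru-added-SNS-topic-permissions",
        "Effect": "Allow",
        "Principal": {"Service": f"{region}.devops-guru.amazonaws.com"},
        "Action": "sns:Publish",
        "Resource": f"arn:aws:sns:{region}:{account_id}:{topic_name}",
        "Condition": {
            "StringEquals": {
                "AWS:SourceArn": (
                    f"arn:aws:devops-guru:{region}:{account_id}:channel/{channel_id}"
                ),
                "AWS:SourceAccount": account_id,
            }
        },
    }


# Hypothetical values for illustration only.
statement = devops_guru_sns_statement(
    "us-east-1", "111122223333", "my-topic-name", "devops-guru-channel-id"
)
print(json.dumps(statement, indent=2))
```

The two condition keys restrict publishing to the DevOps Guru notification channel in your own account, so another account's DevOps Guru cannot publish to the topic.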
Amazon DevOps Guru

Amazon DevOps Guru can now be leveraged to monitor the Amazon EKS cluster and Managed Node Group. Select Amazon DevOps Guru, and select Get started as shown in the following figure to do this.

The Amazon DevOps Guru service via the AWS Console.

Once selected, the Get started console displays, letting you specify the IAM role for DevOps Guru to access the appropriate resources.

The Get started dialog for Amazon DevOps Guru including instructions on how the service operates, IAM Role Permissions and Amazon DevOps Guru analysis coverage.

Under the Amazon DevOps Guru analysis coverage, Choose later is selected. This will let us specify the CloudFormation stacks to monitor. Select Create a new SNS topic, and provide a name. This will be used to collect notifications and allow for subscribers to then be notified. Select Enable when complete.

The Amazon DevOps Guru analysis coverage allowing the user to select all resources in a region or to choose later. In addition the image shows the dialog that requests the user specify an Amazon SNS topic for notification when insights occur.

On the Manage DevOps Guru analysis coverage, select Analyze all AWS resources in the specified CloudFormation stacks in this Region. Then, select the cluster and managed node group AWS CloudFormation stacks so that DevOps Guru can monitor Amazon EKS.

A dialog where the user is able to specify the AWS CloudFormation stacks in a Region for analysis coverage. Two stacks are selected: the EKS cluster and the EKS cluster managed node group.

Once this is selected, the display will update indicating that two CloudFormation stacks were added.

Amazon DevOps Guru Settings including DevOps Guru analysis coverage and Amazon SNS notifications.

Amazon DevOps Guru will then start analyzing those two stacks. It takes several hours to collect data and to identify normal operating conditions. Once this process is complete, the Dashboard will display that those resources have been analyzed, as shown in the following figure.

The completed analysis by DevOps Guru of the two AWS CloudFormation stacks, indicating a healthy status for both.

Enable Encryption on Amazon SNS Topic

The Amazon SNS Topic created by Amazon DevOps Guru is not encrypted by default. It is important to enable encryption so that notifications are encrypted at rest. Go to Amazon SNS, select the topic that was created, and then choose Edit topic. Open the Encryption dialog box and enable encryption as shown in the following figure, specifying an alias, or accepting the default.

The Encryption dialog for Amazon SNS topic when it is Edited.

Deploy Sample Application on Amazon EKS To Trigger Insights

You will employ a sample application that is part of the AWS Distro for OpenTelemetry Collector to simulate failure. Using the following manifest, you will deploy a sample application that has pod resource limits for memory and CPU shares. These limits are artificially low and insufficient for the pod to run. The pod will exceed memory and will be identified for eviction by Amazon EKS. When it is evicted, it will attempt to be redeployed per the manifest requirement for a replica of one. In turn, this will repeat the process and generate memory and pod restart errors in Amazon CloudWatch. For this example, the deployment was left for over an hour, thereby causing the pod failure to repeat numerous times. The following is the manifest that you will create on the filesystem.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: java-sample-app
  namespace: aws-otel-eks
  labels:
    name: java-sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      name: java-sample-app
  template:
    metadata:
      labels:
        name: java-sample-app
    spec:
      containers:
        - name: aws-otel-emitter
          image: aottestbed/aws-otel-collector-java-sample-app:0.9.0
          resources:
            limits:
              memory: "128Mi"
              cpu: "200m"
          ports:
          - containerPort: 4567
          env:
          - name: OTEL_OTLP_ENDPOINT
            value: "localhost:4317"
          - name: OTEL_RESOURCE_ATTRIBUTES
            value: "service.namespace=AWSObservability,service.name=CloudWatchEKSService"
          - name: S3_REGION
            value: "us-east-1"
          imagePullPolicy: Always

To deploy the application, use the following command:

$ kubectl apply -f <manifest file name>
deployment.apps/java-sample-app created

Scenario: Improved context from DevOps Guru Container Cluster Grouping and Increased Metrics

For our scenario, Amazon DevOps Guru is monitoring additional Amazon CloudWatch Container Insight Metrics for EKS. The following figure shows the flow of information and eventual notification of the operator, so that they can examine the Amazon DevOps Guru Insight. Starting at step 1, the container agent (AWS Distro for OpenTelemetry) forwards container metrics to Amazon CloudWatch. In step 2, Amazon DevOps Guru is continually consuming those metrics and performing anomaly detection. If an anomaly is detected, then this generates an Insight, thereby triggering Amazon SNS notification as shown in step 3. In step 4, the operators access Amazon DevOps Guru console to examine the insight. Then, the operators can leverage the new user interface capability displaying which cluster, namespace, and pod/service is impacted along with correlated Amazon EKS metric(s).

 The flow of information from the container agent (AWS Distro for OpenTelemetry) through Amazon CloudWatch, Amazon DevOps Guru, and Amazon SNS to the operator.

New EKS Container Metrics in DevOps Guru

As part of the release, the following pod and node metrics are now tracked by DevOps Guru:

  • pod_number_of_container_restarts – number of times that a pod is restarted (e.g., image pull issues, container failure).
  • pod_memory_utilization_over_pod_limit – memory that exceeds the pod limit called out in resource memory limits.
  • pod_cpu_utilization_over_pod_limit – CPU shares that exceed the pod limit called out in resource CPU limits.
  • pod_cpu_utilization – percent CPU Utilization within an active pod.
  • pod_memory_utilization – percent memory utilization within an active pod.
  • node_network_total_bytes – total bytes over the network interface for the managed node (e.g., EC2 instance)
  • node_filesystem_utilization – percent file system utilization for the managed node (e.g., EC2 instance).
  • node_cpu_utilization – percent CPU Utilization within a managed node (e.g., EC2 instance).
  • node_memory_utilization – percent memory utilization within a managed node (e.g., EC2 instance).
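To relate the two over-limit metrics to the limits in the sample manifest ("128Mi" of memory and "200m" of CPU), here is a small Python sketch. It assumes the over-limit metrics are expressed as usage divided by the configured limit, in percent, and it parses only the Kubernetes quantity suffixes needed here. The helper names and the usage figure are hypothetical.

```python
# Binary suffixes used by Kubernetes memory quantities (subset).
MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" to bytes."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "200m" (millicores) to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def utilization_over_limit(used: float, limit: float) -> float:
    """Usage as a percentage of the configured limit."""
    return 100.0 * used / limit


memory_limit = parse_memory("128Mi")   # 134217728 bytes
cpu_limit = parse_cpu("200m")          # 0.2 cores
memory_used = parse_memory("120Mi")    # hypothetical usage just before eviction

print(memory_limit, cpu_limit)
print(utilization_over_limit(memory_used, memory_limit))  # 93.75
```

A Java process with a 128Mi limit sits close to 100 percent almost immediately, which is why the sample pod keeps getting restarted.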

Operator Scenario

The Kubernetes Operator in the following figure is informed of an insight via Amazon SNS. The Amazon SNS message content appears in the following code, showing the originator and information identifying the InsightDescription, InsightSeverity, name of the container metric, and the Pod / EKS Cluster:

{
  "AccountId": "XXXXXXX",
  "Region": "<REGION>",
  "MessageType": "NEW_INSIGHT",
  "InsightId": "ADFl69Pwq1Aa6M373DhU0zkAAAAAAAAABuZzSBHxeiNexxnLYD7Lhb0vuwY9hLtz",
  "InsightUrl": "https://<REGION>.console.aws.amazon.com/devops-guru/#/insight/reactive/ADFl69Pwq1Aa6M373DhU0zkAAAAAAAAABuZzSBHxeiNexxnLYD7Lhb0vuwY9hLtz",
  "InsightType": "REACTIVE",
  "InsightDescription": "ContainerInsights pod_number_of_container_restarts Anomalous In Stack eksctl-devopsguru-eks-cluster-cluster",
  "InsightSeverity": "high",
  "StartTime": 1636147920000,
  "Anomalies": [
    {
      "Id": "ALAGy5sIITl9e6i66eo6rKQAAAF88gInwEVT2WRSTV5wSTP8KWDzeCYALulFupOQ",
      "StartTime": 1636147800000,
      "SourceDetails": [
        {
          "DataSource": "CW_METRICS",
          "DataIdentifiers": {
            "name": "pod_number_of_container_restarts",
            "namespace": "ContainerInsights",
            "period": "60",
            "stat": "Average",
            "unit": "None",
            "dimensions": "{\"PodName\":\"java-sample-app\",\"ClusterName\":\"devopsguru-eks-cluster\",\"Namespace\":\"aws-otel-eks\"}"
          }
        }
      ]
    }
  ],
  "awsInsightSource": "aws.devopsguru"
}

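A subscriber to the topic could pull the interesting fields out of such a message with a few lines of Python. The sketch below works on a trimmed copy of the message above and assumes the nesting shown there (Anomalies, then SourceDetails, then DataIdentifiers). Note that the dimensions field is itself a JSON-encoded string and has to be decoded a second time. The function name `summarize_insight` is an illustrative assumption.

```python
import json

# Trimmed copy of the notification above (raw string keeps the \" escapes).
raw_message = r"""
{
  "MessageType": "NEW_INSIGHT",
  "InsightType": "REACTIVE",
  "InsightSeverity": "high",
  "Anomalies": [
    {
      "SourceDetails": [
        {
          "DataSource": "CW_METRICS",
          "DataIdentifiers": {
            "name": "pod_number_of_container_restarts",
            "namespace": "ContainerInsights",
            "dimensions": "{\"PodName\":\"java-sample-app\",\"ClusterName\":\"devopsguru-eks-cluster\",\"Namespace\":\"aws-otel-eks\"}"
          }
        }
      ]
    }
  ]
}
"""


def summarize_insight(message: dict) -> list:
    """Return one row per anomalous metric with its severity and location."""
    rows = []
    for anomaly in message.get("Anomalies", []):
        for detail in anomaly.get("SourceDetails", []):
            identifiers = detail["DataIdentifiers"]
            # "dimensions" is a JSON document stored as a string.
            dimensions = json.loads(identifiers["dimensions"])
            rows.append({
                "severity": message["InsightSeverity"],
                "metric": identifiers["name"],
                "cluster": dimensions.get("ClusterName"),
                "namespace": dimensions.get("Namespace"),
                "pod": dimensions.get("PodName"),
            })
    return rows


rows = summarize_insight(json.loads(raw_message))
for row in rows:
    print(row)
```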
Amazon DevOps Guru Console collects the insights under the Insights selection as shown in the following figure. Select Insights to view the details.

Amazon DevOps Guru Insights. An insight is displayed with a status of Ongoing and Severity of High.

Aggregated Metrics identifies the EKS container metrics that are anomalous, in this case pod_memory_utilization_over_pod_limit and pod_number_of_container_restarts.

Aggregated Metrics panel with pod_memory_utilization_over_pod_limit and pod_number_of_container_restarts for the Amazon EKS cluster named devopsguru-eks-cluster. A timeline with time and date conveys the length of the anomaly.

Further details can be identified by selecting and expanding each insight as shown in the following figure.

Displays the ability to expand the cluster metrics providing further information on the PodName, Namespace and ClusterName. Furthermore, a search bar is provided to search on name, stack or service name.

Note that the display provides information around the Cluster, PodName, and Namespace. This helps operators maintaining large numbers of EKS Clusters to quickly isolate the offending Pod, its operating Namespace, and EKS Cluster to which it belongs. A search bar provides further filtering to isolate the name, stack, or service name displayed.

Cleaning Up

Follow these steps to delete the resources and avoid additional charges to your account.

Amazon EKS Cluster Cleanup

Follow these steps to detach the customer managed policy and delete the cluster.

  • Detach customer managed policy, AWSDistroOpenTelemetryPolicy, via IAM Console.
  • Delete cluster using eksctl.
$ eksctl delete cluster devopsguru-eks-cluster --region <region>
2021-10-13 14:08:28 [i] eksctl version 0.69.0
2021-10-13 14:08:28 [i] using region <region>
2021-10-13 14:08:28 [i] deleting EKS cluster "devopsguru-eks-cluster"
2021-10-13 14:08:30 [i] will drain 0 unmanaged nodegroup(s) in cluster "devopsguru-eks-cluster"
2021-10-13 14:08:32 [i] deleted 0 Fargate profile(s)
2021-10-13 14:08:33 [✔] kubeconfig has been updated
2021-10-13 14:08:33 [i] cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress
2021-10-13 14:09:02 [i] 2 sequential tasks: { delete nodegroup "managed-ng-private", delete cluster control plane "devopsguru-eks-cluster" [async] }
2021-10-13 14:09:02 [i] will delete stack "eksctl-devopsguru-eks-cluster-nodegroup-managed-ng-private"
2021-10-13 14:09:02 [i] waiting for stack "eksctl-devopsguru-eks-cluster-nodegroup-managed-ng-private" to get deleted
2021-10-13 14:12:30 [i] will delete stack "eksctl-devopsguru-eks-cluster-cluster"
2021-10-13 14:12:30 [✔] all cluster resources were deleted

Conclusion

The previous scenarios demonstrated the new cluster organization and the additional container metrics. Both features make it easier for an operator to identify issues within a container cluster when Amazon DevOps Guru detects anomalies. You can start building your own solutions that employ Amazon CloudWatch Agent / AWS Distro for OpenTelemetry Agent and Amazon DevOps Guru by reading the documentation, which provides a conceptual overview and practical examples to help you understand the features provided by Amazon DevOps Guru and how to use them.

About the authors

Rahul Sharad Gaikwad

Rahul Sharad Gaikwad is a Lead Consultant – DevOps with AWS. He helps customers and partners on their Cloud and DevOps adoption journey. He is passionate about technology and enjoys collaborating with customers. In his spare time, he focuses on his PhD Research work. He also enjoys gymming and spending time with his family.

Leo Da Silva

Leo Da Silva is a Partner Solution Architect Manager at AWS and uses his knowledge to help customers better utilize cloud services and technologies. Over the years, he has had the opportunity to work in large, complex environments, designing, architecting, and implementing highly scalable and secure solutions for global companies. He is passionate about football, BBQ, and Jiu Jitsu, the Brazilian version of them all.

Chris Riley

Chris Riley is a Senior Solutions Architect working with Strategic Accounts providing support in Industry segments including Healthcare, Financial Services, Public Sector, Automotive and Manufacturing via Managed AI/ML Services, IoT and Serverless Services.

Kubernetes Guardrails: Bringing DevOps and Security Together on Cloud

Post Syndicated from Alon Berger original https://blog.rapid7.com/2021/12/06/kubernetes-guardrails-bringing-devops-and-security-together-on-cloud/

Cloud and container technologies are being increasingly embraced by organizations around the globe because of the efficiency, superior visibility, and control they provide to DevOps and IT teams.

While DevOps teams see the benefits of cloud and container solutions, these tools create a learning curve for their security colleagues. Because of this, security teams often want to slow down adoption while they figure out a strategy for maintaining security and compliance in these new fast-moving environments.

Container and Kubernetes (K8s) environments are already fairly complex, and layering multiple additional security tools into the mix makes them even more challenging to manage. Organizations need to find a way to enable their DevOps teams to move quickly and take advantage of the benefits of containers and K8s, while staying within the parameters the security team needs to maintain compliance with organizational policy.

This challenge goes beyond technology. These teams need to find a solution that allows them to work together well, doesn’t over-complicate their working relationship, and lets both sides get what they want with minimal overhead.

A holistic approach to Kubernetes security

As an open-source container orchestration system for automating deployment, scaling, and management of containerized applications, Kubernetes is extremely powerful. However, organizations must carefully balance their eagerness to embrace the dynamic, self-service nature of Kubernetes with the real-life need to manage and mitigate security and compliance risk.

Rapid7’s recent introduction of InsightCloudSec intelligently unifies both CSPM and CWPP functionalities, thus enabling a holistic approach for protecting valuable assets in the cloud — one that includes Kubernetes and workload security.

Learn more about InsightCloudSec here

Built for DevOps, trusted by security

In retrospect, 2020 was a tipping point for the Kubernetes community, with a massive increase in adoption across the globe. Many companies, seeking an efficient, cost-effective way to make this huge shift to the cloud, turned to Kubernetes. But this in turn created a growing need to remove Kubernetes security blind spots. For this reason, we’ve introduced Kubernetes Guardrails.

With Kubernetes Security Guardrails, organizations are equipped with a multi-cluster vulnerability scanner that covers rich Kubernetes security best practices and compliance policies, such as CIS Benchmarks. As part of Rapid7’s InsightCloudSec solution, this new capability introduces a platform-based and easy-to-maintain solution for Kubernetes security that is deployed in minutes and is fully streamlined in the Kubernetes pipeline.

Securing Kubernetes with InsightCloudSec

Kubernetes Security Guardrails is the most comprehensive solution for all relevant Kubernetes security requirements, designed from a DevOps perspective with in-depth visibility for security teams.

InsightCloudSec is designed to be an agentless state machine, seamlessly applied to any computing environment — public cloud or private software-defined infrastructure.

InsightCloudSec continually interacts with the APIs to gather information about the state of the hosts and the Kubernetes clusters of interest. These hosts can be GCP, AWS, Azure, or a private data center that can expose infrastructure information via an API.

Integrated within minutes, the Kubernetes Guardrails functionality simplifies the security assessment for the entire Kubernetes environment and the CI/CD pipeline, while also creating baseline profiles for each cluster, and highlighting and scoring security risks, misconfigurations, and hygiene drifts.

Both DevOps and Security teams enjoy the continuous and dynamic analysis of their Kubernetes deployments, all while seamlessly complying with regulatory requirements for Kubernetes.

With Kubernetes Guardrails, Dev teams are able to create a snapshot of cluster risks, delivered with a detailed list of misconfigurations, while detecting real-time hygiene and conformance drifts for deployments running on any cloud environment. Some of the most common use cases include:

  • Kubernetes vulnerability scanning
  • Hunting misplaced secrets and excessive secret access
  • Workload hardening (from pod security to network policies)
  • Istio security and configuration best practices
  • Ingress controllers security
  • Kubernetes API server access privileges
  • Kubernetes operators best practices
  • RBAC controls and misconfigurations

Ready to drive cloud security forward?

Rapid7 is proud to introduce a Kubernetes security solution that encapsulates all-in-one capabilities and unmatched coverage for all things Kubernetes.

With a security-first approach and strict compliance adherence, Kubernetes Guardrails enable a better understanding and control over distributed projects, and help organizations maintain smooth business operations.

Want to learn more? Watch the on-demand webinar on InsightCloudSec and its Kubernetes protection.

5 DevOps tips to speed up your developer workflow

Post Syndicated from Damian Brady original https://github.blog/2021-11-30-5-devops-tips-to-speed-up-your-developer-workflow/

TL;DR: From learning YAML to scripting with Bash, here are a few simple tips for developers who want to speed up their workflows.

From CI/CD to containerization management and server provisioning, DevOps gets a lot of buzz in tech today. You could even say that it’s a buzz … word.

As a developer, you might be part of a DevOps team, but you’re focused on building great software, not necessarily provisioning servers and managing containers.

Even still, a lot of what developers, DevOps engineers, and IT teams handle in today’s software development life cycle is focused on tools, testing, automations, and server orchestration. And, that’s even more true if you’re a team of one or engaging in a big open source project.

Here are five DevOps tips for any developer looking to work smarter and faster.

Tip #1: A little YAML can make frontend work easier

Initially released in 2001, YAML has become one of the de facto languages for a lot of declarative automation, and it’s commonly used in DevOps and development work for an array of frontend configurations, automations, and more.

YAML, which originally stood for Yet Another Markup Language and now officially stands for YAML Ain’t Markup Language, is a superset of JSON (as of YAML 1.2) and is notable for being human readable. That means it relies less on punctuation such as braces, brackets, and quotes ({}, [], ").

Here’s why this matters: Learning YAML (or even stepping up your YAML skills) makes it easier to store configurations for your own applications, such as your settings, in an easy-to-write and easy-to-read language.

For this reason, you’re likely to come across YAML files anywhere from enterprise development workflows to open source projects—and yes, you’ll see plenty of YAML files on GitHub (it powers a product we’re pretty fond of: GitHub Actions, but more on this later).

Whether you can apply YAML directly to your day-to-day dev workflows or leverage different tools that use YAML, there are some pretty big benefits to getting started with this language—or stepping up your YAML skills.

Looking to learn more about YAML? Try the Learn YAML in Y Minutes guide.

Tip #2: A few DevOps tools to keep you moving fast

Let’s clear up one thing first: “DevOps tools” is an umbrella term that covers everything from cloud platforms and server orchestration tools to code management, version control, and dozens of other things.

So when we talk about “DevOps tools,” we’re really talking about technologies that make it easier to write, test, host, and release software, as well as reduce any worries around unexpected failures.

Here are three “DevOps tools” that can speed up your workflows and let you focus on building great software.

Git

You’re on the GitHub Blog, so we’re pretty sure you’re familiar with Git as a version control system and distributed source code management tool. It’s a mainstay of developers and a popular DevOps tool.

Here’s why: Git makes version control easy and gives teams a straightforward way to collaborate, experiment with different branches, and merge new features into the main software branch.

Learn how Git works >

Cloud-hosted integrated development environments (IDE)

I know, I know, saying cloud-hosted integrated development environments, or cloud IDEs, out loud is a bit of a mouthful (thank you, marketing). But these platforms are something you should start exploring immediately, if you haven’t already.

Here’s why: Cloud IDEs are fully hosted developer environments that let you write, run, and debug code, and they make spinning up new, preconfigured environments fast. Do you need proof? We launched our own cloud IDE called Codespaces earlier this year and started using it internally to build GitHub. It used to take us up to 45 minutes to spin up new developer environments. Now it takes 10 seconds.

Cloud IDEs give you a super simple way to quickly spin up new, pre-configured development environments (and disposable development environments). Also, since they’re hosted in the cloud, you don’t need to worry about how powerful the computer you’re coding on is (friendly shout out here goes to the intrepid folks who have started coding on tablets).

Picture this: Your laptop fries itself (which has happened to me once or twice). You might have versions of npm, tools for connecting to your cloud provider, and any number of other configurations that you just lost. If you use a cloud IDE, you can spin up an environment in the cloud with all of your configurations, and that’s a magical thing to see.

Learn how cloud IDEs work >

Containers

If you don’t want to use a cloud IDE, dev containers are something you can use locally or in the cloud. Containers have exploded in popularity over the past decade for their utility in microservices architectures, CI/CD, and cloud-native application development, among other things. By nature, containers are lightweight and efficient, which makes it easy to build, test, stage, and deploy software.

Learning the basics of containerization can be really handy—especially when it comes to testing your code in a lightweight environment that imitates your production environment. If you need to upgrade a library or try using an application on the next version of Node, you can do that really easily with containers before you hit production.

This can be especially useful for “shifting left,” which is an important DevOps strategy. Catching issues or problems before you ever hit production can save a lot of headaches. If you can find those issues while you’re writing the code, that’s even better. Any problems will eventually mean more work, so the earlier you can catch them the better. After all, catching a problem before you get to the compiling stage can save you a headache or two.

Learn how containers work >

Tip #3: Automated testing and continuous integration (CI) to stay one step ahead

In any conversation around DevOps, you’ll probably hear about automated testing and continuous integration (CI). Yet while automated testing is typically part of a good CI development practice, it’s not strictly a requirement (but it should be … or at least part of your continuous delivery phase).

Most teams have some basic unit testing as part of their CI process, but stop short of testing for security vulnerabilities, automated UI testing, integration testing, etc.

Even still, these are two things that can help you step up your workflows by: (A) making sure your code works with the main branch; and (B) catching things like security vulnerabilities and other problems, so you can lessen your DevOps team’s workload.

Here’s how:

Using GitHub Actions to run automated tests

From ordering pizza to triggering an alarm, there’s a lot you can do with GitHub Actions. It all comes down to workflow automations. When it comes to setting up automated tests with GitHub Actions, you can either build your own action or leverage pre-built actions in the GitHub Marketplace.

Learn how to build your own GitHub Actions workflow automations >

Pro tip: Using Actions workflows that run on pull requests is a great way to check for security vulnerabilities, problems in your code, or anything else before you merge to the main branch. Doing this means you’re one step ahead and helps keep your main branch clean.

Want to learn more about GitHub Actions? Check out our guide >

You can also configure your workflows to deploy to ephemeral testing environments. This means you can run your tests and deploy your changes to an environment where you can test your application. You can even configure your workflow to automatically tear these testing environments down after you’re finished.

All this means you’re testing things as much as possible before it’s time to go to production.

Using GitHub Actions to create CI pipelines

CI, or continuous integration, is the process of automatically integrating code from multiple people for a given project. A good CI practice means you can work faster, make sure your code compiles correctly, merge code changes more efficiently, and be sure your code plays nice with everyone else’s work.

The most powerful CI workflows are the ones that test all of the things you care about every single time you push your code to the server.

If you’re working on GitHub, GitHub Actions can do this for you, too. There are plenty of pre-built CI workflows in the GitHub Marketplace (and you can always build your own), but there are a few things to keep in mind when you start incorporating CI into your development flow. These include:

  • Run the necessary tests: Think about what build, integration, and testing automations you ideally need. You’ll want to consider things that may have gone wrong with releases in the past, and see if you can add a test for that in your CI.
  • Balance the time it takes to test your code with how fast you’re pushing new code: Let’s say you have teams pushing new code every five minutes (hypothetically), but the tests you’re running take 10 minutes to execute … that’s not great. It’s always best to balance what you’re checking and when with how long it takes, which might mean trimming your ideal list of tests down to a more realistic number, at least for your CI builds.
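
The trade-off in the second point can be checked with simple queueing arithmetic. The sketch below is illustrative only: queue_waits is a hypothetical helper, and it assumes that pushes arrive at a fixed interval, that every CI run takes the same time, and that each run goes to whichever runner frees up first.

```python
def queue_waits(push_interval, test_duration, pushes, runners=1):
    """Return how long each push waits (in minutes) before its CI run starts."""
    free_at = [0] * runners  # time at which each runner next becomes idle
    waits = []
    for k in range(pushes):
        arrival = k * push_interval
        # Pick the runner that becomes idle earliest.
        runner = min(range(runners), key=lambda i: free_at[i])
        start = max(arrival, free_at[runner])
        waits.append(start - arrival)
        free_at[runner] = start + test_duration
    return waits


if __name__ == "__main__":
    # Pushes every 5 minutes, tests that take 10 minutes, one runner.
    print(queue_waits(5, 10, pushes=4))
    # Same load with a second runner.
    print(queue_waits(5, 10, pushes=4, runners=2))
    # Same load with tests trimmed to 5 minutes.
    print(queue_waits(5, 5, pushes=4))
```

In this model, a single runner makes each push wait five minutes longer than the one before it, so the queue never drains. Adding a second runner or trimming the tests to five minutes removes the wait.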

Get a tutorial on creating a CI pipeline with GitHub Actions >

Tip #4: Server orchestration tips for flexibility and speed

If you’re building a cloud-native application (or really even just using a few different servers, VMs, containers, or hosting services), you’re probably dealing with a few environments. Being able to make sure your application and infrastructure play well together means you can rely a little less on an operations team trying to get your software to run on existing infrastructure at the last minute.

That’s where server orchestration comes in. Server orchestration—or infrastructure orchestration—is often the job of IT and DevOps teams and includes configuring, managing, provisioning, and coordinating systems, applications, and core infrastructure needed to run software.

Pro tip: There’s a suite of tools that allow you to define and update the infrastructure you need to use.

A big advantage of infrastructure automation is improved scalability—and defined environments means it’s easier to tear down and rebuild an environment when something goes wrong (instead of starting from scratch, but we’ve all been there).

There’s another big advantage: If you want to test something, you don’t have to worry about asking the operations team to go and set up a server for you. You can instead do that as part of a workflow. You don’t have to worry about manually provisioning hardware or system requirements.

How to get started: Don’t try to replace everything in your environment with infrastructure automation at once. Instead, look for a part that might be easy to automate and start there, then move on to the next piece and the one after that.

And definitely never start in production. Instead, begin with your testing environment. Once that works, move to your staging environment (and if that works, you can trust it’s good for production).

Tip #5: Repeatable tasks? Try scripting them with Bash or PowerShell

Picture this: You have a bunch of repeatable tasks that you’re executing on a local basis, and you’re spending way too much time working through them every week. There’s a better—and more efficient—way to handle this. How? Scripting with either Bash or PowerShell.

Bash has deep roots in the Unix world, and it’s a mainstay of IT and DevOps teams, and more than a few developers too. PowerShell is comparatively newer. Designed by Microsoft and launched in 2006, PowerShell replaced the command shell and earlier scripting languages for task automation and configuration management in Windows environments.

Today, both Bash and PowerShell are cross-platform (though most people with a Windows background will use PowerShell, and most people familiar with Linux or macOS will use Bash out of habit).

Pro tip: Bash and PowerShell have different ways of working. Where PowerShell works with objects, Bash passes information around as strings. Even still, whatever you choose is largely up to personal preference.

One of the more useful things I’ve done with Bash and PowerShell, for example, is building a script that pulls down the latest version of the code, creates a new branch, switches to that branch, pushes a draft pull request up to GitHub, and then opens VSCode (sub in your editor of choice here) in that branch.
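
The author's script is not shown, so the following is only a sketch of the steps just described. It is written in Python rather than Bash or PowerShell so the command list can be printed as a dry run. The names branch_commands and start_branch are hypothetical. The sketch also assumes the default branch is called main, that the GitHub CLI (gh) is installed and authenticated, and that an empty first commit is acceptable so the draft pull request has something to compare.

```python
import shlex
import subprocess


def branch_commands(branch, base="main"):
    """Build the commands for starting work on a new branch."""
    return [
        ["git", "switch", base],                # go to the base branch (assumed "main")
        ["git", "pull", "--ff-only"],           # pull down the latest code
        ["git", "switch", "-c", branch],        # create and switch to the new branch
        # Assumption: an empty commit so the pull request has a diff against base.
        ["git", "commit", "--allow-empty", "-m", f"Start {branch}"],
        ["git", "push", "--set-upstream", "origin", branch],
        ["gh", "pr", "create", "--draft", "--fill"],  # draft PR via the GitHub CLI
        ["code", "."],                          # open VS Code; swap in your editor
    ]


def start_branch(branch, base="main", dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    commands = branch_commands(branch, base)
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    start_branch("feature/example")  # dry run: only prints the commands
```

The same commands could be pasted into a Bash or PowerShell script line by line. Keeping a dry-run default makes it easy to review what would run before anything touches the repository.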

It’s a series of small steps to make your life much easier. It’s something you might do once or twice a week, and if you can script that—it gives you more time to focus on what matters: writing great code.

The bottom line

There’s a big difference between an IT pro, a DevOps engineer, and a developer. But in today’s world of software development, a lot of core DevOps practices are becoming everyone’s job. Plus, any developer that can learn a few DevOps tricks can have an easier time working independently (and more efficiently at that), and continue to focus on what matters most: building great software. That’s something we can all get behind.

AWS attendee guide for DevOps and Developer Productivity track at re:Invent 2021

Post Syndicated from Harshitha Putta original https://aws.amazon.com/blogs/devops/aws-attendee-guide-for-devops-and-developer-productivity-track-at-reinvent-2021/

AWS re:Invent is a learning conference hosted by Amazon Web Services for the global cloud computing community. We are super excited to join you at the 10th annual re:Invent to share the latest from AWS leaders and discover more ways to learn and build. Let’s celebrate this milestone event, which will be offered in person in Las Vegas (November 29–December 3) and virtually (November 29–December 10). The health and safety of our customers and partners remains our top priority, and you can learn more on the health measures page. For details about the virtual format, check out the virtual section. If you haven’t already registered, don’t forget to register and save your spot at your favorite sessions.

The AWS DevOps and Developer Productivity track at re:Invent offers sessions that combine cultural philosophies, practices, and tools that increase an organization’s ability to deliver applications and services at high velocity. The sessions vary from intermediate (200) through expert (400) levels, and help you accelerate the pace of innovation in your business. This blog post highlights the sessions from the DevOps and Developer Productivity track that you shouldn’t miss.

Breakout Sessions

AWS re:Invent breakout sessions are lecture-style and one hour long. These sessions are delivered by AWS experts, customers, and partners, and they typically include 10–15 minutes of Q&A at the end. For our virtual attendees, breakout sessions will be made available on demand in the week after re:Invent.

Level 200 – Intermediate

  • DOP208-R1 and DOP208-R2 DevOps revolution

While DevOps has not changed much, the industry has fundamentally transformed over the last decade. Monolithic architectures have evolved into microservices. Containers and serverless have become the default. Applications are distributed on cloud infrastructure across the globe. The technical environment and tooling ecosystem has changed radically from the original conditions in which DevOps was created. So, what’s next? In this session, learn about the next phase of DevOps: a distributed model that emphasizes swift development, observable systems, accountable engineers, and resilient applications.

Level 300 – Advanced

  • DOP301 How to reuse patterns when developing infrastructure as code

This session explores the AWS Cloud Development Kit (AWS CDK) constructs and AWS CloudFormation modules and how they can be used to make building applications easier on AWS. Learn how you can extend CloudFormation to include support for third-party resources and how those resource types can be used by AWS Config.

  • DOP309 Amazon builders’ library: Operational excellence at Amazon

Operational excellence at Amazon is achieved through a DevOps model, where software development teams operate the systems they build. In this session, Senior Principal Engineer David Yanacek describes Amazon’s operational practices that he has observed during his 15 years of building and operating services at Amazon. Hear David describe the habits that teams have adopted, such as how teams handle retrospectives, share knowledge, and regularly review operational metrics as a team. David discusses how these behaviors have led teams to innovate to build better tools and make architectural shifts.

  • DOP310 Enabling decentralized development teams with a shared services platform

Speed in software development requires being able to equip development teams with tools and guardrails for DevOps, security, and infrastructure configuration. Too often, central teams find they need to piece together their own custom solutions or compromise the speed of their development organization in order to maintain standards. In this session, dive deep into the crawl, walk, and run options and best practices for building a shared services platform on AWS using tools and services such as AWS Copilot, AWS Proton, and pre-built solutions using AWS CloudFormation.

  • DOP311 Incorporating continuous resilience in your development ecosystem

Today, resilience encompasses a broad range of considerations from infrastructure, application patterns, and data management to application building and monitoring. Additionally, after incorporating resilience, it is essential to maintain it in a continuous manner. In this session, explore various considerations for implementing processes designed to provide continuous improvement through a DevOps methodology. Review various services that can incorporate resilience in the development process in a nearly continuous manner.

  • DOP312 Observing your applications from development through production

Implementing observability differs at various stages of the software development lifecycle. In development, detailed logging and tracing are necessary to understand application behavior. In testing, logging and tracing are needed but in varying levels of detail and must be augmented by new metrics. In integration and production, it’s necessary to correlate and contextualize large volumes of data with dashboards that encompass metrics, alarms, and notifications connected to internal and external events. In this session, explore the mechanisms, mental models, and tools (including Amazon CloudWatch, AWS CloudTrail, AWS X-Ray and Amazon DevOps Guru) that top-performing teams use to observe applications throughout various stages of the software development lifecycle.

  • DOP313 Best practices for securing your software delivery lifecycle

In this session, learn about ways you can secure your AWS CI/CD pipeline. Review topics like security of the pipeline versus security in the pipeline, and learn about practices to incorporate security checkpoints across various pipeline stages, security event management, and aggregation of vulnerability findings into a unified display. This session also introduces foundational methodologies that combine best practices, processes, and tools to increase an organization’s ability to deliver applications and services securely.

  • DOP314 Write, deploy, and provision cloud resources with AWS Developer Tools

In this session, learn how you can use various AWS Developer Tools to improve your ergonomics across the entire development lifecycle. This session dives deep into IDE extensions, SDKs, and toolkits that provide first-class integrations with AWS services. It also explores how to manage and fine-tune your resources with the AWS Command Line Interface (AWS CLI); how to define your infrastructure in common programming languages with AWS CDK; and how to automate testing, building, debugging, and deployment.

  • DOP315 What’s new with AWS CloudFormation and AWS CDK

Join this session to learn about new features to up-level your infrastructure as code (IaC) experiences on AWS. It covers working with AWS CloudFormation modules and AWS CDK constructs to make working with AWS easier; CloudFormation registry to streamline creating, publishing, discovering, and using AWS and third-party plugins; CloudFormation StackSets and CDK Pipelines to automate the deployment of resources and applications across multiple AWS Regions and accounts; CloudFormation Guard 2.0 to attain security and best practice compliance before deployments reach production; and more. Explore how AWS is improving our IaC coverage of AWS services and features in a scalable, decentralized way and how you can contribute.

  • DOP325 Building with the new AWS SDKs for Rust, Kotlin, and Swift

Writing code in the AWS SDKs for Rust, Kotlin, and Swift has never been as easy as it is now. This session explores how AWS built these SDKs in parallel, the commonalities they share, and how to build an application with each one. Then, it details best practices for using the SDKs and how to use the features to test your code efficiently. Lastly, this session takes a close look at how the SDKs work and reviews the road map for the future.

  • DOP328-S Slack is the digital HQ for AWS developers and DevOps teams (sponsored by Slack)

With increased pressure on software teams to release high-quality products faster, it’s more important than ever to work effectively in an interdependent and cross-functional manner. Yet, communication and collaboration have not changed to reflect the way Agile and DevOps teams actually get work done. Join this session to find out why Slack is the digital HQ for engineering and operations teams. This presentation is brought to you by Slack, an AWS Partner. Speakers: Logan Franey (Slack) and Clint Burns (Slack).

Level 400 – Expert

  • DOP402-R1 and DOP402-R2 Automating cross-account CI/CD pipelines

When building a deployment strategy for your applications, using a multi-account approach is a recommended best practice. This limits the area of impact for changes made and results in better modularity, security, and governance. In this session, dive deep into an example multi-account deployment using infrastructure as code (IaC) services such as the AWS CDK, AWS CodePipeline, and AWS CloudFormation. Also explore a real-world customer use case that is deploying at scale across hundreds of AWS accounts.

Builders’ Sessions

Builders’ sessions are small-group sessions led by an AWS expert who guides you as you build the service or product. Each builders’ session begins with a short explanation or demonstration of what you are going to build. Once the demonstration is complete, use your laptop to experiment and build with the AWS expert.

Level 300 – Advanced

  • DOP303 Assessing your application resiliency using chaos engineering

This builders’ session guides you through the principles of chaos engineering and building observability to assess the resiliency of your application and infrastructure. Walk through a hands-on exercise using AWS Fault Injection Simulator to inject faults and observe the impacts using various managed services such as an Amazon CloudWatch dashboard. Learn to use Amazon DevOps Guru, which uses machine learning for observability, to improve the mean time to recovery (MTTR) and decrease downtime.

  • DOP304 Creating and publishing AWS CloudFormation public resources

This builders’ session guides you through the process for creating AWS CloudFormation extensions for a CloudFormation public or private registry. Also, learn how to consume resource types from the registries created by other teams or organizations.

  • DOP305-R1 and DOP305-R2 Continuous deployment with AWS CDK Pipelines

CDK Pipelines is a new AWS CDK construct that simplifies defining and building CI/CD pipelines for safely deploying software changes. This builders’ session shows you how to effectively use CDK Pipelines to manage software releases.

Chalk Talks

Chalk Talks are highly interactive sessions with a small audience. Experts lead you through problems and solutions on a digital whiteboard as the discussion unfolds. Each begins with a short lecture (10–15 minutes) delivered by an AWS expert, followed by a 45- or 50-minute Q&A session with the audience.

Level 200 – Intermediate

  • DOP201 Provisioning, automating, and orchestrating IaC on AWS

This chalk talk answers your questions about infrastructure as code (IaC), including relevant AWS services and partner products. It covers topics like IaC patterns, the AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation, augmentation of your CI/CD workflows with services like AWS Systems Manager, and how to select the right tools for the job. Join this talk and bring your questions.

  • DOP202 Increasing availability with AWS observability solutions

In this chalk talk, learn how you can use Amazon CloudWatch to gain insights about the trends and patterns of your infrastructure performance in real time. Learn how to slice and dice your CloudWatch Container Insights metrics to help you gain actionable insights. Also, learn how to monitor blue/green deployments to stay away from downtimes. Lastly, discuss how to use metric querying to analyze and compare how your business is doing across different areas.

  • DOP203-R1 and DOP203-R2 Getting started developing backend applications in the cloud

This chalk talk dives deep into the most common strategies for organizing the development of complex cloud applications using various AWS services and solutions. Learn about how to get started with your local development environment, about how to select the right infrastructure, and about achieving the fastest time to market. Join this talk and bring your questions.

  • DOP204 Testing on AWS

Testing software is a crucial part of the software delivery lifecycle, with testing methodologies prescribed for every stage of the process from local unit testing on a developer laptop to load testing production environments. This chalk talk briefly covers several test methodologies, then transitions to a Q&A session where you can ask AWS experts your testing-related questions.

  • DOP209-R1 and DOP209-R2 Amazon’s DevOps culture

In this chalk talk, learn about how Amazon enables its developers to rapidly release and iterate software while maintaining industry-leading standards on security, reliability, and performance. Consider the tradition of two-pizza teams and how to maintain a culture of DevOps in a large enterprise. Also, hear how you can help AWS customers build such a culture for themselves.

  • DOP210 Using on-premises Git with AWS Developer Tools for security and compliance

In this chalk talk, learn about using AWS Developer Tools in conjunction with third-party Git solutions, such as GitHub Enterprise, Bitbucket Server, and more.

  • DOP211-R1 and DOP211-R2 Building scalable machine-learning pipelines

In this chalk talk, explore how to build, automate, manage, and scale machine learning (ML) workflows using the AWS-native DevOps tools and services. Learn how to provision the underlying resources needed to enable CI/CD capabilities for your ML development lifecycle. Also, learn how to use the built-in templates or create your own custom templates using AWS CloudFormation.

Level 300 – Advanced

  • DOP302 Continuous compliance for your development workflow

This chalk talk dives deep into the importance of and mechanisms for meeting security and compliance requirements for your organization. Learn ways that you can enforce pre- and post-deployment standards, shift-left testing, and use of services like Amazon CodeGuru Reviewer, AWS CloudFormation Guard, and AWS Config for security static analysis and runtime compliance checks. Join this talk and bring your questions.

  • DOP316-R1 and DOP316-R2 Continuous integration strategies and best practices

In this chalk talk, learn about using continuous integration across your branches and pull-request workflows. Explore various considerations for monolith versus containerized applications, incorporating best practices like security checkpoints, generating test reports, integrating open-source packages, and more. Learn some build-optimization techniques with available tools and services. Lastly, evaluate integration into the GitOps model.

  • DOP317 Application deployment strategies for AWS applications

This chalk talk covers deployment strategies, including blue/green, in-place, feature-flag, and canary deployments. It also explores strategies for working with data structure changes.

  • DOP318 Deploying AWS Config conformance packs with RDK using AWS CodePipeline

In this chalk talk, learn how to use a simple pipeline to create custom AWS Config rules and deploy them across an AWS Organizations organization using AWS Config, the AWS Developer Tools, the AWS Config Rule Development Kit (RDK), and RDKLib. During the talk, explore how to build, test, and deploy these rules at scale across multiple AWS accounts in a repeatable, secure, and automated way.

  • DOP319-R1 and DOP319-R2 Choose your own adventure: AWS Java developer tooling

Are you a Java developer deploying applications to AWS? Do you wonder how you can improve your development cycle, be more productive, and deliver better-performing applications? During this highly interactive chalk talk, AWS experts adapt topics in real time to cover those that interest you the most. Choose from a range of options, from Java-specific integrations with popular services like Amazon S3, AWS Lambda, and Amazon DynamoDB, to Java-focused IDE tooling to help you be more productive when you’re authoring code that runs on AWS. Also, review Amazon Corretto (OpenJDK) and Amazon CodeGuru as part of Java application development.

Level 400 – Expert

  • DOP403-R1 and DOP403-R2 Maximize value with AWS CloudFormation advanced features

In this chalk talk, gain insights into AWS CloudFormation advanced features to transform the way you provision and manage your AWS and third-party resources. Discover how to use best practice plugins from the CloudFormation registry to create a CloudFormation template, use CloudFormation Guard to check the template for security and compliance errors, disable the rollback to accelerate provisioning, and use CloudFormation StackSets to provision resources in multiple AWS accounts and Regions.


Workshops

Workshops are two-hour interactive learning sessions where you work in small group teams to solve problems using AWS services. Each workshop starts with a short lecture (10–15 minutes) by the main speaker, and the rest of the time is spent working as a group. Come prepared with your laptop and a willingness to learn!

Level 200 – Intermediate

  • DOP205-R1 and DOP205-R2 Build your infrastructure with AWS CloudFormation and the AWS CDK

In this workshop, learn how to build using infrastructure as code with AWS CloudFormation and the AWS CDK. Create resources using CloudFormation and learn about maintenance and operations tips. Also delve into using the AWS CDK to enable developers to utilize their choice of programming language to create infrastructure. The workshop walks you through the steps of coding and building your own constructs (or integrating solution constructs) and publishing them as shared libraries. Let’s build!

  • DOP206 Improve availability and resilience with fault injection experiments

This workshop introduces you to chaos engineering. Learn how to improve the resiliency of applications and infrastructure using hypothesis-based experiments, disruption, and observation, including recurrent scenarios and CI/CD. Get hands-on with AWS Fault Injection Simulator and AWS Systems Manager to simulate outage scenarios, and learn how to combine this with observability tools like Amazon CloudWatch and Amazon DevOps Guru to uncover hidden issues, expose unseen areas, and validate remediation steps.

  • DOP207 Improving development ergonomics for developers

In this workshop, get hands-on with developer tools on AWS including AWS IDE toolkits, SDKs, and CLIs to build a modern application. Learn how to easily and efficiently build, test, and debug a serverless application and explore modern tooling, including Amazon CodeGuru, AWS Serverless Application Model (AWS SAM) tooling, and managed environments, to rapidly prototype and debug secure applications in the cloud.

Level 300 – Advanced

  • DOP306 Implementing release management strategies for CI/CD

This workshop guides you through building CI/CD pipelines with release-management best practices, including artifact management as well as zero-downtime release promotion and rollback mechanisms. Evaluate various rollback/roll forward strategies across compute types and assess the need for manual processes.

  • DOP307-R1 and DOP307-R2 AWS CLI tips and tricks

In this workshop, learn AWS Command Line Interface (AWS CLI) tips and tricks. Discover how to efficiently interact with your AWS services, manage your AWS resources, automate your regular repetitive operations, and utilize various use cases presented in this workshop. Join this workshop to hear about new feature functionalities integrated for developer operations.

  • DOP320 Observability: Best practices for improving developer productivity

In this hands-on workshop, dive deep into how you can improve developer productivity by correlating metrics and traces to identify user impact from any source and to find broken or expensive code paths as quickly as possible. Learn how to do this with AWS services, without having to re-instrument code, when adding new observability tools to development workflows.

Level 400 – Expert

  • DOP401-R1 and DOP401-R2 Continuous security and compliance for your CI/CD pipeline

This workshop dives deep into the importance of and mechanisms for meeting security and compliance requirements for your organization. Learn ways that you can enforce pre- and post-deployment standards, shift-left testing, and use of services like Amazon CodeGuru Reviewer, AWS CloudFormation Guard, and AWS Config for security static analysis and runtime compliance checks. Join this workshop and bring your questions.

In addition to these sessions, we offer leadership sessions through which you can hear directly from AWS leaders as they share the latest advances in AWS technologies, set the future product direction, and motivate you through compelling success stories. Also, expect to hear about the launch of new and exciting AWS services and features throughout the event.

Still looking for more?

We have an extensive list of curated content on DevOps on AWS, including case studies, white papers, previous re:Invent presentations, reference architectures, and how-to instructional videos. Subscribe to our AWS Developer Tools and Services channel to get updates when new videos are added.

About the author

Harshitha Putta

Harshitha Putta is a Senior Cloud Infrastructure Architect with AWS Professional Services in Seattle, WA. She is passionate about building innovative solutions using AWS services to help customers achieve their business objectives. She enjoys spending time with family and friends, playing board games and hiking.

Get started with AWS DevOps Guru Multi-Account Insight Aggregation with AWS Organizations

Post Syndicated from Ifeanyi Okafor original https://aws.amazon.com/blogs/devops/get-started-with-aws-devops-guru-multi-account-insight-aggregation-with-aws-organizations/

Amazon DevOps Guru is a fully managed service that uses machine learning (ML) to continuously analyze and consolidate operational data streams from multiple sources, such as Amazon CloudWatch metrics, AWS Config, AWS CloudFormation, and AWS X-Ray, and provides you with a single console dashboard. This dashboard helps customers improve operational performance and avoid expensive downtime by generating actionable insights that flag operational anomalies, identify the likely root cause, and recommend corrective actions.

As customers scale their AWS resources across multiple accounts and deploy DevOps Guru across applications and use cases on these accounts, they get a siloed view of the operational health of their applications. Now you can enable multi-account support with AWS Organizations and designate a member account to manage operational insights across your entire organization. This delegated administrator can get a holistic view of the operational health of their applications across the organization—without the need for any additional customization.

In this post, we will walk through the process of setting up a delegated administrator. We will also explore how to monitor insights across your entire organization from this account.

Overview of the multi-account environment

The multi-account environment operates based on the organizational hierarchy that your organization has defined in AWS Organizations. This AWS service helps you centrally manage and govern your cloud environment. For this reason, you must have an organization set up in AWS Organizations before you can implement multi-account insights visibility. AWS Organizations is available to all AWS customers at no additional charge, and the service user guide has instructions for creating and configuring an organization in your AWS environment.

Understanding the management account, a delegated administrator, and other member accounts is fundamental to the multi-account visibility functionality in DevOps Guru. Before proceeding further, let’s recap these terms.

  • A management account is the AWS account you use to create your organization. You can create other accounts in your organization, invite and manage invitations for other existing accounts to join your organization, and remove accounts from your organization. The management account has wide permissions and access to accounts within the organization. It should only be used for absolutely essential administrative tasks, such as managing accounts, OUs, and organization-level policies. You can refer to the AWS Organizations FAQ for more information on management accounts.
  • When the management account gives a member account service-level administrative permissions, that member account becomes a delegated administrator. Because the permissions are assigned at the service level, the delegated administrator’s privileges are confined to the AWS service in question (DevOps Guru, in this case). The delegated administrator manages the service on behalf of the management account, leaving the management account to focus on administrative tasks, such as account and policy management. Currently, DevOps Guru supports a single delegated administrator, which operates at the root level (i.e., at the organization level). A member account must already be in the organization before it can be elevated to a delegated administrator. For more information about adding a member account to an organization, see inviting an AWS account to join your organization.
  • Member accounts are accounts without any administrative privilege. An account can be a member of only one organization at a time.


You must have the following to enable multi-account visibility:

  • An organization already set up in AWS Organizations. AWS Organizations is available to all AWS customers at no additional charge, and the service user guide has instructions for creating and configuring an organization in your AWS environment.
  • A member account that is in your organization and already onboarded in DevOps Guru. This account will be registered as a delegated administrator. For more information about adding a member account to an organization, see inviting an AWS account to join your organization.

Setting up multi-account insights visibility in your organization

In line with AWS Organizations’ best practices, we recommend first assigning a delegated administrator. Although a management account can view DevOps Guru insights across the organization, management accounts should be reserved for organization-level management, while the delegated administrator manages at the service level.

Registering a Delegated Administrator

A delegated administrator must be registered by the management account. The steps below assume that you have a member account to register as a delegated administrator. If your preferred account is not yet in your organization, then invite the account to join your organization.

To register a delegated administrator

  1. Log in to the DevOps Guru Console with the management account.
  2. On the welcome page, under Set up type, select Monitor applications across your organizations. If you select Monitor applications in the current AWS account, then your dashboard will display insights for the management account.
  3. Under Delegated administrator, select Register a delegated administrator (Recommended).
  4. Select Register delegated administrator to complete the process.

Register a delegated administrator. Steps 1-4 highlighted in console screenshot.
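
The console steps above can presumably also be scripted from the management account. The following is a minimal sketch, assuming the AWS Organizations `RegisterDelegatedAdministrator` API as exposed by boto3 and assuming that `devops-guru.amazonaws.com` is the service principal for DevOps Guru; check both against the product documentation before relying on them. The function name and the example account ID are hypothetical.

```python
# Assumed service principal for DevOps Guru; verify it before use.
DEVOPS_GURU_PRINCIPAL = "devops-guru.amazonaws.com"


def register_devops_guru_admin(org_client, account_id):
    """Register `account_id` as the delegated administrator for DevOps Guru.

    `org_client` is expected to behave like boto3.client("organizations")
    created with management-account credentials.
    """
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit strings")
    org_client.register_delegated_administrator(
        AccountId=account_id,
        ServicePrincipal=DEVOPS_GURU_PRINCIPAL,
    )
    return account_id


# Usage (requires boto3 and management-account credentials):
#   import boto3
#   register_devops_guru_admin(boto3.client("organizations"), "111122223333")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before running it against a real organization.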

To de-register a delegated administrator

  1. Log in to the DevOps Guru Console and navigate to the Management account page.
  2. On the Management account page, select De-register administrator.

De-register a delegated Administrator. Steps 1 and 2 highlighted on console.

Viewing insights as a Delegated Administrator

As the delegated administrator, you can choose to view insights from

  • specific accounts
  • specific OUs
  • the entire organization

To view insights from specific accounts

  1. Log in to the DevOps Guru console, and select Accounts from the dropdown menu on the dashboard.
  2. Select the search bar to display a list of member accounts.
  3. Select up to five accounts, and select anywhere outside the dropdown menu to apply your selection. Simply select the delegated administrator account from the dropdown menu to view insights from the delegated administrator account.

View Insights from specific account. Steps 1-3 highlighted on dashboard screenshot.

The system health summary now will display information for the selected accounts.

To view insights from specific organizational units

  1. Log in to the DevOps Guru console, and select Organizational Units from the dropdown menu on the dashboard.
  2. Select the search bar to display the list of OUs.
  3. Select up to five OUs, and select anywhere outside of the dropdown menu to apply your selection.

View insights from specific organizational units. Steps 1-3 highlighted in dashboard screenshot.

Now the system health summary will display information for the selected OUs. Nested OUs are currently not supported, so only the accounts directly under the OU are included when an OU is selected. To include accounts in a sub-OU, select the sub-OU in addition to the parent OU.
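
Because nested OUs are not expanded automatically, it can help to work out up front which OU IDs need to be selected together. The sketch below is an illustration only: it assumes you already have a mapping from each parent OU ID to its child OU IDs (for example, one built with the Organizations `list_organizational_units_for_parent` API), and the OU IDs shown are made up.

```python
def expand_ou(ou_id, children_by_parent):
    """Return `ou_id` followed by every OU nested beneath it (depth-first)."""
    selected = []
    seen = set()
    stack = [ou_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against malformed, cyclic input
        seen.add(current)
        selected.append(current)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(children_by_parent.get(current, [])))
    return selected


# Hypothetical OU tree: "ou-apps" has two children, one of which has a child.
tree = {
    "ou-apps": ["ou-apps-web", "ou-apps-data"],
    "ou-apps-web": ["ou-apps-web-eu"],
}
print(expand_ou("ou-apps", tree))
# ['ou-apps', 'ou-apps-web', 'ou-apps-web-eu', 'ou-apps-data']
```

Keep in mind that the dashboard accepts up to five OUs per selection, so a deep hierarchy may need to be viewed in several passes.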

To view insights across the entire organization

  1. Log in to the DevOps Guru console and navigate to the Insights page.
  2. On the Reactive tab, you can see a list of all the reactive insights in the organization. On the Proactive tab, you can also see a list of all the proactive insights in the organization. You will notice that the table displaying the insights now has an Account ID column (highlighted in the snapshot below). This table aggregates insights from different accounts in the organization.

View insights across the entire organization.

  3. Use one or more of the following filters to find the insights that you are looking for:
    1. Choose the Reactive or Proactive tab.
    2. In the search bar, you can add an account ID, status, severity, stack, or resource name to specify a filter.

View insights across the entire organization and filter. Steps 1-3 highlighted in Insights summary page.

    3. To search by account ID, select the search bar, select Account, then select an account ID in the submenu.
    4. Choose or specify a time range to filter by insight creation time. For example, 12h shows insights created in the past 12 hours, and 1d shows the insights of the previous day. 1w will show the past week’s insights, and 1m will show the last month’s insights. Custom lets you specify another time range. The maximum time range you can use to filter insights is 180 days.

View insights across the entire organization and filter by time.
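
To make the time-range presets concrete, here is a small sketch of how each shorthand could be turned into a start and end timestamp, including the 180-day ceiling for custom ranges. The function names are hypothetical, and treating `1m` as 30 days is an assumption; the console may define a month differently.

```python
from datetime import datetime, timedelta, timezone

MAX_RANGE = timedelta(days=180)  # maximum filter range stated above

# Assumption: "1m" is approximated as 30 days.
PRESETS = {
    "12h": timedelta(hours=12),
    "1d": timedelta(days=1),
    "1w": timedelta(weeks=1),
    "1m": timedelta(days=30),
}


def insight_window(preset, now=None):
    """Return (start, end) datetimes for a preset such as "12h" or "1w"."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    end = now or datetime.now(timezone.utc)
    return end - PRESETS[preset], end


def custom_window(start, end):
    """Validate a custom range against the 180-day maximum."""
    if end <= start:
        raise ValueError("end must be after start")
    if end - start > MAX_RANGE:
        raise ValueError("range exceeds 180 days")
    return start, end
```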

Viewing insights from the Management Account

Viewing insights from the management account is similar to viewing insights from the delegated administrator, so the process listed for the delegated administrator also applies to the management account. Although the management account can view insights across the organization, it should be reserved for running administrative tasks across various AWS services at an organization level.

Important notes

Multi-account insight visibility works at a region level, meaning that you can only view insights across the organization within a single AWS region. You must change the AWS region from the region dropdown menu at the top-right corner of the console to view insights from a different AWS region.
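
If you need a view across several Regions, one option is to query each Region in turn and combine the results yourself. The sketch below assumes the DevOps Guru `ListOrganizationInsights` API as exposed by boto3 (`list_organization_insights`, with a `StatusFilter` argument and `ReactiveInsights` and `NextToken` in the response); treat those names as assumptions to confirm in the API reference. `client_for_region` is a hypothetical factory that you supply.

```python
def ongoing_reactive_insights(regions, client_for_region):
    """Collect ongoing reactive insights for the organization, per Region.

    `client_for_region(region)` should return an object that behaves like
    boto3.client("devops-guru", region_name=region), created with
    delegated-administrator (or management-account) credentials.
    """
    by_region = {}
    for region in regions:
        client = client_for_region(region)
        insights, token = [], None
        while True:
            kwargs = {"StatusFilter": {"Ongoing": {"Type": "REACTIVE"}}}
            if token:
                kwargs["NextToken"] = token
            page = client.list_organization_insights(**kwargs)
            insights.extend(page.get("ReactiveInsights", []))
            token = page.get("NextToken")
            if not token:
                break
        by_region[region] = insights
    return by_region


# Usage (requires boto3):
#   import boto3
#   ongoing_reactive_insights(
#       ["us-east-1", "eu-central-1"],
#       lambda region: boto3.client("devops-guru", region_name=region),
#   )
```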

For data security reasons, the delegated administrator can only access insights generated across the organization after the selected member account became the delegated administrator. Insights generated across the organization before the delegated administrator registration will remain inaccessible to the delegated administrator.


The steps detailed above show how you can quickly enable multi-account visibility to monitor application health across your entire organization.

AWS customers are now using AWS DevOps Guru to monitor and improve application performance, and you too can start monitoring your applications by following the instructions in the product documentation. Head over to the AWS DevOps Guru console to get started today.

About the authors

Ifeanyi Okafor

Ifeanyi Okafor is a Product Manager with AWS. He enjoys building products that solve customer problems at scale.

Haider Naqvi

Haider Naqvi is a Solutions Architect at AWS. He has extensive Software Development and Enterprise Architecture experience. He focuses on enabling customers to reinvent and achieve business outcomes with AWS. He is based out of New York.

Nick Ardecky

Nick Ardecky is a Software Engineer working with the AWS DevOps Guru team. Nick enjoys building tools and visualizations that improve products and solve customer problems.

Deep learning image vector embeddings at scale using AWS Batch and CDK

Post Syndicated from Filip Saina original https://aws.amazon.com/blogs/devops/deep-learning-image-vector-embeddings-at-scale-using-aws-batch-and-cdk/

Applying various transformations to images at scale is an easily parallelized and scaled task. As a Computer Vision research team at Amazon, we occasionally find that the amount of image data we are dealing with can’t be effectively computed on a single machine, but also isn’t large enough to justify running a large and potentially costly Amazon EMR (Elastic MapReduce) job. This is when we can utilize AWS Batch as our main computing environment, as well as the AWS Cloud Development Kit (CDK) to provision the necessary infrastructure in order to solve our task.

In Computer Vision, we often need to represent images in a more concise and uniform way. Working with standard image files would be challenging, as they can vary in resolution or are otherwise too large in terms of dimensionality to be provided directly to our models. For that reason, the common practice for deep learning approaches is to translate high-dimensional information representations, such as images, into vectors that encode most (if not all) information present in them — in other words, to create vector embeddings.

This post will demonstrate how we utilize the AWS Batch platform to solve a common task in many Computer Vision projects — calculating vector embeddings from a set of images so as to allow for scaling.

Architecture Overview

Diagram explained in post.

Figure 1: High-level architectural diagram explaining the major solution components.

As seen in Figure 1, AWS Batch will pull the docker image containing our code onto provisioned hosts and start the docker containers. Our sample code, referenced in this post, will then read the resources from S3, conduct the vectorization, and write the results as entries in the DynamoDB Table.

In order to run our image vectorization task, we will utilize the following AWS cloud components:

  • Amazon ECR — Elastic Container Registry is a Docker image repository from which our batch instances will pull the job images;
  • S3 — Amazon Simple Storage Service will act as our image source from which our batch jobs will read the image;
  • Amazon DynamoDB — NoSQL database in which we will write the resulting vectors and other metadata;
  • AWS Lambda — Serverless compute environment which will conduct some pre-processing and, ultimately, trigger the batch job execution; and
  • AWS Batch — Scalable computing environment powering our models as embarrassingly parallel tasks running as AWS Batch jobs.

To translate an image to a vector, we can utilize a pre-trained model architecture, such as AlexNet, ResNet, VGG, or more recent ones, like ResNeXt and Vision Transformers. These model architectures are available in most of the popular deep learning frameworks, and they can be further modified and extended depending on our project requirements. For this post, we will utilize a pre-trained ResNet18 model from MXNet. We will output an intermediate layer of the model, which will result in a 512-dimensional representation, or, in other words, a 512-dimensional vector embedding.

Deployment using Cloud Development Kit (CDK)

In recent years, the idea of provisioning cloud infrastructure components using popular programming languages was popularized under the term infrastructure as code (IaC). Instead of writing a file in the YAML/JSON/XML format, which would define every cloud component we want to provision, we might want to define those components through a popular programming language.

As part of this post, we will demonstrate how easy it is to provision infrastructure on AWS cloud by using Cloud Development Kit (CDK). The CDK code included in the exercise is written in Python and defines all of the relevant exercise components.

Hands-on exercise

1. Deploying the infrastructure with AWS CDK

For this exercise, we have provided a sample batch job project that is available on GitHub (link). By using that code, you should have every component required to do this exercise, so make sure that you have the source on your machine. The root of your local copy of the sample project should contain the following files:

batch_job_cdk - CDK stack code of this batch job project
src_batch_job - source code for performing the image vectorization
src_lambda - source code for the lambda function which will trigger the batch job execution
app.py - entry point for the CDK tool
cdk.json - config file specifying the entry point for CDK
requirements.txt - list of python dependencies for CDK 
  1. Make sure you have installed and correctly configured the AWS CLI and AWS CDK in your environment. Refer to the CDK documentation for more information, as well as the CDK getting started guide.
  2. Set the CDK_DEPLOY_ACCOUNT and CDK_DEPLOY_REGION environment variables, as described in the project README.md.
  3. Go to the sample project root and install the CDK python dependencies by running pip install -r requirements.txt.
  4. Install and configure Docker in your environment.
  5. If you have multiple AWS CLI profiles, utilize the --profile option to specify which profile to use for deployment. Otherwise, simply run cdk deploy and deploy the infrastructure to your AWS account set in step 1.
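
Before running `cdk deploy`, it can be worth checking that the two environment variables from step 2 are actually set. The snippet below is a minimal pre-flight sketch; how the sample project's app.py consumes these variables is not shown in this post, so the function name and the shape of the returned dictionary are assumptions made for illustration.

```python
import os


def deploy_environment(environ=None):
    """Read the deployment target from CDK_DEPLOY_ACCOUNT and CDK_DEPLOY_REGION."""
    environ = os.environ if environ is None else environ
    required = ("CDK_DEPLOY_ACCOUNT", "CDK_DEPLOY_REGION")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("set these variables first: " + ", ".join(missing))
    return {
        "account": environ["CDK_DEPLOY_ACCOUNT"],
        "region": environ["CDK_DEPLOY_REGION"],
    }
```

A dictionary of this shape could, for example, be unpacked into the CDK `Environment(account=..., region=...)` object when a stack is instantiated.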

NOTE: Before deploying, make sure that you are familiar with the restrictions and limitations of the AWS services we are using in this post. For example, if you choose to set an S3 bucket name in the CDK Bucket construct, you must avoid naming conflicts that might cause deployment errors.

The CDK tool will now trigger our docker image build, provision the necessary AWS infrastructure (i.e., S3 Bucket, DynamoDB table, roles and permissions), and, upon completion, upload the docker image to a newly created repository on Amazon Elastic Container Registry (ECR).

2. Upload data to S3

Console explained in post.

Figure 2: S3 console window with uploaded images to the `images` directory.

After CDK has successfully finished deploying, head to the S3 console screen and upload images you want to process to a path in the S3 bucket. For this exercise, we’ve added every image to the `images` directory, as seen in Figure 2.

For larger datasets, utilize the AWS CLI tool to sync your local directory with the S3 bucket. In that case, consider enabling the ‘Transfer acceleration’ option of your S3 bucket for faster data transfers. However, this will incur an additional fee.

3. Trigger batch job execution

Once CDK has completed provisioning our infrastructure and we’ve uploaded the image data we want to process, open the newly created AWS Lambda function in the AWS console in order to trigger the batch job execution.

To do this, create a test event with the following JSON body:

{
  "Paths": [
    "images"
  ]
}

The JSON body that we provide as input to the AWS Lambda function defines a list of paths to directories in the S3 bucket containing images. Being able to provide these paths dynamically lets us combine multiple data sources into a single AWS Batch job execution. Furthermore, if we decide in the future to put an API Gateway in front of the Lambda function, we could pass every parameter of the batch job with a simple HTTP method call.

In this example, we specified just one path to the `images` directory in the S3 bucket, which we populated with images in the previous step.
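
On the receiving side, a handler would typically validate this input before doing any work. The following sketch shows one way to do that; it is not the sample project's Lambda code, and the function name as well as the choice to strip slashes and drop duplicates are assumptions made for illustration.

```python
import json


def paths_from_event(event):
    """Extract and validate the list of S3 directory paths from the event."""
    paths = event.get("Paths")
    if not isinstance(paths, list) or not paths:
        raise ValueError('event must contain a non-empty "Paths" list')
    if not all(isinstance(p, str) and p for p in paths):
        raise ValueError("every path must be a non-empty string")
    # Normalize to prefixes without leading or trailing slashes and
    # drop duplicates while keeping the original order.
    seen, result = set(), []
    for p in paths:
        prefix = p.strip("/")
        if prefix not in seen:
            seen.add(prefix)
            result.append(prefix)
    return result


event = json.loads('{"Paths": ["images", "images/"]}')
print(paths_from_event(event))  # ['images']
```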

Console screen explained in post.

Figure 3: AWS Lambda console screen of the function that triggers batch job execution. Modify the batch size by modifying the `image_batch_limit` variable. The value of this variable will depend on your particular use-case, computation type, image sizes, as well as processing time requirements.

The Python code will list every path under the images S3 path, group the paths into batches of the desired size, and finally save each batch of paths as a txt file under the tmp S3 path. The path to each txt file in S3 will then be passed as input to a batch job.
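
The batching step itself is simple list slicing. Here is a minimal sketch of the idea, reusing the `image_batch_limit` name from Figure 3; the helper names and the one-path-per-line file layout are assumptions, not the project's actual implementation.

```python
def batch_paths(paths, image_batch_limit):
    """Split `paths` into consecutive lists of at most `image_batch_limit` items."""
    if image_batch_limit < 1:
        raise ValueError("image_batch_limit must be at least 1")
    return [
        paths[i:i + image_batch_limit]
        for i in range(0, len(paths), image_batch_limit)
    ]


def batch_file_bodies(paths, image_batch_limit):
    """Render each batch as the newline-separated text of one txt file."""
    return ["\n".join(batch) for batch in batch_paths(paths, image_batch_limit)]
```

With five image paths and a limit of two, this yields three batches (two, two, and one path), and therefore three txt files and three batch jobs.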

Select the newly created event, and then trigger the Lambda function execution. The AWS Lambda function will submit the AWS Batch jobs to the provisioned AWS Batch compute environment.

Batch job explained in post.

Figure 4: Screenshot of a running AWS Batch job that creates feature vectors from images and stores them to DynamoDB.

Once the AWS Lambda function finishes executing, we can monitor the AWS Batch jobs being processed on the AWS console screen, as seen in Figure 4. Wait until every job has finished successfully.
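
Instead of watching the console, the job statuses can also be checked programmatically. The sketch below assumes job records shaped like the `jobs` list returned by the boto3 Batch client's `describe_jobs` call. The helper names are hypothetical, and `fetch_jobs` needs valid AWS credentials and real job IDs, which is why boto3 is imported only inside that function.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping

# States in which an AWS Batch job has stopped for good.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def summarize_jobs(jobs: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count jobs per status value."""
    return dict(Counter(job["status"] for job in jobs))


def still_running(jobs: Iterable[Mapping[str, str]]) -> bool:
    """True while at least one job has not reached a terminal state."""
    return any(job["status"] not in TERMINAL_STATES for job in jobs)


def all_succeeded(jobs: Iterable[Mapping[str, str]]) -> bool:
    """True only if there is at least one job and every job succeeded."""
    jobs = list(jobs)
    return bool(jobs) and all(job["status"] == "SUCCEEDED" for job in jobs)


def fetch_jobs(job_ids: List[str]) -> List[Mapping[str, str]]:
    """Fetch job records from AWS Batch (requires credentials; not run here)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    return boto3.client("batch").describe_jobs(jobs=job_ids)["jobs"]
```

A polling loop could call `fetch_jobs` every few seconds until `still_running` returns False, and then use `all_succeeded` to decide whether to continue to the next step.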

4. View results in DynamoDB

Image vectorization results.

Figure 5: Image vectorization results stored for each image as an entry in the DynamoDB table.

Once every batch job is successfully finished, go to the DynamoDB AWS cloud console and see the feature vectors stored as strings obtained from the numpy tostring method, as well as other data we stored in the table.

When you are ready to access the vectors in one of your projects, utilize the code snippet provided here:

#!/usr/bin/env python3

import numpy as np
import boto3


def vector_from(item):
    """
    item : DynamoDB response item object
    """
    # The vector is stored as raw bytes; DataType and Dimension describe how to decode it.
    vector = np.frombuffer(item['Vector'].value, dtype=item['DataType'])
    assert len(vector) == item['Dimension']
    return vector


def vectors_from_dydb(dynamodb, table_name, image_ids):
    """
    dynamodb : DynamoDB service resource (created with boto3.resource)
    table_name : Name of the DynamoDB table
    image_ids : List of id's to query the DynamoDB table for
    """
    response = dynamodb.batch_get_item(
        RequestItems={table_name: {'Keys': [{'ImageId': val} for val in image_ids]}},
    )

    query_vectors = [vector_from(item) for item in response['Responses'][table_name]]
    query_image_ids = [item['ImageId'] for item in response['Responses'][table_name]]

    return zip(query_vectors, query_image_ids)


def process_entry(vector, image_id):
    """
    NOTE - Add your code here.
    """


def main():
    """
    Reads vectors from the batch job DynamoDB table containing the vectorization results.
    """
    dynamodb = boto3.resource('dynamodb', region_name='eu-central-1')
    table_name = 'aws-blog-batch-job-image-transform-dynamodb-table'

    image_ids = ['B000KT6OK6', 'B000KTC6X0', 'B000KTC6XK', 'B001B4THHG']

    for vector, image_id in vectors_from_dydb(dynamodb, table_name, image_ids):
        process_entry(vector, image_id)


if __name__ == "__main__":
    main()

This code snippet will utilize the boto3 client to access the results stored in the DynamoDB table. Make sure to update the code variables, as well as to modify this implementation to one that fits your use-case.

5. Tear down the infrastructure using CDK

To finish off the exercise, we will tear down the infrastructure that we have provisioned. Since we are using CDK, this is very simple — go to the project root directory and run:

cdk destroy

After a confirmation prompt, the infrastructure tear-down should be underway. If you want to follow the process in more detail, then go to the CloudFormation console view and monitor the process from there.

NOTE: The S3 Bucket, ECR image, and DynamoDB table resource will not be deleted, since the current CDK code defaults to RETAIN behavior in order to prevent the deletion of data we stored there. Once you are sure that you don’t need them, remove those remaining resources manually or modify the CDK code for desired behavior.


Conclusion

In this post, we solved an embarrassingly parallel job: creating vector embeddings from images using AWS Batch. We provisioned the infrastructure using Python CDK, uploaded sample images, submitted AWS Batch jobs for execution, read the results from the DynamoDB table, and, finally, destroyed the AWS cloud resources that we provisioned at the beginning.

AWS Batch serves as a good compute environment for various jobs. For this one in particular, we can scale the processing to more compute resources with minimal or no modifications to our deep learning models and supporting code. On the other hand, it lets us potentially reduce costs by utilizing smaller compute resources and longer execution times.

The code serves as a good starting point for experimenting further with AWS Batch in a Deep Learning/Machine Learning setup. You could extend it to utilize EC2 instances with GPUs instead of CPUs, utilize Spot instances instead of on-demand ones, utilize AWS Step Functions to automate process orchestration, utilize Amazon SQS as a mechanism to distribute the workload, move the Lambda job submission to another compute resource, or tailor your project for anything else you might need AWS Batch to do.

And that brings us to the conclusion of this post. Thanks for reading, and feel free to leave a comment below if you have any questions. Also, if you enjoyed reading this post, make sure to share it with your friends and colleagues!

About the author

Filip Saina

Filip is a Software Development Engineer at Amazon working in a Computer Vision team. He works with researchers and engineers across Amazon to develop and deploy Computer Vision algorithms and ML models into production systems. Besides day-to-day coding, his responsibilities also include architecting and implementing distributed systems in AWS cloud for scalable ML applications.

Anomaly Detection in AWS Lambda using Amazon DevOps Guru’s ML-powered insights

Post Syndicated from Harish Vaswani original https://aws.amazon.com/blogs/devops/anomaly-detection-in-aws-lambda-using-amazon-devops-gurus-ml-powered-insights/

Critical business applications are monitored in order to prevent anomalies from negatively impacting their operational performance and availability. Amazon DevOps Guru is a Machine Learning (ML) powered solution that aids operations by detecting anomalous behavior and providing insights and recommendations for how to address the root cause before it impacts the customer.

This post demonstrates how Amazon DevOps Guru can detect an anomaly following a critical AWS Lambda function deployment and its remediation recommendations to fix such behavior.

Solution Overview

Amazon DevOps Guru lets you monitor resources at the region or AWS CloudFormation level. This post will demonstrate how to deploy an AWS Serverless Application Model (AWS SAM) stack, and then enable Amazon DevOps Guru to monitor the stack.

You will utilize the following services:

  • AWS Lambda
  • Amazon EventBridge
  • Amazon DevOps Guru

The architecture diagram shows an AWS SAM stack containing AWS Lambda and Amazon EventBridge resources, as well as Amazon DevOps Guru monitoring the resources in the AWS SAM stack.

Figure 1: Amazon DevOps Guru monitoring the resources in an AWS SAM stack

This post simulates a real-world scenario where an anomaly is introduced in the AWS Lambda function in the form of latency. While the AWS Lambda function execution time is within its timeout threshold, it is not at optimal performance. This anomalous execution time can result in larger compute times and costs. Furthermore, this post demonstrates how Amazon DevOps Guru identifies this anomaly and provides recommendations for remediation.

Here is an overview of the steps that we will conduct:

  1. First, we will deploy the AWS SAM stack containing a healthy AWS Lambda function with an Amazon EventBridge rule to invoke it on a regular basis.
  2. We will enable Amazon DevOps Guru to monitor the stack, which will show the AWS Lambda function as healthy.
  3. After waiting for a period of time, we will make changes to the AWS Lambda function in order to introduce an anomaly and redeploy the AWS SAM stack. This anomaly will be identified by Amazon DevOps Guru, which will mark the AWS Lambda function as unhealthy, provide insights into the anomaly, and provide remediation recommendations.
  4. After making the changes recommended by Amazon DevOps Guru, we will redeploy the stack and observe Amazon DevOps Guru marking the AWS Lambda function healthy again.

This post also explores utilizing Provisioned Concurrency for AWS Lambda functions and the best practice approach of utilizing Warm Start for variables reuse.


Before beginning, note the costs associated with each resource. The AWS Lambda function will incur a fee based on the number of requests and duration, while Amazon EventBridge is free. With Amazon DevOps Guru, you only pay for the data analyzed. There is no upfront cost or commitment. Learn more about the pricing per resource here.


To complete this post, you need the following prerequisites:

Getting Started

We will set up an application stack in our AWS account that contains an AWS Lambda and an Amazon EventBridge event. The event will regularly trigger the AWS Lambda function, which simulates a high-traffic application. To get started, please follow the instructions below:

  1. In your local terminal, clone the amazon-devopsguru-samples repository.
git clone https://github.com/aws-samples/amazon-devopsguru-samples.git
  2. In your IDE of choice, open the amazon-devopsguru-samples repository.
  3. In your terminal, change directories into the repository’s subfolder amazon-devopsguru-samples/generate-lambda-devopsguru-insights.
cd amazon-devopsguru-samples/generate-lambda-devopsguru-insights
  4. Utilize the SAM CLI to conduct a guided deployment of lambda-template.yaml.
sam deploy --guided --template lambda-template.yaml
    Stack Name [sam-app]: DevOpsGuru-Sample-AnomalousLambda-Stack
    AWS Region [us-east-1]: us-east-1
    #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
    Confirm changes before deploy [y/N]: y
    #SAM needs permission to be able to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]: y
    Save arguments to configuration file [Y/n]: y
    SAM configuration file [samconfig.toml]: samconfig.toml
    SAM configuration environment [default]: default

You should see a success message in your terminal, such as:

Successfully created/updated stack - DevOpsGuru-Sample-AnomalousLambda-Stack in us-east-1.

Enabling Amazon DevOps Guru

Now that we have deployed our application stack, we can enable Amazon DevOps Guru.

  1. Log in to your AWS Account.
  2. Navigate to the Amazon DevOps Guru service page.
  3. Click “Get started”.
  4. In the “Amazon DevOps Guru analysis coverage” section, select “Choose later”, then click “Enable”.

Amazon DevOps Guru analysis coverage menu which asks which AWS resources to analyze. The “Choose later” option is selected.

Figure 2.1: Amazon DevOps Guru analysis coverage menu

  5. On the left-hand menu, select “Settings”.
  6. In the “DevOps Guru analysis coverage” section, click on “Manage”.
  7. Select the “Analyze all AWS resources in the specified CloudFormation stacks in this Region” radio button.
  8. The stack created in the previous section should appear. Select it, click “Save”, and then “Confirm”.

Amazon DevOps Guru analysis coverage menu which asks which AWS resources to analyze. The “Analyze all AWS resources in the specified CloudFormation stacks in this Region” option is selected and CloudFormation stacks are displayed to choose from.

Figure 2.2: Amazon DevOps Guru analysis coverage resource selection
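
The console steps above can also be scripted. The following sketch builds the request for the DevOps Guru `UpdateResourceCollection` API and sends it through the boto3 `devops-guru` client. The function names and the default region are assumptions made for illustration, and `enable_coverage` needs AWS credentials with DevOps Guru permissions, so boto3 is imported only inside it.

```python
from typing import Any, Dict, Sequence


def build_coverage_request(stack_names: Sequence[str]) -> Dict[str, Any]:
    """Build keyword arguments that add CloudFormation stacks to the analysis coverage."""
    if not stack_names:
        raise ValueError("at least one CloudFormation stack name is required")
    return {
        "Action": "ADD",
        "ResourceCollection": {"CloudFormation": {"StackNames": list(stack_names)}},
    }


def enable_coverage(stack_names: Sequence[str], region_name: str = "us-east-1") -> None:
    """Send the request to DevOps Guru (requires credentials; not run here)."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("devops-guru", region_name=region_name)
    client.update_resource_collection(**build_coverage_request(stack_names))


if __name__ == "__main__":
    # Print the request only; call enable_coverage(...) to actually apply it.
    print(build_coverage_request(["DevOpsGuru-Sample-AnomalousLambda-Stack"]))
```

Separating the request builder from the API call keeps the payload easy to inspect before anything is changed in the account.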

Before moving on to the next section, we must allow Amazon DevOps Guru to baseline the resources and benchmark the application’s normal behavior. For our serverless stack with two resources, we recommend waiting two hours before carrying out the next steps. When enabled in a production environment, depending upon the number of resources selected for monitoring, it can take up to 24 hours for Amazon DevOps Guru to complete baselining.

Once baselining is complete, the Amazon DevOps Guru dashboard, an overview of the health of your resources, will display the application stack, DevOpsGuru-Sample-AnomalousLambda-Stack, and mark it as healthy, shown below.

Amazon DevOps Guru Dashboard displays the system health summary and system health overview of each CloudFormation stack. The DevOpsGuru-Sample-AnomalousLambda-Stack is marked as healthy with 0 reactive insights and 0 proactive insights.

Figure 2.3: Amazon DevOps Guru Healthy Dashboard

Enabling SNS

If you would like to set up notifications upon the detection of an anomaly by Amazon DevOps Guru, then please follow these additional instructions.

Amazon DevOps Guru Specify an SNS topic menu which enables notifications for important DevOps Guru events. No SNS topics are currently configured.

Figure 3: Amazon DevOps Guru Specify an SNS topic

Invoking an Anomaly

Once Amazon DevOps Guru has identified the stack as healthy, we will update the AWS Lambda function with suboptimal code. This change simulates an update to a critical business application that causes anomalous performance.

  1. Open the amazon-devopsguru-samples repository in your IDE.
  2. Open the file generate-lambda-devopsguru-insights/lambda-code.py
  3. Uncomment lines 7-8 and save the file. These lines of code will produce an anomaly due to the function’s increased runtime.
  4. Deploy these updates to your stack by running:
cd generate-lambda-devopsguru-insights
sam deploy --template lambda-template.yaml --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack

Anomaly Overview

Shortly after, Amazon DevOps Guru will generate a reactive insight from the sample stack. This insight contains recommendations, metrics, and events related to anomalous behavior. View the unhealthy stack status in the Dashboard.

Amazon DevOps Guru Dashboard displays the system health summary and system health overview of each CloudFormation stack. The DevOpsGuru-Sample-AnomalousLambda-Stack is marked as unhealthy with 1 reactive insights and 0 proactive insights.

Figure 4.1: Amazon DevOps Guru Unhealthy Dashboard

By clicking on the “Ongoing reactive insight” within the tile, you will be brought to the Insight Details page. This page contains an array of useful information to help you understand and address anomalous behavior.

Insight overview

Utilize this section to get a high-level overview of the insight. You can see that the status of the insight is ongoing, 1 AWS CloudFormation stack is affected, the insight started on Sept-08-2021, it does not have an end time, and it was last updated on Sept-08-2021.

Amazon DevOps Guru Insight Details page has multiple information sections. The Insight overview is the first section which displays the status is ongoing, there is 1 affected stack, the start time and last updated time. The end time is empty as the insight is ongoing.

Figure 4.2: Amazon DevOps Guru Ongoing Reactive Insight Overview

Aggregated metrics

The Aggregated metrics tab displays metrics related to the insight. The table is grouped by AWS CloudFormation stacks and subsequent resources that created the metrics. In this example, the insight was a product of an anomaly in the “duration p50” metric generated by the “DevOpsGuruSample-AnomalousLambda” AWS Lambda function.

AWS Lambda duration metrics support percentile statistics, which exclude the outlier values that skew average and maximum statistics. The P50 statistic is the median: 50% of the observed durations exceed the P50 value and 50% are less than it, which typically makes it a good middle estimate.
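
To make the difference between the median and the average concrete, here is a small sketch using Python's standard `statistics` module. The duration values are invented for illustration, and CloudWatch computes its percentile statistics with its own implementation, so treat this only as a demonstration of why a single outlier moves the mean far more than the P50.

```python
import statistics
from typing import Sequence


def p50(durations_ms: Sequence[float]) -> float:
    """Return the median (50th percentile) of the given durations."""
    return statistics.median(durations_ms)


if __name__ == "__main__":
    # Hypothetical invocation durations in milliseconds, with one slow outlier.
    durations = [102, 98, 105, 99, 101, 100, 3000]
    print("mean:", statistics.mean(durations))
    print("p50: ", p50(durations))
```

With these sample values the single 3000 ms invocation pulls the mean above 500 ms, while the P50 stays close to 100 ms.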

The red lines on the timeline indicate spans of time when the “duration p50” metric emitted unusual values. Click the red line in the timeline in order to view detailed information.

  • Choose View in CloudWatch to see how the metric looks in the CloudWatch console. For more information, see Statistics and Dimensions in the Amazon CloudWatch User Guide.
  • Hover over the graph in order to view details about the anomalous metric data and when it occurred.
  • Choose the box with the downward arrow to download a PNG image of the graph.

Amazon DevOps Guru Insight Details page contains aggregated metrics. The Duration p50 metric is selected and displayed in graph form.

Figure 4.3: Amazon DevOps Guru Ongoing Reactive Insight Aggregated Metrics

Graphed anomalies

The Graphed anomalies tab displays detailed graphs for each of the insight’s anomalies. Because our insight consisted of a single anomaly, there is one tile with details about unusual behavior detected in related metrics.

  • Choose View all statistics and dimensions in order to see details about the anomaly. In the window that opens, you can:
      • Choose View in CloudWatch in order to see how the metric looks in the CloudWatch console.
      • Hover over the graph to view details about the anomalous metric data and when it occurred.
      • Choose Statistics or Dimension in order to customize the graph’s display. For more information, see Statistics and Dimensions in the Amazon CloudWatch User Guide.

Amazon DevOps Guru Insight Details page contains Graphed anomalies. The p50 metric of the AWS/Lambda duration is displayed in graph form.

Figure 4.4: Amazon DevOps Guru Ongoing Reactive Insight Graphed Anomaly

Related events

In Related events, view AWS CloudTrail events related to your insight. These events help understand, diagnose, and address the underlying cause of the anomalous behavior. In this example, the events are:

  1. CreateFunction – when we created and deployed the AWS SAM template containing our AWS Lambda function.
  2. CreateChangeSet – when we pushed updates to our stack via the AWS SAM CLI.
  3. UpdateFunctionCode – when the AWS Lambda function code was updated.

Continuation of figure 4.4

Figure 4.5: Amazon DevOps Guru Ongoing Reactive Insight Related Events


Recommendations

The final section in the Insight Detail page is Recommendations. You can view suggestions that might help you resolve the underlying problem. When Amazon DevOps Guru detects anomalous behavior, it attempts to create recommendations. An insight might contain one, multiple, or zero recommendations.

In this example, the Amazon DevOps Guru recommendation matches the best resolution to our problem: provisioned concurrency.

Amazon DevOps Guru Insight Details page contains Recommendations. The suggested recommendation is to configure provisioned concurrency for the AWS Lambda.

Figure 4.6: Amazon DevOps Guru Ongoing Reactive Insight Recommendations

Understanding what happened

Amazon DevOps Guru recommends enabling Provisioned Concurrency for the AWS Lambda function in order to help it scale better when responding to concurrent requests. As mentioned earlier, Provisioned Concurrency keeps functions initialized by creating the requested number of execution environments so that they can respond to invocations. This is a suggested best practice when building high-traffic applications, such as the one that this sample is mimicking.

In the anomalous AWS Lambda function, we have sample code that is causing delays. This is analogous to application initialization logic within the handler function. It is a best practice for this logic to live outside of the handler function. Because we are mimicking a high-traffic application, the expectation is to receive a large number of concurrent requests. Therefore, it may be advisable to turn on Provisioned Concurrency for the AWS Lambda function. For Provisioned Concurrency pricing, refer to the AWS Lambda Pricing page.
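
The practice described above, keeping initialization logic outside the handler so that warm invocations reuse it, can be sketched as follows. This is a minimal illustration rather than the sample repository's `lambda-code.py`: `expensive_setup`, `CONFIG`, `INIT_COUNT`, and the 50 ms delay are hypothetical stand-ins for real initialization work.

```python
import time

# Counts how often the setup code has run in this execution environment.
INIT_COUNT = 0


def expensive_setup():
    """Stand-in for slow initialization, such as loading config or opening connections."""
    global INIT_COUNT
    INIT_COUNT += 1
    time.sleep(0.05)  # hypothetical delay that would otherwise be paid on every call
    return {"greeting": "hello"}


# Module-level code runs once when the execution environment is initialized,
# not on every invocation, so warm invocations reuse CONFIG.
CONFIG = expensive_setup()


def lambda_handler(event, context):
    """Handle one invocation using the already-initialized CONFIG."""
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"{CONFIG['greeting']} {name}"}
```

If `expensive_setup()` were called inside `lambda_handler` instead, every invocation would pay the delay, which is the kind of duration increase that the insight in this walkthrough reports.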

Resolving the Anomaly

To resolve the sample application’s anomaly, we will update the AWS Lambda function code and enable provisioned concurrency for the AWS Lambda infrastructure.

  1. Open the sample repository in your IDE.
  2. Open the file generate-lambda-devopsguru-insights/lambda-code.py.
  3. Move lines 7-8, the code forcing the AWS Lambda function to respond slowly, above the lambda_handler function definition.
  4. Save the file.
  5. Open the file generate-lambda-devopsguru-insights/lambda-template.yaml.
  6. Uncomment lines 15-17, the code enabling provisioned concurrency in the sample AWS Lambda function.
  7. Save the file.
  8. Deploy these updates to your stack.
cd generate-lambda-devopsguru-insights 
sam deploy --template lambda-template.yaml --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack       

After completing these steps, the duration P50 metric will emit more typical results, thereby causing Amazon DevOps Guru to recognize the anomaly as fixed, and then close the reactive insight as shown below.

Amazon DevOps Guru Insight Summary page displays the reactive insight has been closed.

Figure 5: Amazon DevOps Guru Closed Reactive Insight

Clean Up

When you are finished walking through this post, you will have multiple test resources in your AWS account that should be cleaned up or un-provisioned in order to avoid incurring any further charges.

  1. Open the sample repository in your IDE.
  2. Run the below AWS SAM CLI command to delete the sample stack.
cd generate-lambda-devopsguru-insights 
sam delete --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack 


As seen in the example above, Amazon DevOps Guru can detect anomalous behavior in an AWS Lambda function, tie it to relevant events that introduced that anomaly, and provide recommendations for remediation by using its pre-trained ML models. All of this was possible by simply enabling Amazon DevOps Guru to monitor the resources with minimal configuration changes and no previous ML expertise. Start using Amazon DevOps Guru today.

About the authors

Harish Vaswani

Harish Vaswani is a Senior Cloud Application Architect at Amazon Web Services. He specializes in architecting and building cloud native applications and enables customers with best practices in their cloud journey. He is a DevOps and Machine Learning enthusiast. Harish lives in New Jersey and enjoys spending time with his family, filmmaking and music production.

Caroline Gluck

Caroline Gluck is a Cloud Application Architect at Amazon Web Services based in New York City, where she helps customers design and build cloud native Data Science applications. Caroline is a builder at heart, with a passion for serverless architecture and Machine Learning. In her spare time, she enjoys traveling, cooking, and spending time with family and friends.

Generating DevOps Guru Proactive Insights for Amazon ECS

Post Syndicated from Trishanka Saikia original https://aws.amazon.com/blogs/devops/generate-devops-guru-proactive-insights-in-ecs-using-container-insights/

Monitoring is fundamental to operating an application in production, since we can only operate what we can measure and alert on. As an application evolves, or the environment grows more complex, it becomes increasingly challenging to maintain monitoring thresholds for each component, and to validate that they’re still set to an effective value. We not only want monitoring alarms to trigger when needed, but also want to minimize false positives.

Amazon DevOps Guru is an AWS service that helps you effectively monitor your application by ingesting vended metrics from Amazon CloudWatch. It learns your application’s behavior over time and then detects anomalies. Based on these anomalies, it generates insights by first combining the detected anomalies with suspected related events from AWS CloudTrail, and then providing the information to you in a simple, ready-to-use dashboard when you start investigating potential issues. Amazon DevOps Guru makes use of CloudWatch Container Insights to detect issues around resource exhaustion for Amazon ECS or Amazon EKS applications. This helps in proactively detecting issues like memory leaks in your applications before they impact your users, and also provides guidance as to what the probable root causes and resolutions might be.

This post will demonstrate how to simulate a memory leak in a container running in Amazon ECS, and have it generate a proactive insight in Amazon DevOps Guru.

Solution Overview

The following diagram shows the environment we’ll use for our scenario. The container “brickwall-maker” is preconfigured with the rate at which it allocates memory, and we have built this container image and published it to our public Amazon ECR repository. Optionally, you can build and host the Docker image in your own private repository as described in steps 2 and 3.
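
To give a feel for what a container like this does, the sketch below allocates a fixed amount of memory at a fixed interval and never releases it. It is not the brickwall-maker source code: the function name, chunk size, step count, and interval are all hypothetical, and the example run deliberately holds only a few megabytes.

```python
import time
from typing import List


def leak_memory(chunk_bytes: int, steps: int, interval_seconds: float = 0.0) -> List[bytearray]:
    """Allocate `chunk_bytes` per step and keep every chunk referenced so it is never freed."""
    held: List[bytearray] = []
    for _ in range(steps):
        held.append(bytearray(chunk_bytes))  # zero-filled buffer that stays alive
        if interval_seconds:
            time.sleep(interval_seconds)
    return held


if __name__ == "__main__":
    # Small, bounded run for demonstration; a real leak simulator would loop indefinitely.
    held = leak_memory(chunk_bytes=1024 * 1024, steps=5, interval_seconds=0.1)
    total_mib = sum(len(chunk) for chunk in held) / 2**20
    print(f"holding {total_mib:.0f} MiB")
```

The steady, roughly linear growth produced by a loop like this is what makes the memory utilization metric trend toward its limit over time.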

After creating the container image, we’ll utilize an AWS CloudFormation template to create an ECS Cluster and an ECS Service called “Test” with a desired count of two. This will create two tasks using our “brickwall-maker” container image. The stack will also enable Container Insights for the ECS Cluster. Then, we will enable resource coverage for this CloudFormation stack in Amazon DevOps Guru in order to start our resource analysis.

Architecture Diagram showing the service “Test” using the container “brickwall-maker” with a desired count of two. The two ECS tasks’ vended metrics are then processed by CloudWatch Container Insights. Both CloudWatch Container Insights and CloudTrail are ingested by Amazon DevOps Guru, which then makes detected insights available to the user.

Source provided on GitHub:

  • DevOpsGuru.yaml
  • EnableDevOpsGuruForCfnStack.yaml
  • Docker container source


1. Create your IDE environment

In the AWS Cloud9 console, click Create environment, give your environment a Name, and click Next step. On the Environment settings page, change the instance type to t3.small, and click Next step. On the Review page, make sure that the Name and Instance type are set as intended, and click Create environment. The environment creation will take a few minutes. After that, the AWS Cloud9 IDE will open, and you can continue working in the terminal tab displayed in the bottom pane of the IDE.

Install the following prerequisite packages, and ensure that you have docker installed:

sudo yum install -y docker
sudo service docker start 
docker --version

Clone the git repository in order to download the required CloudFormation templates and code:

git clone https://github.com/aws-samples/amazon-devopsguru-brickwall-maker

Change to the directory that contains the cloned repository

cd amazon-devopsguru-brickwall-maker

2. Optional : Create ECR private repository

If you want to build your own container image and host it in your own private ECR repository, create a new repository with the following command and then follow the steps to prepare your own image:

aws ecr create-repository --repository-name brickwall-maker

3. Optional: Prepare Docker Image

Authenticate to Amazon Elastic Container Registry (ECR) in the target region

aws ecr get-login-password --region ap-northeast-1 | \
    docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.ap-northeast-1.amazonaws.com
In the above command, as well as in the commands shown below, make sure that you replace 123456789012 with your own account ID.

Build brickwall-maker Docker container:

docker build -t brickwall-maker .

Tag the Docker container to prepare it to be pushed to ECR:

docker tag brickwall-maker:latest 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/brickwall-maker:latest

Push the built Docker container to ECR

docker push 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/brickwall-maker:latest

4. Launch the CloudFormation template to deploy your ECS infrastructure

To deploy your ECS infrastructure, run the following command to launch the CloudFormation template (in the ParameterValue, substitute your own private ECR URL or use our public URL):

aws cloudformation create-stack --stack-name myECS-Stack \
--template-body file://DevOpsGuru.yaml \
--parameters ParameterKey=ImageUrl,ParameterValue=public.ecr.aws/p8v8e7e5/myartifacts:brickwallv1

5. Enable DevOps Guru to monitor the ECS Application

Run the following command to enable DevOps Guru for monitoring your ECS application:

aws cloudformation create-stack \
--stack-name EnableDevOpsGuruForCfnStack \
--template-body file://EnableDevOpsGuruForCfnStack.yaml \
--parameters ParameterKey=CfnStackNames,ParameterValue=myECS-Stack

6. Wait for base-lining of resources

This step lets DevOps Guru complete the baselining of the resources and benchmark the normal behavior. For this particular scenario, we recommend waiting two days before any insights are triggered.

Unlike other monitoring tools, the DevOps Guru dashboard does not present any counters or graphs. In the meantime, you can utilize CloudWatch Container Insights to monitor the cluster-level, task-level, and service-level metrics in ECS.

7. View Container Insights metrics

  • Open the CloudWatch console.
  • In the navigation pane, choose Container Insights.
  • Use the drop-down boxes near the top to select ECS Services as the resource type to view, then select DevOps Guru as the resource to monitor.
  • The performance monitoring view will show you graphs for several metrics, including “Memory Utilization”, which you can watch increasing from here. In addition, it will show the list of tasks in the lower “Task performance” pane showing the “Avg CPU” and “Avg memory” metrics for the individual tasks.

8. Review DevOps Guru insights

When DevOps Guru detects an anomaly, it generates a proactive insight with the relevant information needed to investigate the anomaly, and it will list it in the DevOps Guru Dashboard.

You can view the insights by clicking on the number of insights displayed in the dashboard. In our case, we expect insights to be shown in the “proactive insights” category on the dashboard.

Once you have opened the insight, you will see that the insight view is divided into the following sections:

  • Insight Overview with a basic description of the anomaly. In this case, it states that Memory Utilization is approaching its limit and gives details of the stack that is affected by the anomaly.
  • Anomalous metrics consisting of related graphs and a timeline of the predicted impact time in the future.
  • Relevant events with contextual information, such as changes or updates made to the CloudFormation stack’s resources in the region.
  • Recommendations to mitigate the issue. As seen in the following screenshot, it recommends troubleshooting High CPU or Memory Utilization in ECS along with a link to the necessary documentation.
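
The idea of a predicted impact time can be illustrated with a simple straight-line extrapolation. The sketch below fits a least-squares line to memory utilization samples and estimates how long it would take to reach a limit. This is an assumption-laden illustration only: the source does not describe how DevOps Guru actually forecasts, and the function names and sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points and equally long sequences")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_limit(hours: Sequence[float], utilization_pct: Sequence[float],
                      limit_pct: float = 100.0) -> Optional[float]:
    """Hours after the last sample at which the fitted line reaches the limit.

    Returns None when utilization is flat or falling, since no crossing is predicted.
    """
    slope, intercept = fit_line(hours, utilization_pct)
    if slope <= 0:
        return None
    crossing_hour = (limit_pct - intercept) / slope
    return max(0.0, crossing_hour - hours[-1])


if __name__ == "__main__":
    # Hypothetical samples: utilization in percent, measured once per hour.
    print(hours_until_limit([0, 1, 2, 3], [10, 20, 30, 40]))
```

A real forecast would have to account for noise, seasonality, and task restarts, which a single straight line ignores.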

The following screenshot illustrates an example insight detail page from DevOps Guru:

An example of an ECS Service’s Memory Utilization approaching a limit of 100%. The metric graph shows the anomaly starting two days ago at about 22:00 with memory utilization increasing steadily until the anomaly was reported today at 18:08. The graph also shows a forecast of the memory utilization with a predicted impact of reaching 100% the next day at about 22:00.

Potentially related events on a timeline and below them a list of recommendations. Two deployment events are shown without further details on a timeline. The recommendations table links to one document on how to troubleshoot high CPU or memory utilization in Amazon ECS.


Conclusion

This post described how DevOps Guru continuously monitors resources in a particular region in your AWS account and proactively helps identify resource exhaustion problems, such as running out of memory, in advance. This helps IT operators take preventative action before a problem presents itself, thereby preventing downtime.

Cleaning up

After walking through this post, you should clean up and un-provision the resources in order to avoid incurring any further charges.

  1. To un-provision the CloudFormation stacks, on the AWS CloudFormation console, choose Stacks. Select the stack name, and choose Delete.
  2. Delete the AWS Cloud9 environment.
  3. Delete the ECR repository.

About the authors

Trishanka Saikia

Trishanka Saikia is a Technical Account Manager for AWS. She is also a DevOps enthusiast and works with AWS customers to design, deploy, and manage their AWS workloads/architectures.

Gerhard Poul

Gerhard Poul is a Senior Solutions Architect at Amazon Web Services based in Vienna, Austria. Gerhard works with customers in Austria to enable them with best practices in their cloud journey. He is passionate about infrastructure as code and how cloud technologies can improve IT operations.