Tag Archives: automation

Terraform CI/CD and testing on AWS with the new Terraform Test Framework

2024-04-03 Kevon Mayers

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/terraform-ci-cd-and-testing-on-aws-with-the-new-terraform-test-framework/

Graphic created by Kevon Mayers

Introduction

Organizations often use Terraform Modules to orchestrate complex resource provisioning and provide a simple interface for developers to enter the required parameters to deploy the desired infrastructure. Modules enable code reuse and provide a method for organizations to standardize deployment of common workloads such as a three-tier web application, a cloud networking environment, or a data analytics pipeline. When building Terraform modules, it is common for the module author to start with manual testing. Manual testing is performed using commands such as terraform validate for syntax validation, terraform plan to preview the execution plan, and terraform apply followed by manual inspection of resource configuration in the AWS Management Console. Manual testing is prone to human error, not scalable, and can result in unintended issues. Because modules are used by multiple teams in the organization, it is important to ensure that any changes to the modules are extensively tested before the release. In this blog post, we will show you how to validate Terraform modules and how to automate the process using a Continuous Integration/Continuous Deployment (CI/CD) pipeline.

Terraform Test

Terraform test is a new testing framework that module authors can use to run unit and integration tests for Terraform modules. Terraform test can create infrastructure as declared in the module, run validation against that infrastructure, and destroy the test resources regardless of whether the test passes or fails. Terraform test also warns you if any resources cannot be destroyed. Terraform test uses the same HashiCorp Configuration Language (HCL) syntax used to write Terraform modules, so module authors do not have to learn other tools or programming languages. Module authors run the tests with the terraform test command, which is available in Terraform CLI version 1.6 or higher.

Module authors create test files with the extension *.tftest.hcl. These test files are placed in the root of the Terraform module or in a dedicated tests directory. The following elements are typically present in a Terraform test file:

Provider block: optional, used to override the provider configuration, such as selecting the AWS Region where the tests run.
Variables block: the input variables passed into the module during the test, used to supply non-default values or to override default values for variables.
Run block: used to run a specific test scenario. A test file can contain multiple run blocks, and Terraform executes them in order. In each run block you specify the command Terraform runs (plan or apply) and the test assertions. Module authors can specify conditions such as length(var.items) != 0. A full list of condition expressions can be found in the HashiCorp documentation.

Terraform runs the tests sequentially, and any failed assertions are displayed in the test output.
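As a rough sketch of how these three elements fit together, a test file might look like the following. The Region, variable value, run block name, and assertion are illustrative assumptions rather than part of the module built in the next section.

```hcl
# example.tftest.hcl (hypothetical file, for illustration only)

# Optional: override the provider configuration for the tests,
# for example to pin the AWS Region the tests run in (assumed value).
provider "aws" {
  region = "us-east-1"
}

# Input variables passed to the module under test.
variables {
  repository_name = "MyRepo"
}

# Each run block is one test scenario; run blocks execute in order.
run "plan_only_check" {
  command = plan # or apply

  assert {
    condition     = length(var.repository_name) != 0
    error_message = "repository_name must not be empty."
  }
}
```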

Basic test to validate resource creation

Now that we understand the basic anatomy of a Terraform test file, let's create basic tests to validate the functionality of the following Terraform configuration. This Terraform configuration creates an AWS CodeCommit repository whose name is prefixed with repo-.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description     = "Test repository."
}

Now we create a Terraform test file in the tests directory. See the following directory structure as an example:

├── main.tf
└── tests
    └── basic.tftest.hcl

For this first test, we will not perform any assertions; we only validate that the Terraform execution plan runs successfully. In the test file, we create a variables block to set the value of the variable repository_name. We also add a run block with command = plan to instruct Terraform test to run terraform plan. The completed test should look like the following:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan
}

Now we will run this test locally. First, ensure that you are authenticated to an AWS account, and run the terraform init command in the root directory of the Terraform module. After the provider is initialized, start the test using the terraform test command.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass

Our first test is complete: we have validated that the Terraform configuration is valid and that Terraform can generate an execution plan for the resource. Because the run block uses command = plan, no resources were actually created. Next, let's learn how to inspect resource attributes.

Create resource and validate resource name

Reusing the previous test file, we add an assert block that checks whether the CodeCommit repository name starts with the string repo- and provides an error message if the condition fails. For the assertion, we use the startswith function. See the following example:

# basic.tftest.hcl

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan

  assert {
    condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
    error_message = "CodeCommit repository name ${var.repository_name} did not start with the expected value 'repo-***'."
  }
}

Now, let’s assume that another module author made changes to the module by modifying the prefix from repo- to my-repo-. Here is the modified Terraform module.

# main.tf

variable "repository_name" {
  type = string
}
resource "aws_codecommit_repository" "test" {
  repository_name = format("my-repo-%s", var.repository_name)
  description = "Test repository."
}

We can catch this mistake by running the terraform test command again.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... fail
╷
│ Error: Test assertion failed
│
│ on tests/basic.tftest.hcl line 9, in run "test_resource_creation":
│ 9: condition = startswith(aws_codecommit_repository.test.repository_name, "repo-")
│ ├────────────────
│ │ aws_codecommit_repository.test.repository_name is "my-repo-MyRepo"
│
│ CodeCommit repository name MyRepo did not start with the expected value 'repo-***'.
╵
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... fail

Failure! 0 passed, 1 failed.

We have successfully created a unit test that uses an assertion to validate that the resource name matches the expected value. For more examples of using assertions, see the Terraform test documentation. Before we proceed to the next section, don't forget to fix the repository name in the module (revert the prefix to repo- instead of my-repo-) and re-run your Terraform test.
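A run block can contain several assert blocks, and Terraform evaluates each of them. The sketch below assumes the corrected module (prefix repo-) and the MyRepo input used earlier. The exact-match and description checks are illustrative additions, not part of the original test.

```hcl
# basic.tftest.hcl (sketch with additional assertions)

variables {
  repository_name = "MyRepo"
}

run "test_resource_creation" {
  command = plan

  # Prefix check, as in the original test.
  assert {
    condition     = startswith(aws_codecommit_repository.test.repository_name, "repo-")
    error_message = "CodeCommit repository name ${var.repository_name} did not start with the expected value 'repo-***'."
  }

  # Illustrative exact-match check on the full computed name.
  assert {
    condition     = aws_codecommit_repository.test.repository_name == "repo-${var.repository_name}"
    error_message = "Unexpected repository name: ${aws_codecommit_repository.test.repository_name}."
  }

  # The description is set directly in configuration, so it is known at plan time.
  assert {
    condition     = aws_codecommit_repository.test.description == "Test repository."
    error_message = "Repository description does not match the expected value."
  }
}
```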

Testing variable input validation

When developing Terraform modules, it is common to use variable validation as a contract test to enforce any dependencies or restrictions. For example, AWS CodeCommit limits the repository name to 100 characters. A module author can use the length function to check the length of the input variable value. We are going to use Terraform test to ensure that the variable validation works as intended. First, we modify the module to use variable validation.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
}

By default, when variable validation fails during the execution of Terraform test, the Terraform test also fails. To simulate this, create a new test file and insert the repository_name variable with a value longer than 100 characters.

# var_validation.tftest.hcl

variables {
  repository_name = "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
}

run "test_invalid_var" {
  command = plan
}

Notice that in this new test file we also set command = plan. Why? Variable validation runs during planning, before terraform apply, so we can save time and cost by skipping resource provisioning entirely. If we run this Terraform test, it will fail as expected.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... fail
╷
│ Error: Invalid value for variable
│
│ on main.tf line 1:
│ 1: variable "repository_name" {
│ ├────────────────
│ │ var.repository_name is "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
│
│ The repository name must be less than or equal to 100 characters.
│
│ This was checked by the validation rule at main.tf:3,3-13.
╵
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... fail

Failure! 1 passed, 1 failed.

For other module authors who might iterate on the module, we need to ensure that the validation condition is correct and will catch any problems with input values. In other words, we expect the validation condition to fail with the wrong input. This is especially important when we want to incorporate the contract test in a CI/CD pipeline. To prevent the test from failing because of the intentionally invalid input, we can use the expect_failures attribute. Here is the modified test file:

# var_validation.tftest.hcl

variables {
  repository_name = "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
}

run "test_invalid_var" {
  command = plan

  expect_failures = [
    var.repository_name
  ]
}

Now if we run the Terraform test, we will get a successful result.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... pass
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... pass

Success! 2 passed, 0 failed.

As you can see, the expect_failures attribute is used to test negative paths (the inputs that would cause failures when passed into a module). Assertions tend to focus on positive paths (the ideal inputs). For an additional example of a test that validates functionality of a completed module with multiple interconnected resources, see this example in the Terraform CI/CD and Testing on AWS Workshop.
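Positive and negative paths can also live in the same test file, because variables set inside a run block override the file-level variables for that run only. In this sketch the run block name test_valid_var is hypothetical; the over-long value is the same string used above.

```hcl
# var_validation.tftest.hcl (sketch combining both paths)

variables {
  repository_name = "MyRepo"
}

# Positive path: a short name must pass variable validation.
run "test_valid_var" {
  command = plan
}

# Negative path: an over-long name must be rejected by the validation rule.
run "test_invalid_var" {
  command = plan

  # Overrides the file-level value for this run only.
  variables {
    repository_name = "this_is_a_repository_name_longer_than_100_characters_7rfD86rGwuqhF3TH9d3Y99r7vq6JZBZJkhw5h4eGEawBntZmvy"
  }

  expect_failures = [
    var.repository_name
  ]
}
```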

Orchestrating supporting resources

In practice, end users use Terraform modules in conjunction with other supporting resources. For example, a CodeCommit repository is usually encrypted using an AWS Key Management Service (KMS) key. End users provide the KMS key to the module through a variable called kms_key_id. To simulate this in a test, we need to orchestrate the creation of the KMS key outside of the module. In this section we will learn how to do that. First, update the Terraform module to add an optional variable for the KMS key.

# main.tf

variable "repository_name" {
  type = string
  validation {
    condition = length(var.repository_name) <= 100
    error_message = "The repository name must be less than or equal to 100 characters."
  }
}

variable "kms_key_id" {
  type = string
  default = ""
}

resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description = "Test repository."
  kms_key_id = var.kms_key_id != "" ? var.kms_key_id : null
}

In a Terraform test, you can instruct a run block to execute another helper module. The test uses the helper module to create the supporting resources. We will create a subdirectory called setup under the tests directory with a single kms.tf file. We also create a new test file for the KMS scenario. See the updated directory structure:

├── main.tf
└── tests
    ├── setup
    │   └── kms.tf
    ├── basic.tftest.hcl
    ├── var_validation.tftest.hcl
    └── with_kms.tftest.hcl

The kms.tf file is a helper module to create a KMS key and provide its ARN as the output value.

# kms.tf

resource "aws_kms_key" "test" {
  description = "test KMS key for CodeCommit repo"
  deletion_window_in_days = 7
}

output "kms_key_id" {
  value = aws_kms_key.test.arn
}

The new test uses two separate run blocks. The first run block (setup) executes the helper module to create a KMS key; setting command = apply makes Terraform run terraform apply against the helper module. The second run block (codecommit_with_kms) then passes the KMS key ARN output by the first run as an input variable to the main module.

# with_kms.tftest.hcl

run "setup" {
  command = apply
  module {
    source = "./tests/setup"
  }
}

run "codecommit_with_kms" {
  command = apply

  variables {
    repository_name = "MyRepo"
    kms_key_id = run.setup.kms_key_id
  }

  assert {
    condition = aws_codecommit_repository.test.kms_key_id != null
    error_message = "KMS key ID attribute value is null"
  }
}
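The assertion above only checks that some key is attached. A stricter sketch, which could replace the codecommit_with_kms run block, compares the attribute to the ARN produced by the setup run. It assumes that the AWS provider records kms_key_id in the same ARN form that was passed in, which is worth confirming for your provider version.

```hcl
run "codecommit_with_kms" {
  command = apply

  variables {
    repository_name = "MyRepo"
    kms_key_id      = run.setup.kms_key_id
  }

  # Compare against the output of the earlier "setup" run block
  # (assumes the provider stores the key ARN as supplied).
  assert {
    condition     = aws_codecommit_repository.test.kms_key_id == run.setup.kms_key_id
    error_message = "Repository is not encrypted with the key created by the setup run."
  }
}
```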

Go ahead and run terraform init, followed by terraform test. You should get a successful result like the following.

❯ terraform test
tests/basic.tftest.hcl... in progress
run "test_resource_creation"... pass
tests/basic.tftest.hcl... tearing down
tests/basic.tftest.hcl... pass
tests/var_validation.tftest.hcl... in progress
run "test_invalid_var"... pass
tests/var_validation.tftest.hcl... tearing down
tests/var_validation.tftest.hcl... pass
tests/with_kms.tftest.hcl... in progress
run "setup"... pass
run "codecommit_with_kms"... pass
tests/with_kms.tftest.hcl... tearing down
tests/with_kms.tftest.hcl... pass

Success! 4 passed, 0 failed.

We have learned how to run Terraform test and develop various test scenarios. In the next section we will see how to incorporate all the tests into a CI/CD pipeline.

Terraform Tests in CI/CD Pipelines

Now that we have seen how Terraform Test works locally, let’s see how the Terraform test can be leveraged to create a Terraform module validation pipeline on AWS. The following AWS services are used:

AWS CodeCommit – a secure, highly scalable, fully managed source control service that hosts private Git repositories.
AWS CodeBuild – a fully managed continuous integration service that compiles source code, runs tests, and produces ready-to-deploy software packages.
AWS CodePipeline – a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.
Amazon Simple Storage Service (Amazon S3) – an object storage service offering industry-leading scalability, data availability, security, and performance.

Terraform module validation pipeline

In the above architecture for a Terraform module validation pipeline, the following takes place:

A developer pushes Terraform module configuration files to a git repository (AWS CodeCommit).
AWS CodePipeline begins running the pipeline. The pipeline clones the git repo and stores the artifacts to an Amazon S3 bucket.
An AWS CodeBuild project configures a compute/build environment with Checkov installed from an image fetched from Docker Hub. CodePipeline passes the artifacts (Terraform module) and CodeBuild executes Checkov to run static analysis of the Terraform configuration files.
Another CodeBuild project is configured with Terraform from an image fetched from Docker Hub. CodePipeline passes the artifacts (repo contents) and CodeBuild runs the Terraform commands to execute the tests.

CodeBuild uses a buildspec file to declare the build commands and relevant settings. Here is an example of the buildspec files for both CodeBuild Projects:

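# Checkov
version: 0.2
phases:
  pre_build:
    commands:
      - echo pre_build starting

  build:
    commands:
      - echo build starting
      - ls
      # Soft-fail run first so the report is saved even when checks fail
      - echo saving checkov output
      - checkov -s -d ./ > checkov.result.txt
      # Hard-fail run: a failed check returns a non-zero exit code and fails the build
      - echo starting checkov
      - checkov -d .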

In the above buildspec, Checkov is run against the root directory of the cloned CodeCommit repository. This directory contains the configuration files for the Terraform module. Checkov also saves the output to a file named checkov.result.txt for further review or handling if needed. If Checkov fails, the pipeline will fail.

# Terraform Test
version: 0.2
phases:
  pre_build:
    commands:
      - terraform init
      - terraform validate

  build:
    commands:
      - terraform test

In the above buildspec, the terraform init and terraform validate commands are used to initialize Terraform, then check if the configuration is valid. Finally, the terraform test command is used to run the configured tests. If any of the Terraform tests fails, the pipeline will fail.

For a full example of the CI/CD pipeline configuration, please refer to the Terraform CI/CD and Testing on AWS workshop. The module validation pipeline mentioned above is meant as a starting point. In a production environment, you might want to customize it further by adding Checkov allow-list rules, linting, checks for Terraform docs, or prerequisites such as building the code used in AWS Lambda.

Choosing various testing strategies

At this point you may be wondering when you should use Terraform test rather than other tools such as preconditions and postconditions, check blocks, or policy as code. The answer depends on your test type and use case. Terraform test is suitable for unit tests, such as validating that resources are created according to the naming specification. Variable validations and pre/postconditions are useful for contract tests of Terraform modules, for example by raising an error when input variable values do not meet the specification. As shown in the previous section, you can also use Terraform test to ensure that your contract tests work properly. Terraform test is also suitable for integration tests where you need to create supporting resources to properly test the module functionality. Lastly, check blocks are suitable for end-to-end tests where you want to validate the infrastructure state after all resources are created, for example to test whether a website is running after an S3 bucket configured for static website hosting is created.
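To make the contract-test and end-to-end options more concrete, here is a hedged sketch. The precondition is an illustrative addition to the module: because the module prepends the five-character prefix repo- to the input, an input that passes the 100-character variable validation can still produce a final name longer than CodeCommit's 100-character limit, so checking the computed name is a natural contract test. The check block assumes a hypothetical aws_s3_bucket_website_configuration resource named site and the hashicorp/http provider. Note that failed check block assertions are reported as warnings and do not stop the apply.

```hcl
# Contract test on the computed name (illustrative addition to main.tf).
resource "aws_codecommit_repository" "test" {
  repository_name = format("repo-%s", var.repository_name)
  description     = "Test repository."
  kms_key_id      = var.kms_key_id != "" ? var.kms_key_id : null

  lifecycle {
    precondition {
      condition     = length(format("repo-%s", var.repository_name)) <= 100
      error_message = "The final repository name, including the repo- prefix, must be 100 characters or fewer."
    }
  }
}

# End-to-end check (hypothetical resource "site"; requires the hashicorp/http provider).
check "website_is_up" {
  data "http" "site" {
    url = "http://${aws_s3_bucket_website_configuration.site.website_endpoint}"
  }

  assert {
    condition     = data.http.site.status_code == 200
    error_message = "The static website did not return HTTP 200."
  }
}
```

A run block can then exercise the precondition's negative path with expect_failures = [aws_codecommit_repository.test], in the same way the earlier test targets var.repository_name.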

When developing Terraform modules, you can run Terraform test with command = plan for unit and contract tests. This makes unit and contract tests faster and cheaper, because no resources are created. You should also consider the time and cost of running Terraform test against complex or large Terraform configurations, especially if you have multiple test scenarios. Terraform test keeps one or more state files in memory for each test file, so consider how to reuse the module's state when appropriate. Terraform test also provides test mocking (Terraform 1.7 and later), which allows you to test your module without creating real infrastructure.
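A minimal mocking sketch follows, assuming Terraform 1.7 or later. With mock_provider, Terraform does not call the AWS APIs for this file, so even command = apply creates no real CodeCommit repository; computed attributes such as the repository ARN receive generated placeholder values. The file and run block names are hypothetical.

```hcl
# mocked.tftest.hcl (hypothetical file; requires Terraform 1.7+)

# Replace the real AWS provider with a mock for every run in this file.
mock_provider "aws" {}

variables {
  repository_name = "MyRepo"
}

run "apply_with_mock" {
  command = apply # no real CodeCommit repository is created

  # Values set in configuration are preserved by the mock provider.
  assert {
    condition     = aws_codecommit_repository.test.repository_name == "repo-MyRepo"
    error_message = "Unexpected repository name under the mocked provider."
  }
}
```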

Conclusion

In this post, you learned how to use Terraform test and develop various test scenarios. You also learned how to incorporate Terraform test in a CI/CD pipeline. Lastly, we discussed various testing strategies for Terraform configurations and modules. For more information about Terraform test, we recommend the Terraform test documentation and tutorial. To get hands-on practice building a Terraform module validation pipeline and Terraform deployment pipeline, check out the Terraform CI/CD and Testing on AWS Workshop.


Autonomous hardware diagnostics and recovery at scale

2024-03-25 Jet Mariscal

Post Syndicated from Jet Mariscal original https://blog.cloudflare.com/autonomous-hardware-diagnostics-and-recovery-at-scale

Cloudflare’s global network spans more than 310 cities in over 120 countries. That means thousands of servers geographically spread across different data centers, running services that protect and accelerate our customers’ Internet applications. Operating hardware at such a scale means that hardware can break anywhere and at any time. In such cases, our systems are engineered such that these failures cause little to no impact. However, detecting and managing server failure at scale requires automation. This blog post aims to provide insights into the difficulties involved in handling broken servers and how we were able to simplify the process through automation.

Challenges dealing with broken servers

When a server is found to have faulty hardware and needs to be removed from production, it is considered broken and its state is set to Repair in the internal database where server status is tracked. In the past, our Data Center Operations team was essentially left to troubleshoot and diagnose broken servers on its own. The team had to go through laborious tasks like performing queries to locate and repair servers, conducting diagnostics, reviewing results, evaluating whether a server could be restored to production, and creating the necessary tickets for re-enabling servers and executing operations to put them back in production. Such effort can take hours for a single server alone, and can easily consume an engineer’s entire day.

As you can see, addressing server repairs was a labor-intensive, manual process. Additionally, many of these servers remained powered on within the racks, wasting energy. With our fleet expanding rapidly, the attention of Data Center Operations is primarily devoted to supporting this growth, leaving less time to handle servers in need of repair.

It was clear that our infrastructure was growing too fast for us to be able to handle repairs and recovery, so we had to find a better way to handle these sorts of inefficiencies in our operations. This would allow our engineers to focus on the growth of our footprint while not abandoning repair and recovery – after all, these are still huge CapEx investments and wasted capacity that otherwise would have been fully utilized.

Using automation as an autonomous system

As members of the Infrastructure Software Systems and Automation team at Cloudflare, we primarily work on building tools and automation that help reduce excess work in order to ease the pressure on our operations teams, increase productivity, and enable people to execute operations with the highest efficiency.

Our team continuously strives to challenge our existing processes and systems, finding ways we can evolve them and make significant improvements – one of which is to build not just a typical automated system but an autonomous one. Building autonomous automations means creating systems that can operate independently, without the need for constant human intervention or oversight – a perfect example of this is Phoenix.

Introducing Phoenix

Phoenix is an autonomous diagnostics and recovery automation that runs at regular intervals to discover Cloudflare data centers with servers that are broken, performing diagnostics on detection, recovering those that pass diagnostics by re-provisioning, and ultimately re-enabling those that have successfully been re-provisioned in the safest and most unobtrusive way possible – all without requiring any human intervention! Should a server fail at any point in the process, Phoenix will take care of updating relevant tickets, even pinpointing the cause of the failure, and reverting the state of the server accordingly when needed – again, all without any human intervention!

The image below illustrates the whole process:

To better understand exactly how Phoenix works, let’s dive into some details about its core functionality.

Discovery

Discovery runs every 30 minutes and selects at most two Cloudflare data centers with broken servers (servers in the Repair state) against which it can immediately execute diagnostics; both the interval and the number of data centers are configurable depending on business and operational needs. At this rate, Phoenix is able to discover and operate on all broken servers in the fleet in about 3 days. On each run, it also detects data centers that may have broken servers already queued for recovery, and ensures that the Recovery phase is executed immediately.

Diagnostics

Diagnostics takes care of running various tests across the broken servers of a selected data center in a single run, verifying viability of the hardware components, and identifying the candidates for recovery.

A diagnostic operation includes running the following:

Out-of-Band connectivity check
This check determines the reachability of a device via out-of-band network. We employ IPMI (Intelligent Platform Management Interface) to ensure proper physical connectivity and accessibility of devices. This allows for effective monitoring and management of hardware components, enhancing overall system reliability and performance. Only devices that pass this check can progress to the Node Acceptance Testing phase.
Node Acceptance Tests
We leverage an existing internally built tool called INAT (Integrated Node Acceptance Testing) that runs various test suites and cases (Hardware Validation, Performance, etc.).
For every server that needs to be diagnosed, Phoenix will send relevant system instructions to have it boot into a custom Linux boot image, internally called INAT-image. Built into this image are the various tests that need to run when the server boots up, publishing the results to an internal resource in both human-readable (HTML) and machine-readable (JSON) formats, with the latter consumed and interpreted by Phoenix. Upon completion of the boot diagnostics, the server is powered off again to ensure it is not wasting energy.

Our node acceptance tests encompass a range of evaluations, including but not limited to benchmark testing, CPU/Memory/Storage checks, drive wiping, and various other assessments. Look out for an upcoming in-depth blog post covering INAT.

A summarized diagnostics result is immediately added to the tracking ticket, including pinpointing the exact cause of a failure.

Recovery

Recovery executes what we call an expansion operation, which in its first phase will provision the servers that pass diagnostics. The second phase is to re-enable the successfully provisioned servers back to production, where only those that have been re-enabled successfully will start receiving production traffic again.

Once the broken servers pass diagnostics and move on to the first phase of recovery, we change their statuses from Repair to Pending Provision. If the servers don’t fully recover, for example because of server configuration errors or issues enabling services, Phoenix assesses the situation and returns those servers to the Repair state for additional evaluation. Additionally, if the diagnostics indicate that a server needs faulty components replaced, Phoenix notifies our Data Center Operations team for manual repairs as required, ensuring that the server is not repeatedly selected until the required part replacement is completed. This ensures any necessary human intervention can be applied promptly, making the server ready for Phoenix to rediscover in its next iteration.

An autonomous recovery operation requires infusing intelligence into the automated system so that we can fully trust that it’s able to execute an expansion operation in the safest way possible and handle situations on its own without any human intervention. To do this, we’ve made sure Phoenix is automation-aware: it knows when other automations are executing certain operations such as expansions, and will only execute an expansion when there are no ongoing provisioning operations in the target data center. Executing only when it’s safe to do so ensures that the recovery operation will not interfere with any other ongoing operations in the data center. We’ve also adjusted its tolerance for faulty hardware: it gracefully deals with misbehaving servers by quickly dropping them from the recovery candidate list so that they do not block the operation.

Visibility

While our autonomous system, Phoenix, seamlessly handles operations without human intervention, it doesn’t mean we sacrifice visibility. Transparency is a key feature of Phoenix. It meticulously logs every operation, from executing tasks to providing progress updates, and shares this information in communication channels like chat rooms and Jira tickets. This ensures a clear understanding of what Phoenix is doing at all times.

Tracking the actions taken by automation, as well as the state transitions of a server, keeps us in the loop and gives us a better understanding of what these actions were and when they were executed, providing valuable insights that help us improve not only the system but our processes as well. This operational data lets us build dashboards that allow various teams to monitor automation activities and measure their success, guide business decisions, and even answer common operational questions related to repair and recovery.

Balancing automation and empathy: Error Budgets

When we launched Phoenix, we were well aware that not every broken server can be re-enabled and successfully returned to production, and more importantly, there’s no 100% guarantee that a recovered server will be as stable as the ones with no repair history – there’s a risk that these servers could fail and end up back in Repair status again.

Although there’s no guarantee that these recovered servers won’t fail again, causing additional work for SREs due to the monitoring alerts that get triggered, what we can guarantee is that Phoenix immediately stops recoveries without any human intervention if a certain number of failures for a server are reached in a given time window. This is where we applied the concept of an Error Budget.

The Error Budget is the amount of error that automation can accumulate over a certain period of time before our SREs start being unhappy due to excessive server failures or unreliability of the system. It is empathy embedded in automation.

In the figure above, the y-axis represents the error budget. In this context, the error budget applies to the number of recovered servers that failed and were moved back to Repair state again. The x-axis represents the time unit allocated to the error budget – in this case, 24 hours. To ensure that Phoenix is strict enough in mitigating possible issues, we divide the time unit into three consecutive buckets of the same duration – representing the three “follow the sun” SRE shifts in a day. With this, Phoenix can only execute recoveries if the number of server failures is no more than 2. Additionally, Phoenix will also have to compensate succeeding time buckets by deducting the error budget of any excess failures in a given time bucket.

Phoenix will immediately stop recoveries if it exhausts its error budget prematurely. In this context, prematurely means before the end of the time unit for which the error budget was granted. Regardless of the error budget depletion rate within a time unit, the error budget is fully replenished at the beginning of each time unit, meaning the budget resets every day.

The Error Budget has helped us define and manage our tolerance for hardware failures without causing significant harm to the system or too much noise for SREs, and gave us opportunities to improve our diagnostics system. It provides a common incentive that allows both the Infrastructure Engineering and SRE teams to focus on finding the right balance between innovation and reliability.

Where we go from here

With Phoenix, we’ve not only witnessed the significant and far-reaching potential of having an autonomous automated system in our infrastructure, we’re actually reaping its benefits as well. It provides a win-win situation: it successfully recovers hardware and ensures that broken devices are powered off, which prevents them from consuming power unnecessarily while sitting idle in our racks. This reduces energy waste and contributes to sustainability efforts and cost savings. Automated processes that operate independently have freed our colleagues on various Infrastructure teams from mundane and repetitive tasks, allowing them to focus on areas where they can use their skill sets for more interesting and productive work. They have also led us to evolve our old processes for handling hardware failures and repairs, making us more efficient than ever.

Autonomous automation is a reality that is now beginning to shape the future of how we are building better and smarter systems here at Cloudflare, and we will continue to invest engineering time for these initiatives.

A huge thank you to Elvin Tan for his awesome work on INAT, and to Graeme, Darrel and David for INAT’s continuous improvements.

The Next Step in Personalization: Dynamic Sizzles

2023-11-08 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/the-next-step-in-personalization-dynamic-sizzles-4dc4ce2011ef

Authors: Bruce Wobbe, Leticia Kwok

Additional Credits: Sanford Holsapple, Eugene Lok, Jeremy Kelly

Introduction

At Netflix, we strive to give our members an excellent personalized experience, helping them make the most successful and satisfying selections from our thousands of titles. We already personalize artwork and trailers, but we hadn’t yet personalized sizzle reels — until now.

A sizzle reel is a montage of video clips from different titles strung together into a seamless A/V asset that gets members excited about upcoming launches (for example, our Emmys nominations or holiday collections). Now Netflix can create a personalized sizzle reel dynamically in real time and on demand. The order of the clips and included titles are personalized per member, giving each a unique and effective experience. These new personalized reels are called Dynamic Sizzles.

In this post, we will dive into the exciting details of how we create Dynamic Sizzles with minimal human intervention, including the challenges we faced and the solutions we developed.

An example of a Dynamic Sizzle created for the Chuseok collection (Chuseok is the Korean mid-autumn harvest festival).

Overview

In the past, each sizzle reel was created manually. The time and cost of doing this prevents scaling and misses the invaluable benefit of personalization, which is a bedrock principle at Netflix. We wanted to figure out how to efficiently scale sizzle reel production, while also incorporating personalization — all in an effort to yield greater engagement and enjoyment for our members.

Enter the creation of Dynamic Sizzles. We developed a systems-based approach that uses our interactive and creative technology to programmatically stitch together multiple video clips alongside a synced audio track. The process involves compiling personalized multi-title/multi-talent promotional A/V assets on the fly into a Mega Asset. A Mega Asset is a large A/V asset made up of video clips from various titles, acting as a library from which the Dynamic Sizzle pulls media. These clips are then used to construct a personalized Dynamic Sizzle according to a predefined cadence.

With Dynamic Sizzles, we can make more focused use of our editors’ creative work and generate a multitude of personalized sizzle reels efficiently and effectively, with up to 70% savings in time and cost compared to a manually created reel. This gives us the ability to create thousands, if not millions, of combinations of video clips and assets that result in optimized and personalized sizzle reel experiences for Netflix members.

Creating the Mega Asset

Where To Begin

Our first challenge was figuring out how to create the Mega Asset, as each video clip needs to be precise in its selection and positioning. A Mega Asset can contain any number of clips, and millions of unique Dynamic Sizzles can be produced from a single Mega Asset.

We accomplished this by using human editors to select the clips, ensuring that they are well defined from both a creative and a technical standpoint, and then laying them out in a specific, known order in a timeline. We also need each clip marked with an index to its location, which is an extremely tedious and time-consuming process for an editor. To solve this, we created an Adobe Premiere plug-in to automate the process. Further verification can also be done programmatically by ingesting the timecode data, because we can validate the structure of the Mega Asset by looking at the timecodes.

An example of a title’s video clips layout.

The above layout shows how a single title’s clips are ordered in a Mega Asset, in 3 different lengths: 160, 80, and 40 frames. Each clip should be unique per title; however, when using multiple titles, they may share the same clip lengths. This gives us more variety to choose from while maintaining a structured order in the layout.
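A minimal sketch of the timecode-based structure check mentioned above might look like the following. The per-title layout, the 24 fps non-drop-frame timecodes, and the function names are assumptions for illustration; this post doesn’t describe the real Mega Asset layout or ingestion format.

```python
FPS = 24  # assumption: non-drop-frame timecode at 24 frames per second

# Hypothetical per-title layout: clip lengths in frames, in timeline order.
TITLE_LAYOUT = [160, 80, 80, 40, 40, 40]


def to_frames(timecode: str, fps: int = FPS) -> int:
    """Convert an HH:MM:SS:FF timecode to an absolute frame count."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def validate_mega_asset(clips: list) -> list:
    """Check (start, end) timecode pairs, in timeline order, against the layout.

    `end` is treated as exclusive. Returns human-readable problems;
    an empty list means the structure matches the expected layout.
    """
    problems = []
    if len(clips) % len(TITLE_LAYOUT):
        problems.append(f"{len(clips)} clips is not a whole number of titles")
    expected_start = 0
    for index, (start_tc, end_tc) in enumerate(clips):
        start, end = to_frames(start_tc), to_frames(end_tc)
        expected_len = TITLE_LAYOUT[index % len(TITLE_LAYOUT)]
        if start != expected_start:
            problems.append(f"clip {index}: gap or overlap at {start_tc}")
        if end - start != expected_len:
            problems.append(f"clip {index}: {end - start} frames, expected {expected_len}")
        expected_start = end  # the next clip should start where this one ends
    return problems
```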

Cadence

The cadence is a predetermined collection of clip lengths that indicates when, where, and for how long a title shows within a Dynamic Sizzle. The cadence ensures that when a Dynamic Sizzle is played, it will show a balanced view of any titles chosen, while still giving more time to a member’s higher ranked titles. Cadence is something we can personalize or randomize, and will continue to evolve as needed.

In the above sample cadence, Title A refers to the highest ranked title in a member’s personalized sort, Title B the second highest, and so on. The cadence is made up of 3 distinct segments with 5 chosen titles (A-E) played in sequence using various clip lengths. Each clip in the cadence refers to a different clip in the Mega Asset. For example, the 80 frame clip for title A in the first (red) segment is different from the 80 frame clip for title A in the third (purple) segment.
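One way to represent a cadence in code, sketched in Python under assumed names: each segment is a list of (title rank, clip length) pairs, and a counter per title and length ensures that a repeated pair, such as title A’s 80-frame clip in the first and third segments, points at a different clip in the Mega Asset. The cadence values below are made up.

```python
from collections import defaultdict

# Hypothetical cadence: three segments of (title rank, clip length in frames).
# Rank 0 is the member's highest-ranked title ("Title A"), rank 1 is "Title B", and so on.
CADENCE = [
    [(0, 80), (1, 40), (2, 40), (3, 40), (4, 40)],
    [(0, 40), (1, 80), (2, 40), (0, 40)],
    [(0, 80), (1, 40), (2, 40)],
]


def expand_cadence(cadence, ranked_titles):
    """Turn a cadence into an ordered list of (title_id, length, occurrence).

    `occurrence` counts how many times that title/length pair was already
    used, so a repeated pair refers to a different clip in the Mega Asset.
    """
    used = defaultdict(int)
    plan = []
    for segment in cadence:
        for rank, length in segment:
            title = ranked_titles[rank]
            plan.append((title, length, used[(title, length)]))
            used[(title, length)] += 1
    return plan
```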

Composing the Dynamic Sizzle

Personalization

When a request comes in for a sizzle reel, our system determines which titles are in the Mega Asset and, based on the request, creates and sorts a personalized list of titles. The member’s top titles are then used to construct the Dynamic Sizzle from the clips in the Mega Asset. Higher-ranked titles get more weight in placement and allotted time.

Finding Timecodes

For the Dynamic Sizzle process, we have to quickly and dynamically determine the timecodes for each clip in the Mega Asset and make sure they are easily accessed at runtime. We accomplish this by utilizing Netflix’s Hollow technology. Hollow allows us to store timecodes for quick searches and use timecodes as a map — a key can be used to find the timecodes needed as defined by the cadence. The key can be as simple as titleId-clip-1.
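Hollow itself is a Java library, so the following is only a Python stand-in to illustrate looking up timecodes by key: a plain dict plays the role of the Hollow dataset, and the `titleId-clip-N` key format follows the example above. The function names and the title ID are hypothetical.

```python
def make_key(title_id: str, clip_number: int) -> str:
    """Build a lookup key in the titleId-clip-N style."""
    return f"{title_id}-clip-{clip_number}"


def build_index(rows) -> dict:
    """Index timecodes by key.

    rows: iterable of (title_id, clip_number, start_frame, end_frame).
    A plain dict stands in for the Hollow dataset; this is not Hollow's API.
    """
    return {make_key(t, n): (start, end) for t, n, start, end in rows}


def lookup(index: dict, title_id: str, clip_number: int) -> tuple:
    """Return (start_frame, end_frame) for one clip, failing loudly if absent."""
    key = make_key(title_id, clip_number)
    if key not in index:
        raise KeyError(f"no timecodes stored for {key}")
    return index[key]
```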

Building The Reel

The ordering of the clips is set by the predefined cadence, which dictates the final layout and makes the Dynamic Sizzle easy to build. For example, if the system knows to use title 17 within the Mega Asset, we can easily calculate the time offset for all of its clips, because the ordering of the titles and clips within the Mega Asset is known. This all comes together in the following way:

The result is a series of timecodes indicating the start and stop times for each clip. These codes appear in the order they should be played and the player uses them to construct a seamless video experience as seen in the examples below:

The Beautiful Game Sizzle — The Beautiful Game Dynamic Sizzle

With Dynamic Sizzles, each member experiences a personalized sizzle reel.

Example of what 2 different profiles might see for the same sizzle
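Because every title block is laid out the same way, the offset arithmetic for a title such as title 17 is straightforward. The Python sketch below assumes a fixed, hypothetical per-title layout and 0-based title indexes, and it works in frames; turning frames into the timecodes the player receives would additionally need the asset’s frame rate.

```python
from itertools import accumulate

# Hypothetical per-title layout (clip lengths in frames, in timeline order).
# Every title block in the Mega Asset is assumed to use the same layout.
TITLE_LAYOUT = [160, 80, 80, 40, 40, 40]
TITLE_FRAMES = sum(TITLE_LAYOUT)
# Start of each clip relative to the beginning of its title block.
CLIP_OFFSETS = [0, *accumulate(TITLE_LAYOUT)][:-1]


def clip_range(title_index: int, length: int, occurrence: int) -> tuple:
    """Return (start, end) frames of the n-th clip of `length` for a title."""
    matches = [j for j, clip_len in enumerate(TITLE_LAYOUT) if clip_len == length]
    if occurrence >= len(matches):
        raise ValueError(f"title has only {len(matches)} clips of {length} frames")
    j = matches[occurrence]
    start = title_index * TITLE_FRAMES + CLIP_OFFSETS[j]
    return start, start + length


def build_reel(plan) -> list:
    """plan: (title_index, length, occurrence) tuples in play order."""
    return [clip_range(*step) for step in plan]
```

The result is a list of start/stop pairs in play order, which is the shape of data the player needs; the design relies entirely on the Mega Asset layout being fixed, so no per-clip search is required.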

Playing the Dynamic Sizzle

Delivering To The Player

The player leverages the Mega Asset by using timecodes to know where to start and stop each clip, and then seamlessly plays each one right after the other. This required a change in the API that devices normally use to get trailers. The API change was twofold. First, on the request we need the device to indicate that it can support Dynamic Sizzles. Second, on the response the timecode list needs to be sent. (Changing the API and rolling it out took time, so this all had to be implemented before Dynamic Sizzles could actually be used, tested, and productized.)

Challenges With The Player

There were two main challenges with the player. First, in order to support features like background music across multiple unique video segments, we needed to support asymmetrical segment streaming from discontiguous locations in the Mega Asset. This involved modifying existing schemas and adding corresponding support to the player to allow for the stitching of the video and audio together separately while still keeping the timecodes in sync. Second, we needed to optimize our streaming algorithms to account for these much shorter segments, as some of our previous assumptions were incorrect when dealing with dozens of discontiguous tiny segments in the asset.

Building Great Things Together

We are just getting started on this journey to build truly great experiences. While the challenges may seem endless, the work is incredibly fulfilling. The core to bringing these great engineering solutions to life is the direct collaboration we have with our colleagues and innovating together to solve these challenges.

If you are interested in working on great technology like Dynamic Sizzles, we’d love to talk to you! We are hiring: jobs.netflix.com

The Next Step in Personalization: Dynamic Sizzles was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Perform continuous vulnerability scanning of AWS Lambda functions with Amazon Inspector

2023-07-31 Manjunath Arakere

Post Syndicated from Manjunath Arakere original https://aws.amazon.com/blogs/security/perform-continuous-vulnerability-scanning-of-aws-lambda-functions-with-amazon-inspector/

This blog post demonstrates how you can activate Amazon Inspector within one or more AWS accounts and be notified when a vulnerability is detected in an AWS Lambda function.

Amazon Inspector is an automated vulnerability management service that continually scans workloads for software vulnerabilities and unintended network exposure. Amazon Inspector scans mixed workloads like Amazon Elastic Compute Cloud (Amazon EC2) instances and container images located in Amazon Elastic Container Registry (Amazon ECR). At re:Invent 2022, we announced Amazon Inspector support for Lambda functions and Lambda layers to provide a consolidated solution for compute types.

Scanning your functions for vulnerabilities only before deployment might not be enough, because new vulnerabilities can be disclosed at any time, as happened with the widespread Apache Log4j vulnerability. It’s therefore essential that workloads are continuously monitored and rescanned in near real time as new vulnerabilities are published or workloads are changed.

Amazon Inspector scans are initiated automatically when Lambda functions are updated or when new Common Vulnerabilities and Exposures (CVEs) relevant to your function are published. No agents are needed for Amazon Inspector to work, which means you don’t need to install a library or agent in your Lambda functions or layers. When Amazon Inspector discovers a software vulnerability or network configuration issue, it creates a finding that describes the vulnerability, identifies the affected resource, rates the severity of the vulnerability, and provides remediation guidance.

In addition, Amazon Inspector integrates with several AWS services, such as Amazon EventBridge and AWS Security Hub. You can use EventBridge to build automation workflows like getting notified for a specific vulnerability finding or performing an automatic remediation with the help of Lambda or AWS Systems Manager.

In this blog post, you will learn how to do the following:

Activate Amazon Inspector in a single AWS account and AWS Region.
See how Amazon Inspector automated discovery and continuous vulnerability scanning works by deploying a new Lambda function with a vulnerable package dependency.
Receive a near real-time notification when a vulnerability with a specific severity is detected in a Lambda function with the help of EventBridge and Amazon Simple Notification Service (Amazon SNS).
Remediate the vulnerability by using the recommendation provided in the Amazon Inspector dashboard.
Activate Amazon Inspector in multiple accounts or Regions through AWS Organizations.

Solution architecture

Figure 1 shows the AWS services used in the solution and how they are integrated.

Figure 1: Solution architecture overview

The workflow for the solution is as follows:

Deploy a new Lambda function by using the AWS Serverless Application Model (AWS SAM).
Amazon Inspector scans the function when a new or updated Lambda function is deployed, or when a new vulnerability is published, and identifies vulnerabilities in the deployed Lambda function.
Amazon EventBridge receives the events from Amazon Inspector and checks against the rules for specific events or filter conditions.
In this case, an EventBridge rule exists for the Amazon Inspector findings, and the target is defined as an SNS topic to send an email to the system operations team.
The EventBridge rule invokes the target SNS topic with the event data, and an email is sent to the confirmed subscribers in the SNS topic.
The system operations team receives an email with detailed information on the vulnerability, the fixed package versions, the Amazon Inspector score to prioritize, and the impacted Lambda functions. By using the remediation information from Amazon Inspector, the team can now prioritize actions and remediate.

Prerequisites

To follow along with this demo, we recommend that you have the following in place:

An AWS account.
A command line interface: AWS CloudShell or the AWS CLI. In this post, we recommend CloudShell because it already includes Python and AWS SAM. However, you can also use your own terminal if it has the AWS CLI, the AWS SAM CLI, and Python installed.
An AWS Region where Amazon Inspector Lambda code scanning is available.
An IAM role in that account with administrator privileges.

The solution in this post includes the following AWS services: Amazon Inspector, AWS Lambda, Amazon EventBridge, AWS Identity and Access Management (IAM), Amazon SNS, AWS CloudShell and AWS Organizations for activating Amazon Inspector at scale (multi-accounts).

Step 1: Activate Amazon Inspector in a single account in the Region

The first step is to activate Amazon Inspector in your account in the Region you are using.

To activate Amazon Inspector

Sign in to the AWS Management Console.
Open AWS CloudShell. CloudShell inherits the credentials and permissions of the IAM principal who is signed in to the AWS Management Console. CloudShell comes with the CLIs and runtimes that are needed for this demo (AWS CLI, AWS SAM, and Python).
Use the following command in CloudShell to get the status of the Amazon Inspector activation.
```
aws inspector2 batch-get-account-status
```
Use the following command to activate Amazon Inspector in the default Region for the LAMBDA resource type. The other allowed resource types are EC2, ECR, and LAMBDA_CODE.
```
aws inspector2 enable --resource-types '["LAMBDA"]'
```
Use the following command to verify the status of the Amazon Inspector activation.
```
aws inspector2 batch-get-account-status
```

You should see a response that shows that Amazon Inspector is enabled for Lambda resources, as shown in Figure 2.

Figure 2: Amazon Inspector status after you enable Lambda scanning
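If you want to script this check, for example in a CI job, you can save the JSON that `batch-get-account-status` prints and parse it. The Python sketch below assumes the response shape as we understand it (an `accounts` list whose entries carry `accountId` and a per-resource `resourceState`); compare it with your own output before relying on it. The file name `status.json` is only an example.

```python
import json
import sys


def lambda_scan_status(response: dict) -> dict:
    """Map each account ID to the status of Lambda scanning in the response."""
    statuses = {}
    for account in response.get("accounts", []):
        lambda_state = account.get("resourceState", {}).get("lambda", {})
        statuses[account["accountId"]] = lambda_state.get("status", "UNKNOWN")
    return statuses


def main(path: str) -> int:
    """Print per-account status; return 0 only if every account is ENABLED."""
    with open(path) as handle:
        statuses = lambda_scan_status(json.load(handle))
    for account_id, status in statuses.items():
        print(f"{account_id}: Lambda scanning {status}")
    # An empty result (e.g. a failed call) is treated as "not enabled".
    return 0 if statuses and all(s == "ENABLED" for s in statuses.values()) else 1


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Usage: run `aws inspector2 batch-get-account-status > status.json`, then `python3 check_inspector.py status.json`. Right after activation the status may briefly read ENABLING, so a non-zero exit is worth retrying once.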

Step 2: Create an SNS topic and subscription for notification

Next, create the SNS topic and the subscription so that you will be notified of each new Amazon Inspector finding.

To create the SNS topic and subscription

Use the following commands in CloudShell to create the SNS topic and its subscription. Replace <REGION_NAME>, <AWS_ACCOUNTID>, and <YOUR_EMAIL_ADDRESS> with the relevant values.
```
aws sns create-topic --name amazon-inspector-findings-notifier

aws sns subscribe \
--topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier \
--protocol email --notification-endpoint <YOUR_EMAIL_ADDRESS>
```

Check the inbox of the email address you used for <YOUR_EMAIL_ADDRESS>, and in the email from Amazon SNS, choose Confirm subscription.
In the CloudShell console, use the following command to list the subscriptions, to verify the topic and email subscription.
```
aws sns list-subscriptions
```
You should see a response that shows subscription details like the email address and ARN, as shown in Figure 3.

Figure 3: Subscribed email address and SNS topic

Use the following command to send a test message to your subscribed email address, replacing <REGION_NAME> and <AWS_ACCOUNTID>, and verify that you receive the message.
```
aws sns publish \
    --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
    --message "Hello from Amazon Inspector2"
```

Step 3: Set up Amazon EventBridge with a custom rule and the SNS topic as target

Create an EventBridge rule that will invoke your previously created SNS topic whenever Amazon Inspector finds a new vulnerability with a critical severity.

To set up the EventBridge custom rule

In the CloudShell console, use the following command to create an EventBridge rule named amazon-inspector-findings that matches findings with an inspectorScore greater than 8 and a severity of CRITICAL.
```
aws events put-rule \
    --name "amazon-inspector-findings" \
    --event-pattern "{\"source\": [\"aws.inspector2\"],\"detail-type\": [\"Inspector2 Finding\"],\"detail\": {\"inspectorScore\": [ { \"numeric\": [ \">\", 8] } ],\"severity\": [\"CRITICAL\"]}}"
```
Refer to the topic Amazon EventBridge event schema for Amazon Inspector events to customize the event pattern for your application needs.
To verify the rule creation, go to the EventBridge console and in the left navigation bar, choose Rules.
Choose the rule with the name amazon-inspector-findings. You should see the event pattern as shown in Figure 4.

Figure 4: Event pattern for the EventBridge rule to filter on CRITICAL vulnerabilities.
Add the SNS topic you previously created as the target to the EventBridge rule. Replace <REGION_NAME>, <AWS_ACCOUNTID>, and <RANDOM-UNIQUE-IDENTIFIER-VALUE> with the relevant values. For RANDOM-UNIQUE-IDENTIFIER-VALUE, create a memorable and unique string.
```
aws events put-targets \
    --rule amazon-inspector-findings \
    --targets "Id"="<RANDOM-UNIQUE-IDENTIFIER-VALUE>","Arn"="arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier"
```
Important: Save the target ID. You will need this in order to delete the target in the last step.

Grant Amazon EventBridge permission to publish to the SNS topic amazon-inspector-findings-notifier by setting the topic’s access policy. Replace <REGION_NAME> and <AWS_ACCOUNTID> with the relevant values.
```
aws sns set-topic-attributes --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
--attribute-name Policy \
--attribute-value "{\"Version\":\"2012-10-17\",\"Id\":\"__default_policy_ID\",\"Statement\":[{\"Sid\":\"PublishEventsToMyTopic\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"events.amazonaws.com\"},\"Action\":\"sns:Publish\",\"Resource\":\"arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier\"}]}"
```
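The escaped JSON strings in the put-rule and set-topic-attributes commands are easy to get wrong. One option, sketched here in Python, is to build both documents as dictionaries, serialize them with `json.dumps`, and sanity-check the pattern against a hand-written sample event. The Region, account ID, and sample event are placeholders, and `matches` is a deliberately simplified approximation written for illustration, not EventBridge’s matching engine; for an authoritative check, use `aws events test-event-pattern`.

```python
import json

REGION = "us-east-1"          # placeholder for <REGION_NAME>
ACCOUNT_ID = "111122223333"   # placeholder for <AWS_ACCOUNTID>
TOPIC_ARN = f"arn:aws:sns:{REGION}:{ACCOUNT_ID}:amazon-inspector-findings-notifier"

# Same filter as the put-rule command: CRITICAL findings with inspectorScore > 8.
EVENT_PATTERN = {
    "source": ["aws.inspector2"],
    "detail-type": ["Inspector2 Finding"],
    "detail": {
        "inspectorScore": [{"numeric": [">", 8]}],
        "severity": ["CRITICAL"],
    },
}

# Same access policy as the set-topic-attributes command.
TOPIC_POLICY = {
    "Version": "2012-10-17",
    "Id": "__default_policy_ID",
    "Statement": [{
        "Sid": "PublishEventsToMyTopic",
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sns:Publish",
        "Resource": TOPIC_ARN,
    }],
}

OPS = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    "=": lambda a, b: a == b,
}


def _value_matches(value, allowed: list) -> bool:
    """True if value equals an allowed literal or satisfies a numeric filter."""
    for item in allowed:
        if isinstance(item, dict) and "numeric" in item:
            terms = item["numeric"]  # e.g. [">", 8] or [">", 0, "<=", 5]
            if isinstance(value, (int, float)) and all(
                OPS[op](value, bound) for op, bound in zip(terms[::2], terms[1::2])
            ):
                return True
        elif value == item:
            return True
    return False


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: nested objects, literal lists, and numeric filters only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not _value_matches(event[key], expected):
            return False
    return True


if __name__ == "__main__":
    # Paste these strings into --event-pattern and --attribute-value.
    print(json.dumps(EVENT_PATTERN))
    print(json.dumps(TOPIC_POLICY))
```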

Step 4: Deploy the Lambda function to the AWS account by using AWS SAM

In this step, you will use an AWS Serverless Application Model (AWS SAM) quick start template to build and deploy a Lambda function with a vulnerable library, in order to generate findings. Learn more about AWS SAM.

To deploy the Lambda function with a vulnerable library

In the CloudShell console, use a prebuilt “hello-world” AWS SAM template to deploy the Lambda function.
```
sam init --runtime python3.7 --dependency-manager pip --app-template hello-world --name sam-app
```
Use the following command to add the vulnerable package python-jwt==3.3.3 to the Lambda function.
```
cd sam-app;
echo -e 'requests\npython-jwt==3.3.3' > hello_world/requirements.txt
```
Use the following command to build the application.
```
sam build
```
Use the following command to deploy the application with the guided option.
```
sam deploy --guided
```
This command packages and deploys the application to your AWS account and presents a series of prompts. Respond to the prompts as follows:
1. Enter the stack name you want.
2. Accept the default options, except for the following prompts:
  1. HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: input y and press Enter.
  2. Deploy this changeset? [y/N]: input y and press Enter.

Step 5: View Amazon Inspector findings

Amazon Inspector will automatically generate findings when scanning the Lambda function previously deployed. To view those findings, follow the steps below.

To view Amazon Inspector findings for the vulnerability

Navigate to the Amazon Inspector console.
In the left navigation menu, choose All findings to see all of the Active findings, as shown in Figure 5.
Due to the custom event pattern rule in Amazon EventBridge, even though there are multiple findings for the vulnerable package python-jwt==3.3.3, you will be notified only for the finding that has InspectorScore greater than 8 and severity CRITICAL.
Choose the title of each finding to see detailed information about the vulnerability.

Figure 5: Example of findings from the Amazon Inspector console

Step 6: Remediate the vulnerability by applying the fixed package version

Now you can remediate the vulnerability by updating the package version as suggested by Amazon Inspector.

To remediate the vulnerability

In the Amazon Inspector console, in the left navigation menu, choose All Findings.
Choose the title of the vulnerability to see the finding details and the remediation recommendations.

Figure 6: Amazon Inspector finding for python-jwt, with the associated remediation
To remediate, use the following command to update the package version to the fixed version as suggested by Amazon Inspector.
```
cd /home/cloudshell-user/sam-app;
echo -e "requests\npython-jwt==3.3.4" > hello_world/requirements.txt
```
Use the following command to build the application.
```
sam build
```
Use the following command to deploy the application with the guided option.
```
sam deploy --guided
```
This command packages and deploys the application to your AWS account and presents a series of prompts. Respond to the prompts as follows:
1. Enter the stack name you want.
2. Accept the default options, except for the following prompts:
  1. HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: input y and press Enter.
  2. Deploy this changeset? [y/N]: input y and press Enter.
Amazon Inspector automatically rescans the function after its deployment and reevaluates the findings. At this point, you can navigate back to the Amazon Inspector console, and in the left navigation menu, choose All findings. In the Findings area, you can see that the vulnerabilities are moved from Active to Closed status.
Due to the custom event pattern rule in Amazon EventBridge, you will be notified by email with finding status as CLOSED.

Figure 7: Inspector rescan results, showing no open findings after remediation
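Before redeploying, you can check that requirements.txt actually pins the fixed version that Amazon Inspector recommends. The following is a minimal Python sketch, assuming you copy the fixed version from the finding by hand and that versions are plain dotted numbers; for full PEP 440 version handling, the `packaging` library’s `Version` class is the usual choice. The function names are hypothetical.

```python
def parse_pins(requirements_text: str) -> dict:
    """Return {package: version} for lines of the form name==version."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def version_tuple(version: str) -> tuple:
    # Assumption: plain dotted numeric versions such as "3.3.4".
    return tuple(int(part) for part in version.split("."))


def outdated(requirements_text: str, fixed_versions: dict) -> dict:
    """Packages pinned below the fixed version reported in a finding."""
    pins = parse_pins(requirements_text)
    result = {}
    for name, fixed in fixed_versions.items():
        pinned = pins.get(name.lower())
        if pinned is not None and version_tuple(pinned) < version_tuple(fixed):
            result[name] = (pinned, fixed)
    return result
```

For the example in this post, `outdated(open("hello_world/requirements.txt").read(), {"python-jwt": "3.3.4"})` should return an empty dict once the file pins 3.3.4.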

(Optional) Step 7: Activate Amazon Inspector in multiple accounts and Regions

To benefit from Amazon Inspector scanning capabilities across the accounts that you have in AWS Organizations and in your selected Regions, use the following steps:

To activate Amazon Inspector in multiple accounts and Regions

In the CloudShell console, use the following command to clone the code from the aws-samples inspector2-enablement-with-cli GitHub repo.

```
cd /home/cloudshell-user;
git clone https://github.com/aws-samples/inspector2-enablement-with-cli.git;
cd inspector2-enablement-with-cli
```

Follow the instructions from the README.md file.
Configure the file param_inspector2.json with the relevant values, as follows:
- inspector2_da: The delegated administrator account ID for Amazon Inspector to manage member accounts.
- scanning_type: The resource types (EC2, ECR, LAMBDA) to be enabled by Amazon Inspector.
- auto_enable: The resource types to be enabled on every account that is newly attached to the delegated administrator.
- regions: Because Amazon Inspector is a regional service, provide the list of AWS Regions to enable.
Select the AWS account that would be used as the delegated administrator account (<DA_ACCOUNT_ID>).
Delegate an account as the admin for Amazon Inspector by using the following command.
```
./inspector2_enablement_with_awscli.sh -a delegate_admin -da <DA_ACCOUNT_ID>
```

Activate the delegated admin by using the following command:

```
./inspector2_enablement_with_awscli.sh -a activate -t <DA_ACCOUNT_ID> -s all
```

Associate the member accounts by using the following command:

```
./inspector2_enablement_with_awscli.sh -a associate -t members
```

Wait five minutes.
Enable the resource types (EC2, ECR, LAMBDA) on your member accounts by using the following command:
```
./inspector2_enablement_with_awscli.sh -a activate -t members
```
Enable Amazon Inspector on the new member accounts that are associated with the organization by using the following command:
```
./inspector2_enablement_with_awscli.sh -a auto_enable
```
Check the Amazon Inspector status in your accounts and in multiple selected Regions by using the following command:
```
./inspector2_enablement_with_awscli.sh -a get_status
```

There are other options you can use to enable Amazon Inspector in multiple accounts, like AWS Control Tower and Terraform. For the reference architecture for Control Tower, see the AWS Security Reference Architecture Examples on GitHub. For more information on the Terraform option, see the Terraform aws_inspector2_enabler resource page.

Step 8: Delete the resources created in the previous steps

AWS offers a 15-day free trial for Amazon Inspector so that you can evaluate the service and estimate its cost.

To avoid potential charges, delete the AWS resources that you created in the previous steps of this solution (Lambda function, EventBridge target, EventBridge rule, and SNS topic), and deactivate Amazon Inspector.

To delete resources

In the CloudShell console, enter the sam-app folder.
```
cd /home/cloudshell-user/sam-app
```
Delete the Lambda function and confirm by typing “y” when prompted for confirmation.
```
sam delete
```
Remove the SNS target from the Amazon EventBridge rule.
```
aws events remove-targets --rule "amazon-inspector-findings" --ids <RANDOM-UNIQUE-IDENTIFIER-VALUE>
```
Note: If you don’t remember the target ID, navigate to the Amazon EventBridge console, and in the left navigation menu, choose Rules. Select the rule that you want to delete. Choose CloudFormation, and copy the ID.

Delete the EventBridge rule.

```
aws events delete-rule --name amazon-inspector-findings
```

Delete the SNS topic.

```
aws sns delete-topic --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier
```

Disable Amazon Inspector.
```
aws inspector2 disable --resource-types '["LAMBDA"]'
```
Follow the next few steps to roll back changes only if you performed the activities listed in Step 7: Activate Amazon Inspector in multiple accounts and Regions.
In the CloudShell console, enter the folder inspector2-enablement-with-cli.
```
cd /home/cloudshell-user/inspector2-enablement-with-cli
```
Deactivate the resource types (EC2, ECR, LAMBDA) on your member accounts.
```
./inspector2_enablement_with_awscli.sh -a deactivate -t members -s all
```

Disassociate the member accounts.

```
./inspector2_enablement_with_awscli.sh -a disassociate -t members
```

Deactivate the delegated admin account.

```
./inspector2_enablement_with_awscli.sh -a deactivate -t <DA_ACCOUNT_ID> -s all
```

Remove the delegated account as the admin for Amazon Inspector.

```
./inspector2_enablement_with_awscli.sh -a remove_admin -da <DA_ACCOUNT_ID>
```

Conclusion

In this blog post, we discussed how you can use Amazon Inspector to continuously scan your Lambda functions, and how to configure an Amazon EventBridge rule and SNS to send out notification of Lambda function vulnerabilities in near real time. You can then perform remediation activities by using AWS Lambda or AWS Systems Manager. We also showed how to enable Amazon Inspector at scale, activating in both single and multiple accounts, in default and multiple Regions.

As of this writing, a new feature to perform code scans for Lambda functions is available. Amazon Inspector can now also scan the custom application code within a Lambda function for code security vulnerabilities such as injection flaws, data leaks, weak cryptography, or missing encryption, based on AWS security best practices. You can use this additional scanning functionality to further protect your workloads.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Inspector forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Forward Zabbix Events to Event-Driven Ansible and Automate your Workflows

2023-06-21 Aleksandr Kotsegubov

Post Syndicated from Aleksandr Kotsegubov original https://blog.zabbix.com/forward-zabbix-events-to-event-driven-ansible-and-automate-your-workflows/25893/

Zabbix is highly regarded for its ability to integrate with a variety of systems right out of the box. That list of systems has recently been expanded with the addition of Event-Driven Ansible. Bringing Zabbix and Event-Driven Ansible together lets you completely automate your IT processes, with Zabbix being the source of events and Ansible serving as the executor. This article will explore in detail how to send events from Zabbix to Event-Driven Ansible.

What is Event-Driven Ansible?

Currently available in developer preview, Event-Driven Ansible is an event-based automation solution that automatically matches each new event to the conditions you specified. This eliminates routine tasks and lets you spend your time on more important issues. And because it’s a fully automated system, it doesn’t get sick, take lunch breaks, or go on vacation – by working around the clock, it can speed up important IT processes.

Sending an event from Zabbix to Event-Driven Ansible

From the Zabbix side, the implementation is a media type that uses a webhook – a tool that’s already familiar to most users. This solution allows you to take advantage of the flexibility of setting up alerts from Zabbix using actions. This media type is delivered to Zabbix out of the box, and if your installation doesn’t have it, you can import it yourself from our integrations page.

On the Event-Driven Ansible side, the webhook plugin from the ansible.eda standard collection is used. If your system doesn’t have this collection, you can get it by running the following command:

```
ansible-galaxy collection install ansible.eda
```

Let’s look at the process of sending events in more detail with the diagram below.

From the Zabbix side:

An event is created in Zabbix.
The Zabbix server checks the created event according to the conditions in the actions. If all the conditions in an action configured to send an event to Event-Driven Ansible are met, the next step (running the operations configured in the action) is executed.
Sending through the “Event-Driven Ansible” media type is configured as an operation. The address specified by the service user for the “Event-Driven Ansible” media is taken as the destination.
The media type script processes all the information about the event, generates a JSON, and sends it to Event-Driven Ansible.

From the Ansible side:

An event sent from Zabbix arrives at the specified address and port. The webhook plugin listens on this port.
After receiving an event, ansible-rulebook starts checking the conditions in order to find a match between the received event and the set of rules in ansible-rulebook.
If the conditions for any of the rules match the incoming event, then the ansible-rulebook performs the specified action. It can be either a single command or a playbook launch.

Let’s look at the setup process from each side.

Sending events from Zabbix

Setting up sending alerts is described in detail on the Zabbix – Ansible integration page. Here are the basic steps:

Import the media type of the required version if it is not present in your system.
Create a service user. Select “Event-Driven Ansible” as the media. As the destination, specify the address of your server and the port on which the webhook plugin will listen, in the format xxx.xxx.xxx.xxx:port. This article uses port 5001. You will need this value again when configuring ansible-rulebook.
Configure an action to send notifications. As an operation, specify sending via “Event-Driven Ansible.” Specify the service user created in the previous step as the recipient.

Receiving events in Event-Driven Ansible

First things first – you need to have an eda-server installed. Detailed installation and configuration instructions are available in the eda-server documentation.

After installing an eda-server, you can make your first ansible-rulebook. To do this, you need to create a file with the “yml” extension. Call it zabbix-test.yml and put the following code in it:

---
- name: Zabbix test rulebook
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5001
  rules:
    - name: debug
      condition: event.payload is defined
      action:
        debug:

As you may have noticed, an ansible-rulebook uses the YAML format. In this case, it has four parameters: name, hosts, sources, and rules.

Name and hosts parameters

The first two parameters are familiar to Ansible users. The name parameter contains the name of the ansible-rulebook. The hosts parameter specifies which hosts the ansible-rulebook applies to. Hosts are usually listed in the inventory file; you can learn more about the inventory file in the Ansible documentation. The most interesting parameters are sources and rules, so let’s take a closer look at them.

Sources parameter

The sources parameter specifies where the ansible-rulebook receives its events from. In this case, the ansible.eda.webhook plugin is specified as the event source. This means that after the ansible-rulebook starts, the webhook plugin starts listening on the port to receive events. The plugin needs two parameters to work:

Parameter “host” – the value 0.0.0.0 makes the plugin listen on all network interfaces, so it receives events from any address.
Parameter “port” – with 5001 as the value. This plugin will accept all incoming messages received on this particular port. The value of the port parameter must match the port you specified when creating the service user in Zabbix.

Rules parameter

The rules parameter contains a set of rules with conditions for matching an incoming event. If a condition matches the received event, the action specified in the action section is performed. Since this ansible-rulebook is only for reference, one rule is enough. For simplicity, you can use event.payload is defined as the condition. This simple condition means that the rule checks for the presence of the “event.payload” field in the incoming event. When you specify debug as the action, ansible-rulebook shows you the full text of the received event. With debug you can also see which fields are passed in the event and set the conditions you need accordingly.

The name, hosts, and sources parameters do not change in the examples that follow, because the webhook plugin is always the event source. They are therefore skipped, and only the value of the rules parameter is shown.

To start your ansible-rulebook you can use the command:

ansible-rulebook --rulebook /path/to/your/rulebook/zabbix-test.yml --verbose

The line Waiting for events in the output indicates that the ansible-rulebook has successfully loaded and is ready to receive events.
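Before wiring up Zabbix, you can check that the rulebook receives events by posting a hand-made JSON body to the webhook yourself. The following is a minimal sketch using only the Python standard library. The payload fields, the localhost address, and the /endpoint path are illustrative assumptions, not what the Zabbix media type actually sends. As we understand the ansible.eda.webhook plugin, it accepts a JSON POST and exposes the request body as event.payload, which is why the conditions in this article start with event.payload.

```python
import json
import urllib.request

# Hypothetical test event: a few fields shaped like the Zabbix examples in this article.
SAMPLE_EVENT = {
    "host_groups": ["Berlin", "Linux servers", "Web servers"],
    "event_tags": {"component": ["configuration"], "target": ["nginx"]},
}


def build_test_request(host: str = "localhost", port: int = 5001,
                       event: dict = SAMPLE_EVENT) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request aimed at the webhook plugin."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/endpoint",  # path chosen arbitrarily for this sketch
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_event(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code reported by the webhook."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With ansible-rulebook running and printing Waiting for events, calling send_test_event(build_test_request()) should make the debug action print the event, so you can confirm the port and the condition before configuring the Zabbix side.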

Examples

Ansible-rulebook provides a wide variety of opportunities for handling incoming events. We will look into some of the possible conditions and scenarios for using ansible-rulebook, but please remember that a more detailed list of all supported conditions and examples can be found on the official documentation page. For a general understanding of the principles of working with ansible-rulebook, please read the documentation.

Let’s see how to build conditions for precise event filtering in more detail with a few examples.

Example #1

You need to run a playbook to change the NGINX configuration at the Berlin office when you receive an event from Zabbix. The host is in three groups:

Linux servers
Web servers
Berlin.

And it has 3 tags:

target: nginx
class: software
component: configuration.

You can see all these parameters in the diagram below:

On the left side you can see a host with configured monitoring. To determine whether an event belongs to a given rule, you will work with two fields – host groups and tags. These parameters will be used to determine whether the event belongs to the required server and configuration. According to the diagram, all event data is sent to the media type script to generate and send JSON. On the Ansible side, the webhook receives an event with JSON from Zabbix and passes it to the ansible-rulebook to check the conditions. If the event matches all the conditions, the ansible-rulebook starts the specified action. In this case, it’s the start of the playbook.

In accordance with the specified settings for host groups and tags, the event will contain information as in the block below. However, only two fields from the output are needed – “host_groups” and “event_tags.”

{
    ...,
    "host_groups": [
        "Berlin",
        "Linux servers",
        "Web servers"],
    "event_tags": {
        "class": ["software"],
        "component": ["configuration"],
        "target": ["nginx"]},
    ...
}

Search by host groups

First, you need to determine that the host is a web server. You can tell this from the “Web servers” group on the host in the diagram above. The diagram also shows that the host belongs to the “Berlin” group and therefore to the Berlin office. To filter the event on the Event-Driven Ansible side, you need a condition that checks for the presence of two host groups in the received event: “Web servers” and “Berlin.” The “host_groups” field in the resulting JSON is a list, which means that you can use the is select construct to find an element in the list.

Search by tag value

The third condition checks whether the event relates to configuration. You can tell this from the “component” tag with the value “configuration.” However, the event_tags field in the resulting JSON deserves a closer look. It is a dictionary with tag names as keys, so you can refer to each tag separately on the Ansible side. Each tag always contains a list of values, because the same tag name can appear several times with different values. To search by tag value, refer to the specific tag and use the is select construct to locate an element in the list.

To solve this example, specify the following rules block in ansible-rulebook:

  rules:
    - name: Run playbook for office in Berlin
      condition: >-
        event.payload.host_groups is select("==","Web servers") and
        event.payload.host_groups is select("==","Berlin") and
        event.payload.event_tags.component is select("==","configuration")
      action:
        run_playbook:
          name: deploy-nginx-berlin.yaml

Solution

The condition field contains 3 elements, and you can see all conditions on the right side of the diagram. In all three cases, you can use the is select construct and check if the required element is in the list.

The first two conditions check for the presence of the required host groups in the list of groups in “event.payload.host_groups.” In the diagram, you can see with a green dotted line how the first two conditions correspond to groups on the host in Zabbix. According to the condition of the example, this host must belong to both required groups, meaning that you need to set the logical operation and between the two conditions.

In the last condition, the event_tags field is a dictionary. Therefore, you can refer to the tag by specifying its name in the “event.payload.event_tags.component” path and check for the presence of “configuration” among the tag values. In the diagram, the dotted line shows the relationship between the last condition and the tags on the host.

Since all three conditions must match according to the condition of the example, you once again need to put the logical operation and between them.
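To reason about what is select("==", ...) checks, it can help to reproduce the Example #1 condition in plain Python. The sketch below is our own approximation of the rule logic, not ansible-rulebook code: is select("==", value) is modelled as "at least one element of the list equals value", and a missing tag is treated as an empty list (an assumption).

```python
def has_item(values, wanted):
    """Approximate `<list> is select("==", wanted)`: true if any element equals wanted."""
    return any(item == wanted for item in (values or []))


def matches_berlin_rule(payload: dict) -> bool:
    """Mirror the three conditions of the Example #1 rule, joined with `and`."""
    groups = payload.get("host_groups", [])
    tags = payload.get("event_tags", {})
    return (
        has_item(groups, "Web servers")
        and has_item(groups, "Berlin")
        and has_item(tags.get("component"), "configuration")
    )


event_payload = {
    "host_groups": ["Berlin", "Linux servers", "Web servers"],
    "event_tags": {
        "class": ["software"],
        "component": ["configuration"],
        "target": ["nginx"],
    },
}

print(matches_berlin_rule(event_payload))  # True
```

Removing “Berlin” from host_groups, or dropping the “component” tag, makes the function return False, which mirrors why all three conditions are joined with and.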

Action block

Let’s analyze the action block. If all three conditions match, the ansible-rulebook will perform the specified action. In this case, that means launching the playbook using the run_playbook construct. Next, the name block contains the name of the playbook to run: deploy-nginx-berlin.yaml.

Example #2

Here is an example using the standard template Docker by Zabbix agent 2. For events triggered by “Container {#NAME}: Container has been stopped with error code”, the administrator additionally configured an action to send it to Event-Driven Ansible as well. Let’s assume that in the case of stopping the container “internal_portal” with the status “137”, its restart requires preparation, with the logic of that preparation specified in the playbook.

There are more details in the diagram above. On the left side, you can see a host with configured monitoring. The event from the example will have many parameters, but you will work with two – operational data and all tags of this event. According to the general concept, all this data will go into the media type script, which will generate JSON for sending to Event-Driven Ansible. On the Ansible side, the ansible-rulebook checks the received event for compliance with the specified conditions. If the event matches all the conditions, the ansible-rulebook starts the specified action, in this case, the start of the playbook.

In the block below you can see part of the JSON to send to Event-Driven Ansible. To solve the task, you need to be concerned only with two fields from the entire output: “event_tags” and “operation_data”:

{
    ...,
    "event_tags": {
        "class": ["software"],
        "component": ["system"],
        "container": ["/internal_portal"],
        "scope": ["availability"],
        "target": ["docker"]},
    "operation_data": "Exit code: 137",
    ...
}

Search by tag value

The first step is to determine that the event belongs to the required container. The container name is stored in the “container” tag, so you need a condition that searches that tag for the value “/internal_portal.” As discussed in the previous example, the event_tags field in the resulting JSON is a dictionary with tag names as keys, and the value for each key is always a list, because tags can be repeated with different values. Therefore, to search by value, refer to the specific tag and use the is select construct.

Search by operational data field

The second step is to check the exit code. According to the trigger settings, this information is displayed in the operational data and passed to Event-Driven Ansible in the “operation_data” field. This field is a string, and you need to check with a regular expression if this field contains the value “Exit code: 137.” On the ansible-rulebook side, the is regex construct will be used to search for a regular expression.

To solve this example, specify the following rules block in ansible-rulebook:

  rules:
    - name: Run playbook for container "internal_portal"
      condition: >-
        event.payload.event_tags.container is select("==","/internal_portal") and
        event.payload.operation_data is regex("Exit code.*137")
      action:
        run_playbook:
          name: restart_internal_portal.yaml

Solution

In the first condition, the event_tags field is a dictionary and you are referring to a specific tag, so the final path includes the tag name: “event.payload.event_tags.container.” Next, the is select construct checks the list of tag values, which confirms that the required “/internal_portal” container is present as a value of the tag. In the diagram, the green dotted line shows the relationship between this condition in the ansible-rulebook and the tags in the event on the Zabbix side.

In the second condition, the is regex construct is applied to the event.payload.operation_data field with the regular expression “Exit code.*137.” This checks for the presence of the status “137” in the operational data. In the diagram, the green dotted line links this condition on the ansible-rulebook side to the operational data of the event in Zabbix.

Since both conditions must match, you can specify the and logical operation between the conditions.
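The same kind of approximation works for Example #2. Here is regex(...) is modelled with Python's re.search, meaning the pattern may match anywhere in the string; this is our assumption, so check the ansible-rulebook conditions documentation for the exact matching semantics before relying on it.

```python
import re


def has_item(values, wanted):
    """Approximate `<list> is select("==", wanted)`."""
    return any(item == wanted for item in (values or []))


def matches_portal_rule(payload: dict, pattern: str = r"Exit code.*137") -> bool:
    """Mirror the Example #2 rule: container tag check AND regex on operation_data."""
    tags = payload.get("event_tags", {})
    operation_data = payload.get("operation_data", "")
    return has_item(tags.get("container"), "/internal_portal") and bool(
        re.search(pattern, operation_data)
    )


event_payload = {
    "event_tags": {"container": ["/internal_portal"], "target": ["docker"]},
    "operation_data": "Exit code: 137",
}

print(matches_portal_rule(event_payload))  # True
print(matches_portal_rule({**event_payload, "operation_data": "Exit code: 1370"}))  # True
print(matches_portal_rule({**event_payload, "operation_data": "Exit code: 1370"},
                          pattern=r"Exit code: 137$"))  # False
```

The second call shows a side effect of the loose pattern: under search semantics, “Exit code.*137” also matches an exit code such as 1370. If that matters, an anchored pattern such as “Exit code: 137$” is stricter.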

Action block

Taking a look at the action block, if both conditions match, the ansible-rulebook will perform the specified action. In this case, it’s the launch of the playbook using the run_playbook construct. Next, the name block contains the name of the playbook to run: restart_internal_portal.yaml.

Conclusion

It’s clear that both tools (and especially their interconnected work) are great for implementing automation. Zabbix is a powerful monitoring solution, and Ansible is a great orchestration software. Both of these tools complement each other, creating an excellent tandem that takes on all routine tasks. This article has shown how to send events from Zabbix to Event-Driven Ansible and how to configure it on each side, and it has also proven that it’s not as difficult as it might initially seem. But remember – we’ve only looked at the simplest examples. The rest depends only on your imagination.

Questions

Q: How can I get the full list of fields in an event?

A: The best way is to make an ansible-rulebook with action “debug” and condition “event.payload is defined.” In this case, all events from Zabbix will be displayed. This example is described in the section “Receiving Events in Event-Driven Ansible.”

Q: Does the list of sent fields depend on the situation?

A: No. The list of fields in the sent event is always the same. If there are no objects for a field, the field will be empty. Tags are a good example: an event may have no tags, but the “event_tags” field will still be sent.

Q: What events can be sent from Zabbix to Event-Driven Ansible?

A: In the current version (Zabbix 6.4), only trigger-based events and problems can be sent.

Q: Is it possible to use the values of received events in the ansible-playbook?

A: Yes. On the ansible-playbook side, you can get values using the ansible_eda namespace. To access the values in an event, you need to specify ansible_eda.event.

For example, to display all the details of an event, you can use:

  tasks:
    - debug:
        msg: "{{ ansible_eda.event }}"

To get the name of the container from example #2 of this article, you can use the following code:

  tasks:
    - debug:
        msg: "{{ ansible_eda.event.payload.event_tags.container }}"

The post Forward Zabbix Events to Event-Driven Ansible and Automate your Workflows appeared first on Zabbix Blog.

AWS Security Hub launches a new capability for automating actions to update findings

2023-06-13 Stuart Gregg

Post Syndicated from Stuart Gregg original https://aws.amazon.com/blogs/security/aws-security-hub-launches-a-new-capability-for-automating-actions-to-update-findings/

If you’ve had discussions with a security organization recently, there’s a high probability that the word automation has come up. As organizations scale and consume the benefits the cloud has to offer, it’s important to factor in and understand how the additional cloud footprint will affect operations. Automation is a key enabler for efficient operations and can help drive down the number of repetitive tasks that the operational teams have to perform.

Alert fatigue is caused when humans work on the same repetitive tasks day in and day out and also have a large volume of alerts that need to be addressed. The repetitive nature of these tasks can cause analysts to become numb to the importance of the task or make errors due to manual processing. This can lead to misclassification of security alerts or higher-severity alerts being overlooked due to investigation times. Automation is key here to reduce the number of repetitive tasks and give analysts time to focus on other areas of importance.

In this blog post, we’ll walk you through new capabilities within AWS Security Hub that you can use to take automated actions to update findings. We’ll show you some example scenarios that use this capability and set you up with the knowledge you need to get started with creating automation rules.

Automation rules in Security Hub

AWS Security Hub is available globally and is designed to give you a comprehensive view of your security posture across your AWS accounts. With Security Hub, you have a single place that aggregates, organizes, and prioritizes your security alerts, or findings, from multiple AWS services, including Amazon GuardDuty, Amazon Inspector, Amazon Macie, AWS Firewall Manager, AWS Systems Manager Patch Manager, AWS Config, AWS Health, and AWS Identity and Access Management (IAM) Access Analyzer, as well as from over 65 AWS Partner Network (APN) solutions.

Previously, Security Hub could take automated actions on findings, but this involved going to the Amazon EventBridge console or API, creating an EventBridge rule, and then building an AWS Lambda function, an AWS Systems Manager Automation runbook, or an AWS Step Functions step as the target of that rule. If you wanted to set up these automated actions in the administrator account and home AWS Region and run them in member accounts and in linked Regions, you would also need to deploy the correct IAM permissions to enable the actions to run across accounts and Regions. After setting up the automation flow, you would need to maintain the EventBridge rule, Lambda function, and IAM roles. Such maintenance could include upgrading the Lambda versions, verifying operational efficiency, and checking that everything is running as expected.

With Security Hub, you can now use rules to automatically update various fields in findings that match defined criteria. This allows you to automatically suppress findings, update findings’ severities according to organizational policies, change findings’ workflow status, and add notes. As findings are ingested, automation rules look for findings that meet defined criteria and update the specified fields in findings that meet the criteria. For example, a user can create a rule that automatically sets the finding’s severity to “Critical” if the finding account ID is of a known business-critical account. A user could also automatically suppress findings for a specific control in an account where the finding represents an accepted risk.

With automation rules, Security Hub provides you a simplified way to build automations directly from the Security Hub console and API. This reduces repetitive work for cloud security and DevOps engineers and can reduce the mean time to response.

Use cases

In this section, we’ve put together some examples of how Security Hub automation rules can help you. There’s a lot of flexibility in how you can use the rules, and we expect there will be many variations that your organization will use when contextual information about security risk has been added.

Scenario 1: Elevate finding severity for specific controls based on account IDs

Security Hub offers protection by using hundreds of security controls that create findings that have a severity associated with them. Sometimes, you might want to elevate that severity according to your organizational policies or according to the context of the finding, such as the account it relates to. With automation rules, you can now automatically elevate the severity for specific controls when they are in a specific account.

For example, the AWS Foundational Security Best Practices control GuardDuty.1 has a “High” severity by default. But you might consider such a finding to have “Critical” severity if it occurs in one of your top production accounts. To change the severity automatically, you can choose GeneratorId as a criterion and check that it’s equal to aws-foundational-security-best-practices/v/1.0.0/GuardDuty.1, and also add AwsAccountId as a criterion and check that it’s equal to YOUR_ACCOUNT_IDs. Then, add an action to update the severity to “Critical,” and add a note for the person who will look at the finding that reads “Urgent – look into these production accounts.”

You can set up this automation rule through the AWS CLI, the console, the Security Hub API, or the AWS SDK for Python (Boto3), as follows.

To set up the automation rule for Scenario 1 (AWS CLI)

In the AWS CLI, run the following command to create a new automation rule; Security Hub returns the Amazon Resource Name (ARN) of the new rule. Note the different modifiable parameters:
- --rule-name: The name of the rule that will be created.
- --rule-status: An optional parameter. Specifies whether Security Hub activates the rule and starts applying it to findings after creation. If no value is specified, the default value is ENABLED. A value of DISABLED means that the rule is paused after creation.
- --rule-order: The processing order for the rule. Security Hub applies rules with a lower numerical value for this parameter first.
- --description: A description of the rule.
- --criteria: The criteria that Security Hub uses to filter your findings. The rule action is applied to findings that match the criteria. For a list of supported criteria, see Criteria and actions for automation rules. In this example, the criteria are placeholders and should be replaced.
- --actions: The actions that Security Hub takes when a finding matches your defined criteria. For a list of supported actions, see Criteria and actions for automation rules. In this example, the actions are placeholders and should be replaced.
aws securityhub create-automation-rule \
    --rule-name "Elevate severity for findings in production accounts - GuardDuty.1" \
    --rule-status "ENABLED" \
    --rule-order 1 \
    --description "Elevate severity for findings in production accounts - GuardDuty.1" \
    --criteria '{"GeneratorId": [{"Value": "aws-foundational-security-best-practices/v/1.0.0/GuardDuty.1", "Comparison": "EQUALS"}], "AwsAccountId": [{"Value": "111122223333", "Comparison": "EQUALS"}]}' \
    --actions '[{"Type": "FINDING_FIELDS_UPDATE", "FindingFieldsUpdate": {"Severity": {"Label": "CRITICAL"}, "Note": {"Text": "Urgent – look into these production accounts", "UpdatedBy": "sechub-automation"}}}]' \
    --region us-east-1

To set up the automation rule for Scenario 1 (console)

Open the Security Hub console, and in the left navigation pane, choose Automations.

Figure 1: Automation rules in the Security Hub console
Choose Create rule, and then choose Create a custom rule to get started with creating a rule of your choice. Add a rule name and description.

Figure 2: Create a new custom rule
Under Criteria, add the following information.
- Key 1
  - Key = GeneratorId
  - Operator = EQUALS
  - Value = aws-foundational-security-best-practices/v/1.0.0/GuardDuty.1
- Key 2
  - Key = AwsAccountId
  - Operator = EQUALS
  - Value = Your AWS account ID
Figure 3: Information added for the rule criteria
You can preview which findings will match the criteria by looking in the preview section.

Figure 4: Preview section
Next, under Automated action, specify which finding value to update automatically when findings match your criteria.

Figure 5: Automated action to be taken against the findings that match the criteria
For Rule status, choose Enabled, and then choose Create rule.

Figure 6: Set the rule status to Enabled
After you choose Create rule, you will see the newly created rule within the Automations portal.

Figure 7: Newly created rule within the Security Hub Automations page

Note: In figure 7, you can see multiple automation rules. When you create automation rules, you assign each rule an order number. This determines the order in which Security Hub applies your automation rules. This becomes important when multiple rules apply to the same finding or finding field. When multiple rule actions apply to the same finding field, the rule with the highest numerical value for rule order is applied last and has the ultimate effect on that field.
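The ordering behaviour described in this note can be illustrated with a small simulation. This is a hypothetical model written for this post, not Security Hub code. Rules are plain dictionaries with a RuleOrder, an optional IsTerminal flag, a match predicate, and an update dictionary. They are applied in ascending RuleOrder, so a higher-numbered rule overwrites a field set by a lower-numbered one, and a terminal rule (the console's Ignore subsequent rules option) stops further processing. The sketch also assumes that every rule's criteria are evaluated against the finding as it arrived.

```python
from typing import Any, Dict, List


def apply_rules(finding: Dict[str, Any], rules: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply matching rules in ascending RuleOrder and return an updated copy."""
    result = dict(finding)
    for rule in sorted(rules, key=lambda r: r["RuleOrder"]):
        # Assumption: criteria are checked against the finding as it arrived.
        if not rule["match"](finding):
            continue
        result.update(rule["update"])  # a later rule overwrites fields set earlier
        if rule.get("IsTerminal", False):
            break  # models "Ignore subsequent rules for findings that match these criteria"
    return result


finding = {"ProductName": "GuardDuty", "AwsAccountId": "111122223333", "SeverityLabel": "MEDIUM"}

rules = [
    {"RuleOrder": 2, "match": lambda f: f["AwsAccountId"] == "111122223333",
     "update": {"SeverityLabel": "CRITICAL"}},
    {"RuleOrder": 1, "match": lambda f: f["ProductName"] == "GuardDuty",
     "update": {"SeverityLabel": "HIGH"}},
]

print(apply_rules(finding, rules)["SeverityLabel"])  # CRITICAL
```

If the rule with RuleOrder 1 were marked as terminal, the finding would keep the HIGH label set by that rule, because the rule with RuleOrder 2 would never run.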

Additionally, if your preferred deployment method is to use the API or AWS SDK for Python (Boto3), we have information on how you can use these means of deployment in our public documentation.
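As a starting point for the Boto3 route, the following sketch builds the same Scenario 1 rule as the CLI command and passes it to the Security Hub create_automation_rule call. The account ID 111122223333 and the Region are placeholders. The request is built in a separate function so that you can inspect it before calling AWS. The comment about how multiple account IDs combine reflects our understanding of Security Hub filter semantics; confirm it, and the parameter names, against the current Boto3 reference.

```python
GENERATOR_ID = "aws-foundational-security-best-practices/v/1.0.0/GuardDuty.1"


def build_scenario1_rule(account_ids):
    """Return keyword arguments for securityhub.create_automation_rule (Scenario 1)."""
    return {
        "RuleName": "Elevate severity for findings in production accounts - GuardDuty.1",
        "RuleStatus": "ENABLED",
        "RuleOrder": 1,
        "Description": "Elevate severity for findings in production accounts - GuardDuty.1",
        "Criteria": {
            "GeneratorId": [{"Value": GENERATOR_ID, "Comparison": "EQUALS"}],
            # One filter per account ID; assumed to be combined with OR for the same field.
            "AwsAccountId": [
                {"Value": account_id, "Comparison": "EQUALS"} for account_id in account_ids
            ],
        },
        "Actions": [
            {
                "Type": "FINDING_FIELDS_UPDATE",
                "FindingFieldsUpdate": {
                    "Severity": {"Label": "CRITICAL"},
                    "Note": {
                        "Text": "Urgent - look into these production accounts",
                        "UpdatedBy": "sechub-automation",
                    },
                },
            }
        ],
    }


def create_rule(rule_kwargs, region="us-east-1"):
    """Create the rule in Security Hub and return its ARN (needs AWS credentials)."""
    import boto3  # imported here so the builder above can be used without boto3

    client = boto3.client("securityhub", region_name=region)
    response = client.create_automation_rule(**rule_kwargs)
    return response["RuleArn"]
```

To create the rule, run create_rule(build_scenario1_rule(["111122223333"])) from the Security Hub administrator account with credentials that allow securityhub:CreateAutomationRule.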

Scenario 2: Change the finding severity to high if a resource is important, based on resource tags

Imagine a situation where you have findings associated with a wide range of resources. Typically, organizations will attempt to prioritize which findings to remediate first. You can achieve this prioritization through Security Hub and the contextual fields that are available for you to use, for example, the severity of the finding or the account ID that the resource is in. You might also have your own prioritization based on other factors. You could add this additional context to findings by using a tagging strategy. With automation rules, you can now automatically elevate the severity of specific findings based on the tag value associated with the resource.

For example, if a finding comes into Security Hub with the severity rating “Medium,” but the resource in question is critical to the business and has the tag production associated with it, you could automatically raise the severity rating to “High.”

Note: This will work only for findings where there is a resource tag associated with the finding.
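A possible shape for the Scenario 2 rule, written in the same Boto3 keyword-argument style as the Scenario 1 sketch, is shown below. The tag key Environment and the rule name are hypothetical; the post only says the resource is tagged production. The sketch assumes that ResourceTags accepts a key/value map filter and that SeverityLabel accepts a string filter. Check both against the criteria list in Criteria and actions for automation rules before use.

```python
def build_scenario2_rule(tag_key="Environment", tag_value="production", rule_order=2):
    """Keyword arguments for create_automation_rule: MEDIUM -> HIGH for tagged resources."""
    return {
        "RuleName": "Elevate severity for production-tagged resources",  # hypothetical name
        "RuleStatus": "ENABLED",
        "RuleOrder": rule_order,
        "Description": "Raise MEDIUM findings to HIGH when the resource is tagged as production",
        "Criteria": {
            # Only findings that currently have MEDIUM severity, as in the example above.
            "SeverityLabel": [{"Value": "MEDIUM", "Comparison": "EQUALS"}],
            # Map filter on resource tags; the key and value are hypothetical examples.
            "ResourceTags": [{"Key": tag_key, "Value": tag_value, "Comparison": "EQUALS"}],
        },
        "Actions": [
            {
                "Type": "FINDING_FIELDS_UPDATE",
                "FindingFieldsUpdate": {"Severity": {"Label": "HIGH"}},
            }
        ],
    }
```

The resulting dictionary can be passed to create_automation_rule in the same way as the Scenario 1 rule. Giving it a different RuleOrder from your other rules keeps the order of severity changes predictable.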

Scenario 3: Suppress GuardDuty findings with a severity of “Informational”

GuardDuty provides an overarching view of the state of threats to deployed resources in your organization’s cloud environment. After evaluation, GuardDuty produces findings related to these threats. The findings produced by GuardDuty have different severities, to help organizations with prioritization. Some of these findings will be given an “Informational” severity. “Informational” indicates that no issue was found and the content of the finding is purely to give information. After you have evaluated the context of the finding, you might want to suppress any additional findings that match the same criteria.

For example, you might want to set up a rule so that new findings with the generator ID that produced “Informational” findings are suppressed, keeping only the findings that need action.
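Scenario 3 can be expressed the same way. This sketch assumes that ProductName and SeverityLabel string filters and a Workflow status update of SUPPRESSED are available for automation rules. The generator IDs are left as a parameter, because the post does not name them; the value used in any test or example is a placeholder.

```python
def build_scenario3_rule(generator_ids, rule_order=3):
    """Keyword arguments for create_automation_rule: suppress informational GuardDuty findings."""
    criteria = {
        "ProductName": [{"Value": "GuardDuty", "Comparison": "EQUALS"}],
        "SeverityLabel": [{"Value": "INFORMATIONAL", "Comparison": "EQUALS"}],
    }
    if generator_ids:
        # Optionally narrow the rule to generator IDs you have already reviewed.
        criteria["GeneratorId"] = [
            {"Value": generator_id, "Comparison": "EQUALS"} for generator_id in generator_ids
        ]
    return {
        "RuleName": "Suppress informational GuardDuty findings",  # hypothetical name
        "RuleStatus": "ENABLED",
        "RuleOrder": rule_order,
        "Description": "Suppress GuardDuty findings with INFORMATIONAL severity",
        "Criteria": criteria,
        "Actions": [
            {
                "Type": "FINDING_FIELDS_UPDATE",
                "FindingFieldsUpdate": {
                    "Workflow": {"Status": "SUPPRESSED"},
                    "Note": {
                        "Text": "Suppressed by automation rule after review",
                        "UpdatedBy": "sechub-automation",
                    },
                },
            }
        ],
    }
```

Adding a note alongside the suppression records why the finding was hidden, which helps when someone later reviews suppressed findings.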

Templates

When you create a new rule, you can also choose to create a rule from a template. These templates are regularly updated with use cases that are applicable for many customers.

To set up an automation rule by using a template from the console

In the Security Hub console, choose Automations, and then choose Create rule.
Choose Create a rule from a template to get started with creating a rule of your choice.
Select a rule template from the drop-down menu.

Figure 8: Select an automation rule template
(Optional) If necessary, modify the Rule, Criteria, and Automated action sections.
For Rule status, choose whether you want the rule to be enabled or disabled after it’s created.
(Optional) Expand the Additional settings section. Choose Ignore subsequent rules for findings that match these criteria if you want this rule to be the last rule applied to findings that match the rule criteria.
(Optional) For Tags, add tags as key-value pairs to help you identify the rule.
Choose Create rule.

Multi-Region deployment

For organizations that operate in multiple AWS Regions, we’ve provided a solution that you can use to replicate rules created in your central Security Hub admin account into these additional Regions. You can find the sample code for this solution in our GitHub repo.
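To give an idea of what such replication involves, here is a simplified sketch. It is not the GitHub solution referenced above. It reads the rules from the source Region with list_automation_rules and batch_get_automation_rules, removes the fields that are returned by the service but not accepted on creation, and calls create_automation_rule in each target Region. The set of read-only fields and the batch size of 100 ARNs are our assumptions. The sketch also does not update or delete rules that already exist in the target Regions, so running it twice creates duplicates.

```python
# Fields returned by batch_get_automation_rules that create_automation_rule does not take (assumed).
READ_ONLY_FIELDS = {"RuleArn", "CreatedAt", "UpdatedAt", "CreatedBy"}


def to_create_kwargs(rule: dict) -> dict:
    """Turn a rule description into keyword arguments for create_automation_rule."""
    return {key: value for key, value in rule.items() if key not in READ_ONLY_FIELDS}


def replicate_rules(source_region, target_regions):
    """Copy every automation rule from source_region into each target Region (needs AWS credentials)."""
    import boto3

    source = boto3.client("securityhub", region_name=source_region)
    arns, token = [], None
    while True:
        page = source.list_automation_rules(**({"NextToken": token} if token else {}))
        arns += [meta["RuleArn"] for meta in page.get("AutomationRulesMetadata", [])]
        token = page.get("NextToken")
        if not token:
            break

    rules = []
    for start in range(0, len(arns), 100):  # assumed limit of 100 ARNs per call
        batch = source.batch_get_automation_rules(AutomationRulesArns=arns[start:start + 100])
        rules += batch.get("Rules", [])

    for region in target_regions:
        target = boto3.client("securityhub", region_name=region)
        for rule in rules:
            target.create_automation_rule(**to_create_kwargs(rule))
```

In practice you would also handle rules that already exist in the target Region, for example by comparing rule names first, which is the kind of detail the published solution is designed to take care of.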

Conclusion

In this blog post, we’ve discussed the importance of automation and its ability to help organizations scale operations within the cloud. We’ve introduced a new capability in AWS Security Hub, automation rules, that can help reduce the repetitive tasks your operational teams may be facing, and we’ve showcased some example use cases to get you started. Start using automation rules in your environment today. We’re excited to see what use cases you will solve with this feature and as always, are happy to receive any feedback.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Get custom data into Amazon Security Lake through ingesting Azure activity logs

2023-05-30 Adam Plotzker

Post Syndicated from Adam Plotzker original https://aws.amazon.com/blogs/security/get-custom-data-into-amazon-security-lake-through-ingesting-azure-activity-logs/

Amazon Security Lake automatically centralizes security data from both cloud and on-premises sources into a purpose-built data lake stored on a particular AWS delegated administrator account for Amazon Security Lake.

In this blog post, I will show you how to configure your Amazon Security Lake solution with cloud activity data from Microsoft Azure Monitor activity log, which you can query alongside your existing AWS CloudTrail data. I will walk you through the steps, from configuring the AWS Identity and Access Management (IAM) permissions, AWS Glue jobs, and Amazon Kinesis Data Streams required on the AWS side to forwarding that data from within Azure.

When you turn on Amazon Security Lake, it begins to collect actionable security data from various AWS sources. However, many enterprises today have complex environments that include a mix of different cloud resources in addition to on-premises data centers.

Although the AWS data sources in Amazon Security Lake encompass a large amount of the security data needed for analysis, you may miss the full picture if your infrastructure operates across multiple cloud vendors (for example, AWS, Azure, and Google Cloud Platform) and on-premises at the same time. By querying data from across your entire infrastructure, you can increase the number of indicators of compromise (IOCs) that you identify, and thus increase the likelihood that those indicators will lead to actionable outputs.

Solution architecture

Figure 1 shows how to configure data to travel from an Azure event hub to Amazon Security Lake.

Figure 1: Solution architecture

As shown in Figure 1, the solution involves the following steps:

1. An AWS user instantiates the required AWS services and features that enable the process to function, including IAM permissions, Kinesis data streams, AWS Glue jobs, and Amazon Simple Storage Service (Amazon S3) buckets, either manually or through an AWS CloudFormation template, such as the one we will use in this post.
2. In response to the custom source created from the CloudFormation template, a Security Lake table is generated in AWS Glue.
3. From this point on, Azure activity logs in their native format are stored within an Azure event hub in an Azure account. An Azure function is deployed to respond to new events within the Azure event hub and forward these logs over the internet to the Kinesis data stream that was created in step 1.
4. The Kinesis data stream forwards the data to an AWS Glue streaming job that reads from the stream.
5. The AWS Glue job then performs the extract, transform, and load (ETL) mapping to the appropriate Open Cybersecurity Schema Framework (OCSF) class (specified for API Activity events at OCSF API Activity Mappings).
6. The Azure events are partitioned according to the partitioning requirements of Amazon Security Lake tables and stored in Amazon S3.
7. The user can query these tables by using Amazon Athena alongside the rest of their data in Amazon Security Lake.
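The ETL mapping step above is implemented by the PySpark script that ships with this solution, which is not reproduced here. To make "mapping to OCSF API Activity" concrete for a single record, the following is a minimal Python sketch under stated assumptions: the Azure field names (operationName, caller, callerIpAddress, time) and the mapping of the operation verb to an OCSF activity are illustrative choices, not the script's actual logic. Only the OCSF identifiers (API Activity class 6003, Application Activity category 6, and type_uid = class_uid * 100 + activity_id) come from the OCSF schema.

```python
# Hypothetical sketch: map one Azure activity log record to a subset of OCSF API Activity fields.
# This is NOT the post's azure-activity-pyspark.py; the Azure field names and the
# verb-to-activity mapping below are assumptions made for illustration.
from datetime import datetime

API_ACTIVITY_CLASS_UID = 6003       # OCSF API Activity class
APPLICATION_ACTIVITY_CATEGORY = 6   # OCSF Application Activity category

# Assumed mapping from the last segment of an Azure operationName to an OCSF activity_id.
VERB_TO_ACTIVITY_ID = {
    "read": 2,    # Read
    "write": 3,   # Update (Azure "write" can also mean create; treated as update here)
    "delete": 4,  # Delete
}
OTHER_ACTIVITY_ID = 99


def to_epoch_millis(iso_timestamp: str) -> int:
    """Convert an ISO 8601 UTC timestamp such as '2023-03-01T12:00:00Z' to epoch milliseconds.

    Assumes at most six fractional-second digits, which older Python versions require.
    """
    parsed = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return int(parsed.timestamp() * 1000)


def map_azure_activity(record: dict) -> dict:
    operation = record.get("operationName", "")
    verb = operation.rsplit("/", 1)[-1].lower()
    activity_id = VERB_TO_ACTIVITY_ID.get(verb, OTHER_ACTIVITY_ID)
    return {
        "class_uid": API_ACTIVITY_CLASS_UID,
        "category_uid": APPLICATION_ACTIVITY_CATEGORY,
        "activity_id": activity_id,
        # OCSF derives type_uid as class_uid * 100 + activity_id.
        "type_uid": API_ACTIVITY_CLASS_UID * 100 + activity_id,
        "time": to_epoch_millis(record["time"]),
        "api": {"operation": operation},
        "actor": {"user": {"uid": record.get("caller")}},
        "src_endpoint": {"ip": record.get("callerIpAddress")},
    }


example = {
    "time": "2023-03-01T12:00:00Z",
    "operationName": "Microsoft.Compute/virtualMachines/delete",
    "caller": "user1",
    "callerIpAddress": "203.0.113.10",
}
print(map_azure_activity(example)["type_uid"])  # prints 600304
```

A real Glue job would apply this kind of per-record logic inside a Spark transformation and would also populate the remaining OCSF fields (such as metadata and severity) that the Athena query later in this post selects.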

Prerequisites

Before you implement the solution, complete the following prerequisites:

Verify that you have enabled Amazon Security Lake in the AWS Regions that correspond to the Azure Activity logs that you will forward. For more information, see What is Amazon Security Lake?
Preconfigure the custom source logging for the source AZURE_ACTIVITY in your Region. To configure this custom source in Amazon Security Lake, open the Amazon Security Lake console, navigate to Create custom data source, and do the following, as shown in Figure 2:
- For Data source name, enter AZURE_ACTIVITY.
- For Event class, select API_ACTIVITY.
- For Account Id, enter the ID of the account which is authorized to write data to your data lake.
- For External Id, enter "AZURE_ACTIVITY-<YYYYMMDD>".
Figure 2: Configure custom data source

For more information on how to configure custom sources for Amazon Security Lake, see Collecting data from custom sources.

Step 1: Configure AWS services for Azure activity logging

The first step is to configure the AWS services for Azure activity logging.

To configure Azure activity logging in Amazon Security Lake, first prepare the assets required in the target AWS account. You can automate this process by using the provided CloudFormation template — Security Lake CloudFormation — which will do the heavy lifting for this portion of the setup.

Note: I have predefined these scripts to create the AWS assets required to ingest Azure activity logs, but you can generalize this process for other external log sources, as well.

The CloudFormation template has the following components:
- securitylakeGlueStreamingRole — includes the following managed policies:
  - AWSLambdaKinesisExecutionRole
  - AWSGlueServiceRole
- securitylakeGlueStreamingPolicy — includes the following actions:
  - s3:GetObject
  - s3:PutObject
- securitylakeAzureActivityStream — This Kinesis data stream is the endpoint that acts as the connection point between Azure and AWS and the frontend of the AWS Glue stream that feeds Azure activity logs to Amazon Security Lake.
- securitylakeAzureActivityJob — This is an AWS Glue streaming job that is used to take in feeds from the Kinesis data stream and map the Azure activity logs within that stream to OCSF.
- securitylake-glue-assets S3 bucket — This is the S3 bucket that is used to store the ETL scripts used in the AWS Glue job to map Azure activity logs.
Running the CloudFormation template will instantiate the aforementioned assets in your AWS delegated administrator account for Amazon Security Lake.
The CloudFormation template creates a new S3 bucket named securitylake-glue-assets-<ACCOUNT-ID>-<REGION>. After the CloudFormation run is complete, navigate to this bucket in the S3 console.
Within the S3 bucket, create a scripts folder and a temporary folder, as shown in Figure 4.

Figure 4: Glue assets bucket
Update the Azure AWS Glue PySpark script by replacing the following values in the file. You will attach this script to your AWS Glue job and use it to generate the AWS assets required for the implementation.
- Replace <AWS_REGION_NAME> with the Region that you are operating in — for example, us-east-2.
- Replace <AWS_ACCOUNT_ID> with the account ID of your delegated administrator account for Amazon Security Lake — for example, 111122223333.
- Replace <SECURITYLAKE-AZURE-STREAM-ARN> with the ARN of the Kinesis stream created through the CloudFormation template. To find it, open the Kinesis console, navigate to the Kinesis stream named securityLakeAzureActivityStream-<STREAM-UID>, and copy the Amazon Resource Name (ARN), as shown in the following figure.
  
  Figure 5: Kinesis stream ARN
- Replace <SECURITYLAKE-BUCKET-NAME> with the root name of your data lake S3 bucket, for example, s3://aws-security-data-lake-DOC-EXAMPLE-BUCKET.
After you replace these values, navigate to the scripts folder and upload the AWS Glue PySpark script named azure-activity-pyspark.py, as shown in Figure 6.

Figure 6: AWS Glue script
Within your AWS Glue job, choose Job details and configure the job as follows:
- For Type, select Spark Streaming.
- For Language, select Python 3.
- For Script path, select the S3 path of the script that you uploaded to the scripts folder in the preceding step.
- For Temporary path, select the S3 path of the temporary folder that you created earlier.
Choose Save to save the changes, and then choose Run to run the AWS Glue job.
Choose the Runs tab, and make sure that the Run status of the job is Running.

Figure 7: AWS Glue job status

At this point, you have finished the configurations from AWS.
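The placeholder replacement described in Step 1 can also be scripted so that a missed placeholder fails loudly instead of reaching the AWS Glue job. The following is a minimal Python sketch; the helper name and the example values (including the stream ARN) are hypothetical, and only the placeholder names come from this post.

```python
# Hypothetical helper: fill the placeholders in a local copy of the AWS Glue script
# before uploading it. Placeholder names come from this post; the values are examples.
import re

PLACEHOLDER_PATTERN = re.compile(r"<[A-Z][A-Z0-9_-]*>")


def fill_placeholders(script_text: str, values: dict) -> str:
    """Replace each <NAME> placeholder and fail if any upper-case placeholder remains."""
    for name, value in values.items():
        script_text = script_text.replace(f"<{name}>", value)
    leftovers = sorted(set(PLACEHOLDER_PATTERN.findall(script_text)))
    if leftovers:
        raise ValueError(f"Unreplaced placeholders: {leftovers}")
    return script_text


example_values = {
    "AWS_REGION_NAME": "us-east-2",
    "AWS_ACCOUNT_ID": "111122223333",
    # Hypothetical ARN; copy the real one from the Kinesis console.
    "SECURITYLAKE-AZURE-STREAM-ARN": "arn:aws:kinesis:us-east-2:111122223333:stream/example-stream",
    "SECURITYLAKE-BUCKET-NAME": "s3://aws-security-data-lake-DOC-EXAMPLE-BUCKET",
}
snippet = 'region = "<AWS_REGION_NAME>"\naccount = "<AWS_ACCOUNT_ID>"'
print(fill_placeholders(snippet, example_values))
```

In practice you would read azure-activity-pyspark.py, pass its text through fill_placeholders, write the result to a new file, and upload that file to the scripts folder.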

Step 2: Configure Azure services for Azure activity log forwarding

You will complete the next steps in the Azure portal. You need to configure Azure to export activity logs to an Azure event hub within your desired Azure account or organization. Additionally, you need to create an Azure function that responds to new events within the Azure event hub and forwards those logs over the internet to the Kinesis data stream that the CloudFormation template created in Step 1 of this post.

For information about how to set up and configure Azure Functions to respond to event hubs, see Azure Event Hubs Trigger for Azure Functions in the Azure documentation.

Configure the following Python script — Azure Event Hub Function — in an Azure function app. This function is designed to respond to event hub events, create a connection to AWS, and forward those events to Kinesis as deserialized JSON blobs.

In the script, replace the following variables with your own information:

For <SECURITYLAKE-AZURE-STREAM-ARN>, enter the Kinesis data stream ARN.
For <SECURITYLAKE-AZURE-STREAM-NAME>, enter the Kinesis data stream name.
For <SECURITYLAKE-AZURE-STREAM-KEYID>, enter the AWS Key Management Service (AWS KMS) key ID created through the CloudFormation template.

The <SECURITYLAKE-AZURE-STREAM-ARN> and securityLakeAzureActivityStream-<STREAM-UID> values are the same ones that you obtained earlier in this post (see Figure 5).

You can find the AWS KMS key ID within the AWS KMS managed key policy associated with securityLakeAzureActivityStream. For example, in the key policy shown in Figure 8, the <SECURITYLAKE-AZURE-STREAM-KEYID> is shown in line 3.

Figure 8: Kinesis data stream inputs

Important: When you work inside Azure with KMS key IDs retrieved from the AWS console or with AWS API keys, be extremely mindful of how you approach key management. Improper or poor handling of keys could result in the interception of data from the Kinesis stream or Azure function.

It's a security best practice to use a trusted key management architecture with sufficient encryption and security protocols when working with keys that safeguard sensitive security information. Within Azure, consider using services such as the AWS Azure AD integration for seamless, ephemeral credential usage inside the Azure function. See Azure AD Integration for more information on how the Azure AD integration safeguards and manages stored security keys and helps make sure that no keys are accessible to unauthorized parties or stored as unencrypted text outside the AWS console.
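The Azure function's job, forwarding Event Hubs events to Kinesis, comes down to building the Records list that the Kinesis PutRecords API accepts. The following is a minimal Python sketch of that data-shaping part only, assuming that each Event Hubs message body is JSON and that Azure Monitor may wrap entries in a top-level "records" array. The partition-key choice (correlationId) is a hypothetical design decision, and this is not the post's Azure Event Hub Function script.

```python
# Hypothetical sketch of the forwarding logic in the Azure function: turn Event Hubs
# message bodies into the Records list that the Kinesis PutRecords API expects.
# The "records" wrapper handling and the partition-key choice are assumptions.
import json
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def flatten_event_bodies(bodies: Iterable[str]) -> Iterator[dict]:
    """Yield individual log entries, unwrapping a top-level "records" array if present."""
    for body in bodies:
        payload = json.loads(body)
        if isinstance(payload, dict) and isinstance(payload.get("records"), list):
            yield from payload["records"]
        else:
            yield payload


def to_kinesis_records(entries: Iterable[dict]) -> List[dict]:
    return [
        {
            "Data": json.dumps(entry).encode("utf-8"),
            # Assumption: partition on correlationId so related operations share a shard.
            "PartitionKey": str(entry.get("correlationId", "azure-activity")),
        }
        for entry in entries
    ]


def batches(records: List[dict], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[dict]]:
    """Split records into chunks no larger than one PutRecords request allows."""
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

Inside the function, each batch would then be sent with a boto3 Kinesis client, for example kinesis.put_records(StreamName=stream_name, Records=batch), and any entries reported in FailedRecordCount would need to be retried. How that client obtains credentials is exactly the key-management concern raised in the note above.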

Step 3: Validate the workflow and query Athena

After you complete the preceding steps, your logs should be flowing. To make sure that the process is working correctly, complete the following steps.

In the Kinesis Data Streams console, verify that the logs are flowing to your data stream. Open the Kinesis stream that you created previously, choose the Data viewer tab, and then choose Get records, as shown in Figure 9.

Figure 9: Kinesis data stream inputs
Verify that the logs are partitioned and stored within the correct Security Lake bucket for the configured Region. The log partitions within the Security Lake bucket should follow the syntax region=<region>/account_id=<account_id>/eventDay=<YYYYMMDD>/, and the objects should be stored in the expected compressed Parquet format.

Figure 10: S3 bucket with object
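If you want to check object keys programmatically rather than by browsing the console, the partition syntax above can be built and matched with a few lines of Python. This is a sketch under assumptions: the optional leading prefix before region= and the .parquet file suffix in the pattern are guesses about how the objects are named, not documented guarantees.

```python
# Sketch: build and check the partition prefix that Security Lake expects.
# The optional leading prefix and the ".parquet" suffix in the pattern are assumptions.
import re
from datetime import date

PARTITION_PATTERN = re.compile(
    r"(?:^|/)region=[a-z0-9-]+/account_id=\d{12}/eventDay=\d{8}/[^/]+\.parquet$"
)


def partition_prefix(region: str, account_id: str, event_day: date) -> str:
    return f"region={region}/account_id={account_id}/eventDay={event_day:%Y%m%d}/"


def key_matches_layout(key: str) -> bool:
    return PARTITION_PATTERN.search(key) is not None


# prints region=us-east-2/account_id=111122223333/eventDay=20230501/
print(partition_prefix("us-east-2", "111122223333", date(2023, 5, 1)))
```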

Assuming that CloudTrail logs exist within your Amazon Security Lake instance as well, you can now create a query in Athena that pulls data from the newly created Azure activity table and examine it alongside your existing CloudTrail logs by running queries such as the following:

SELECT 
    api.operation,
    actor.user.uid,
    actor.user.name,
    src_endpoint.ip,
    time,
    severity,
    metadata.version,
    metadata.product.name,
    metadata.product.vendor_name,
    category_name,
    activity_name,
    type_uid
FROM {SECURITY-LAKE-DB}.{SECURITY-LAKE-AZURE-TABLE}
UNION ALL
SELECT 
    api.operation,
    actor.user.uid,
    actor.user.name,
    src_endpoint.ip,
    time,
    severity,
    metadata.version,
    metadata.product.name,
    metadata.product.vendor_name,
    category_name,
    activity_name,
    type_uid
FROM {SECURITY-LAKE-DB}.{SECURITY-LAKE-CLOUDTRAIL-TABLE}

Figure 11: Query Azure activity and CloudTrail together in Athena
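Because UNION ALL requires both SELECT lists to line up column by column, it can help to generate the query from a single column list. The following Python sketch does that. The database and table names are placeholders that you would replace with your own Security Lake names, and double-quoting identifiers is an assumption that fits Athena's SELECT syntax.

```python
# Sketch: generate the UNION ALL query from one shared column list so the two SELECT
# lists stay identical and no trailing comma slips in before FROM.
# Database and table names are placeholders you would replace with your own.
from typing import Sequence

OCSF_COLUMNS = (
    "api.operation", "actor.user.uid", "actor.user.name", "src_endpoint.ip",
    "time", "severity", "metadata.version", "metadata.product.name",
    "metadata.product.vendor_name", "category_name", "activity_name", "type_uid",
)


def build_union_query(database: str, tables: Sequence[str],
                      columns: Sequence[str] = OCSF_COLUMNS) -> str:
    select_list = ",\n    ".join(columns)
    # Double quotes delimit identifiers in Athena SELECT statements.
    selects = [f'SELECT\n    {select_list}\nFROM "{database}"."{table}"' for table in tables]
    return "\nUNION ALL\n".join(selects)


print(build_union_query("my_security_lake_db", ["azure_activity_table", "cloudtrail_table"]))
```

The joined column list never ends in a comma, which avoids the syntax error that a stray comma before FROM would cause.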

For additional guidance on how to configure access and query Amazon Security Lake in Athena, see the following resources:

Conclusion

In this blog post, you learned how to create and deploy the AWS and Microsoft Azure assets needed to bring your own data to Amazon Security Lake. By creating an AWS Glue streaming job that can transform Azure activity data streams, and by fronting that AWS Glue job with a Kinesis stream, you can enable Amazon Security Lake to ingest external Azure activity data streams.

You also learned how to configure Azure assets so that your Azure activity logs can stream to your Kinesis endpoint. The combination of these two creates a working, custom source solution for Azure activity logging.

To get started with Amazon Security Lake, see the Getting Started page, or if you already use Amazon Security Lake and want to read additional blog posts and articles about this service, see Blog posts and articles.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on Amazon Security Lake re:Post or contact AWS Support.


DevSecOps with Amazon CodeGuru Reviewer CLI and Bitbucket Pipelines

2023-04-28 Bineesh Ravindran

Post Syndicated from Bineesh Ravindran original https://aws.amazon.com/blogs/devops/devsecops-with-amazon-codeguru-reviewer-cli-and-bitbucket-pipelines/

DevSecOps refers to a set of best practices that integrate security controls into the continuous integration and delivery (CI/CD) workflow. One of the first controls is Static Application Security Testing (SAST). SAST tools run on every code change and search for potential security vulnerabilities before the code is executed for the first time. Catching security issues early in the development process significantly reduces the cost of fixing them and the risk of exposure.

This blog post shows how to set up a CI/CD pipeline using Bitbucket Pipelines and Amazon CodeGuru Reviewer. Bitbucket Pipelines is a cloud-based continuous delivery system that allows developers to automate builds, tests, and security checks with just a few lines of code. CodeGuru Reviewer is a cloud-based static analysis tool that uses machine learning and automated reasoning to generate code quality and security recommendations for Java and Python code.

We demonstrate step-by-step how to set up a pipeline with Bitbucket Pipelines, and how to call CodeGuru Reviewer from there. We then show how to view the recommendations produced by CodeGuru Reviewer in Bitbucket Code Insights, and how to triage and manage recommendations during the development process.

Bitbucket Overview

Bitbucket is a Git-based code hosting and collaboration tool built for teams. Bitbucket's best-in-class Jira and Trello integrations are designed to bring the entire software team together to execute a project. Bitbucket provides one place for a team to collaborate on code from concept to cloud, build quality code through automated testing, and deploy code with confidence. Bitbucket makes it easy for teams to collaborate and reduce issues found during integration by providing a way to merge and test code frequently. Bitbucket gives teams easy access to tools needed in other parts of the feedback loop, from creating an issue to deploying on your hardware of choice. It also provides more advanced features for customers that need them, like SAML authentication and secrets storage.

Solution Overview

Bitbucket Pipelines uses a Docker container to perform the build steps. You can specify any Docker image accessible by Bitbucket, including private images, if you specify credentials to access them. The container starts and then runs the build steps in the order specified in your configuration file. The build steps specified in the configuration file are nothing more than shell commands executed on the Docker image. Therefore, you can run scripts, in any language supported by the Docker image you choose, as part of the build steps. These scripts can be stored either directly in your repository or in an internet-accessible location. This solution demonstrates an easy way to integrate Bitbucket Pipelines with Amazon CodeGuru Reviewer using the bitbucket-pipelines.yml file.

You can interact with your Amazon Web Services (AWS) account from your Bitbucket Pipeline using the OpenID Connect (OIDC) feature. OpenID Connect is an identity layer above the OAuth 2.0 protocol.

Now that you understand how Bitbucket and your AWS Account securely communicate with each other, let’s look into the overall summary of steps to configure this solution.

1. Fork the repository.
2. Configure Bitbucket Pipelines as an IdP on AWS.
3. Create a custom IAM policy.
4. Create an IAM role.
5. Add the repository variables needed for the pipeline.
6. Add the CodeGuru Reviewer CLI to your pipeline.
7. Review CodeGuru recommendations.

Now let's look at each step in detail.

Step 1: Fork this repo

https://bitbucket.org/aws-samples/amazon-codeguru-samples

Figure 1: Fork the amazon-codeguru-samples Bitbucket repository.

Step 2: Configure Bitbucket Pipelines as an Identity Provider on AWS

Configuring Bitbucket Pipelines as an IdP in IAM enables Bitbucket Pipelines to issue authentication tokens to users to connect to AWS.
In your Bitbucket repo, go to Repository Settings > OpenID Connect. Note the provider URL and the Audience variable on that screen.

The Identity Provider URL will look like this:

https://api.bitbucket.org/2.0/workspaces/YOUR_WORKSPACE/pipelines-config/identity/oidc

This is the issuer URL for authentication requests. This URL issues a token to a requester automatically as part of the workflow. For more detail about the issuer URL, see the RFC. Replace YOUR_WORKSPACE with the name of your Bitbucket workspace.

And the Audience will look like:

ari:cloud:bitbucket::workspace/84c08677-e352-4a1c-a107-6df387cfeef7

This is the recipient the token is intended for. For more detail about the audience, see the Request for Comments (RFC), a memorandum published by the Internet Engineering Task Force (IETF) that describes methods for securely transmitting information between two parties using a JSON Web Token (JWT).

Figure 2: Configure Bitbucket Pipelines as an Identity Provider on AWS

Next, navigate to the IAM dashboard > Identity Providers > Add provider, select OpenID Connect as the provider type, and paste in the provider URL and audience that you noted. This tells AWS that Bitbucket Pipelines is a token issuer.
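If role assumption later fails, a common cause is a mismatch between the token's iss or aud claim and the values registered in IAM. The following Python sketch decodes a JWT payload for troubleshooting only. It does not verify the signature (AWS STS does the real validation), and it assumes that the iss claim equals the full provider URL shown in Bitbucket. Avoid printing real tokens in pipeline logs.

```python
# Troubleshooting sketch: decode the claims of an OIDC token WITHOUT verifying its signature,
# to compare "iss" and "aud" with the provider URL and audience you registered in IAM.
# AWS STS performs the real signature and claim validation; never trust unverified claims.
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("A JWT must have three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore the base64url padding that JWTs strip
    return json.loads(base64.urlsafe_b64decode(payload))


def claims_match(claims: dict, issuer_url: str, audience: str) -> bool:
    # Per RFC 7519, "aud" may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    # Assumption: "iss" carries the full provider URL, including the https:// scheme.
    return claims.get("iss") == issuer_url and audience in audiences
```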

Step 3: Create a custom policy

You can use the CLI with administrator credentials, but if you want a dedicated role for the CLI, its credentials must have at least the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeguru-reviewer:ListRepositoryAssociations",
                "codeguru-reviewer:AssociateRepository",
                "codeguru-reviewer:DescribeRepositoryAssociation",
                "codeguru-reviewer:CreateCodeReview",
                "codeguru-reviewer:DescribeCodeReview",
                "codeguru-reviewer:ListRecommendations",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucket*",
                "s3:List*",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::codeguru-reviewer-cli-<AWS ACCOUNT ID>*",
                "arn:aws:s3:::codeguru-reviewer-cli-<AWS ACCOUNT ID>*/*"
            ],
            "Effect": "Allow"
        }
    ]
}

To create an IAM policy, navigate to the IAM dashboard > Policies > Create Policy.

Paste the preceding JSON document into the JSON tab, as shown in the following screenshot, and replace <AWS ACCOUNT ID> with your own AWS account ID.

Figure 3: Create a policy.

Name your policy; in our example, we name it CodeGuruReviewerOIDC.

Figure 4: Review and create an IAM policy.
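If you keep the policy in your repository as a template, you can substitute the account ID and sanity-check the result before pasting it into the JSON tab. The following is a minimal Python sketch; it is not an AWS-provided tool, and the template below is a trimmed version of the policy above used only for illustration.

```python
# Sketch: substitute your account ID into the policy template and sanity-check the result
# before pasting it into the IAM console's JSON tab. Not an AWS-provided tool.
import json


def render_policy(template: str, account_id: str) -> dict:
    if not (len(account_id) == 12 and account_id.isdigit()):
        raise ValueError("An AWS account ID is a 12-digit number")
    policy = json.loads(template.replace("<AWS ACCOUNT ID>", account_id))
    for statement in policy["Statement"]:
        resources = statement["Resource"]
        # "Resource" may be a single string or a list of strings.
        for resource in [resources] if isinstance(resources, str) else resources:
            if "<" in resource or ">" in resource:
                raise ValueError(f"Placeholder left in resource: {resource}")
    return policy


TEMPLATE = """{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": ["arn:aws:s3:::codeguru-reviewer-cli-<AWS ACCOUNT ID>*"]
  }]
}"""
print(json.dumps(render_policy(TEMPLATE, "111122223333"), indent=4))
```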

Step 4: Create an IAM Role

Once you’ve enabled Bitbucket Pipelines as a token issuer, you need to configure permissions for those tokens so they can execute actions on AWS.
To create an IAM web identity role, navigate to the IAM dashboard > Roles > Create Role, and choose the IdP and audience you just created.

Figure 5: Create an IAM role

Next, select the CodeGuruReviewerOIDC policy to attach to the role.

Figure 6: Assign policy to role

Figure 7: Review and create role

Name your role; in our example, we name it CodeGuruReviewerOIDCRole.

After adding a role, copy the Amazon Resource Name (ARN) of the role created:

The Amazon Resource Name (ARN) will look like this:

arn:aws:iam::000000000000:role/CodeGuruReviewerOIDCRole

We will need this ARN in a later step, when we create AWS_OIDC_ROLE_ARN as a repository variable.

Step 5: Add repository variables needed for pipeline

Variables are configured as environment variables in the build container. You can access the variables from the bitbucket-pipelines.yml file, or from any script that you invoke, by referring to them. Pipelines provides a set of default variables that are available for builds and can be used in scripts. Along with the default variables, we need to configure a few additional variables, called repository variables, which are used to pass specific parameters to the pipeline.

Figure 8: Create repository variables

The following repository variables need to be configured for this solution.

1. AWS_DEFAULT_REGION: Create a repository variable AWS_DEFAULT_REGION with the value "us-east-1".

2. BB_API_TOKEN: Create a repository variable BB_API_TOKEN and paste the App password that you create using the following steps as its value.

App passwords are user-based access tokens for scripting tasks and integrating tools (such as CI/CD tools) with Bitbucket Cloud. These access tokens have reduced user access (specified at the time of creation) and can be useful for scripting, CI/CD tools, and testing Bitbucket-connected applications while they are in development.
To create an App password:

- Select your avatar (Your profile and settings) from the navigation bar at the top of the screen.
- Under Settings, select Personal settings.
- On the sidebar, select App passwords.
- Select Create app password.
- Give the App password a name, usually related to the application that will use the password.
- Select the permissions the App password needs. For detailed descriptions of each permission, see: App password permissions.
- Select the Create button. The page will display the New app password dialog.
- Copy the generated password and either record or paste it into the application you want to give access. The password is only displayed once and can’t be retrieved later.

3. BB_USERNAME: Create a repository variable BB_USERNAME and set its value to your Bitbucket username.

4. AWS_OIDC_ROLE_ARN

After adding a role in Step 4, copy the Amazon Resource Name (ARN) of the role created:

The Amazon Resource Name (ARN) will look something like this:

arn:aws:iam::000000000000:role/CodeGuruReviewerOIDCRole

and create AWS_OIDC_ROLE_ARN as a repository variable in the target Bitbucket repository.
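A missing or mistyped repository variable usually surfaces later as a confusing AWS CLI error. The following is a hypothetical pre-flight check, sketched in Python (the build image named in this post includes python3), that confirms the four variables above are set and that the role ARN looks well formed. The function names are illustrative and not part of Bitbucket or AWS.

```python
# Hypothetical pre-flight check a pipeline step could run before calling the CLI:
# confirm the repository variables from this step are set and the role ARN looks well formed.
import os
import re
from typing import List, Mapping, Optional, Sequence

REQUIRED_VARIABLES = ("AWS_DEFAULT_REGION", "BB_API_TOKEN", "BB_USERNAME", "AWS_OIDC_ROLE_ARN")
ROLE_ARN_PATTERN = re.compile(r"arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+")


def missing_variables(env: Optional[Mapping[str, str]] = None,
                      required: Sequence[str] = REQUIRED_VARIABLES) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def role_arn_looks_valid(arn: str) -> bool:
    return ROLE_ARN_PATTERN.fullmatch(arn) is not None
```

A step could call missing_variables() and exit with a non-zero status, listing the missing names, if the result is not empty.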

Step 6: Adding the CodeGuru Reviewer CLI to your pipeline

To add the CodeGuru Reviewer CLI to your pipeline, update the bitbucket-pipelines.yml file as follows:

#  Template: Gradle build with Amazon CodeGuru Reviewer

#  This template allows you to test and build your Java project with Gradle.
#  The workflow runs the build, a CodeGuru Reviewer security scan, and optional checks on the default branch.

# Prerequisites: a Gradle wrapper (gradlew), a Gradle build file, and an appropriate project structure should exist in the repository.

image: docker-public.packages.atlassian.com/atlassian/bitbucket-pipelines-mvn-python3-awscli

pipelines:
  default:
    - step:
        name: Build Source Code
        caches:
          - gradle
        script:
          - cd $BITBUCKET_CLONE_DIR
          - chmod +x ./gradlew
          - ./gradlew build
        artifacts:
          - build/**
    - step: 
        name: Download and Install CodeReviewer CLI   
        script:
          - curl -OL https://github.com/aws/aws-codeguru-cli/releases/download/0.2.3/aws-codeguru-cli.zip
          - unzip aws-codeguru-cli.zip
        artifacts:
          - aws-codeguru-cli/**
    - step:
        name: Run CodeGuruReviewer 
        oidc: true
        script:
          - export AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
          - export AWS_ROLE_ARN=$AWS_OIDC_ROLE_ARN
          - export S3_BUCKET=$S3_BUCKET

          # Setup aws cli
          - export AWS_WEB_IDENTITY_TOKEN_FILE=$(pwd)/web-identity-token
          - echo $BITBUCKET_STEP_OIDC_TOKEN > $(pwd)/web-identity-token
          - aws configure set web_identity_token_file "${AWS_WEB_IDENTITY_TOKEN_FILE}"
          - aws configure set role_arn "${AWS_ROLE_ARN}"
          - aws sts get-caller-identity

          # setup codegurureviewercli
          - export PATH=$PATH:./aws-codeguru-cli/bin
          - chmod +x ./aws-codeguru-cli/bin/aws-codeguru-cli

          - export SRC=$BITBUCKET_CLONE_DIR/src
          - export OUTPUT=$BITBUCKET_CLONE_DIR/test-reports
          - export CODE_INSIGHTS=$BITBUCKET_CLONE_DIR/bb-report

          # Calling Code Reviewer CLI
          - ./aws-codeguru-cli/bin/aws-codeguru-cli --region $AWS_DEFAULT_REGION  --root-dir $BITBUCKET_CLONE_DIR --build $BITBUCKET_CLONE_DIR/build/classes/java --src $SRC --output $OUTPUT --no-prompt --bitbucket-code-insights $CODE_INSIGHTS        
        artifacts:
          - test-reports/*.* 
          - target/**
          - bb-report/**
    - step: 
        name: Upload Code Insights Artifacts to Bitbucket Reports 
        script:
          - chmod +x upload.sh
          - ./upload.sh bb-report/report.json bb-report/annotations.json
    - step:
        name: Upload Artifacts to Bitbucket Downloads       # Optional Step
        script:
          - pipe: atlassian/bitbucket-upload-file:0.3.3
            variables:
              BITBUCKET_USERNAME: $BB_USERNAME
              BITBUCKET_APP_PASSWORD: $BB_API_TOKEN
              FILENAME: '**/*.json'
    - step:
        name: Validate Findings     # Optional step
        script:
          # Look into the CodeGuru Reviewer results and fail if there are Critical recommendations.
          # Each script item runs as a separate command, so the if statement must stay on one line.
          - count="$(grep -o "Critical" test-reports/recommendations.json | wc -l)"
          - echo "Number of Critical matches = $count"
          - if [ "$count" -gt 0 ]; then echo "Critical findings discovered. Failing."; exit 1; fi
        artifacts:
          - '**/*.json'

Let's look at the pipeline file to understand the steps defined in this pipeline.

Figure 9: Bitbucket pipeline execution steps

Step 1) Build Source Code

In this step, the source code is downloaded into a working directory and built using Gradle. All the build artifacts are then passed on to the next step.

Step 2) Download and Install Amazon CodeGuru Reviewer CLI
In this step, the Amazon CodeGuru Reviewer CLI is downloaded from a public GitHub repository and extracted into the working directory. All downloaded and extracted artifacts are then passed on to the next step.

Step 3) Run CodeGuruReviewer

This step uses the flag oidc: true, which declares that you are using the OIDC authentication method, while AWS_OIDC_ROLE_ARN identifies the role created in "Step 4: Create an IAM Role", which contains all of the permissions needed to work with AWS resources.
The repository variables are then exported and used to configure the AWS CLI. Finally, the Amazon CodeGuru Reviewer CLI, which was downloaded and extracted in the previous step, is used to invoke CodeGuru Reviewer with a set of parameters.

The following parameters are passed to the CodeGuru Reviewer CLI:
--region $AWS_DEFAULT_REGION The AWS region in which CodeGuru Reviewer will run (in this blog we used us-east-1).

--root-dir $BITBUCKET_CLONE_DIR The root directory of the repository that CodeGuru Reviewer should analyze.

--build $BITBUCKET_CLONE_DIR/build/classes/java Points to the build artifacts. Passing the Java build artifacts allows CodeGuru Reviewer to perform more in-depth bytecode analysis, but passing the build artifacts is not required.

--src $SRC Points to the source code that should be analyzed. This can be used to focus the analysis on certain source files, for example, to exclude test files. This parameter is optional, but focusing on relevant code can shorten analysis time and reduce cost.

--output $OUTPUT The directory where CodeGuru Reviewer will store its recommendations.

--no-prompt This ensures that CodeGuru Reviewer does not run in interactive mode, where it would pause for user input.

--bitbucket-code-insights $CODE_INSIGHTS The location where recommendations in Bitbucket Code Insights format should be written.

Once Amazon CodeGuru Reviewer scans the code based on the above parameters, it generates two Code Insights JSON files (report.json and annotations.json), which are then passed on as artifacts to the next step.

Step 4) Upload Code Insights Artifacts to Bitbucket Reports
In this step, the Code Insights report generated by Amazon CodeGuru Reviewer is uploaded to Bitbucket Reports. This makes the report available in the Reports section of the pipeline, as shown in the following screenshot.

Figure 10: CodeGuru Reviewer Report

Step 5) [Optional] Upload the copy of these reports to Bitbucket Downloads
This is an optional step where you can upload the artifacts to Bitbucket Downloads. This is especially useful because artifacts inside a build pipeline are deleted 14 days after the pipeline run. Using Bitbucket Downloads, you can store these artifacts for a much longer duration.

Figure 11: Bitbucket downloads

Step 6) [Optional] Validate findings by looking into the results and failing if there are any critical recommendations
This optional step shows how the CodeGuru Reviewer results can be used to determine the success or failure of a Bitbucket pipeline. In this step, the pipeline fails if a critical recommendation exists in the report.
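The grep in the Validate Findings step counts every occurrence of the word "Critical" in the file, including occurrences inside descriptions. The following Python sketch is a stricter alternative that parses the JSON and counts only entries whose severity is Critical. It assumes that recommendations.json is a JSON array of objects with a "severity" (or "Severity") field. Confirm the actual structure in your own output before relying on it.

```python
# Sketch of a stricter alternative to the grep-based "Validate Findings" step: parse
# recommendations.json and count entries whose severity field is Critical, instead of
# counting every occurrence of the word anywhere in the file.
# Assumption: the file is a JSON array of objects with a "severity" (or "Severity") field.
import json
from pathlib import Path


def count_by_severity(recommendations: list, severity: str = "Critical") -> int:
    count = 0
    for rec in recommendations:
        value = rec.get("severity", rec.get("Severity", ""))
        if str(value).lower() == severity.lower():
            count += 1
    return count


def main(path: str = "test-reports/recommendations.json") -> int:
    """Return a process exit code: 1 if any critical recommendation exists, else 0."""
    recommendations = json.loads(Path(path).read_text(encoding="utf-8"))
    critical = count_by_severity(recommendations)
    print(f"Critical recommendations: {critical}")
    return 1 if critical > 0 else 0
```

Saved as a hypothetical validate_findings.py in the repository, the step's script could then run python3 -c "import sys, validate_findings; sys.exit(validate_findings.main())".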

Step 7: Review CodeGuru recommendations

CodeGuru Reviewer supports different recommendation formats, including CodeGuru recommendation summaries, SARIF, and Bitbucket Code Insights.

Keeping your Pipeline Green

Now that CodeGuru Reviewer is running in our pipeline, we need to learn how to unblock ourselves if there are recommendations. The easiest way to unblock a pipeline is to address the CodeGuru recommendation. To validate on our local machine that a change addresses a recommendation, we can run the same CLI that we use as part of our pipeline.
Sometimes, it is not convenient to address a recommendation, for example, because there are mitigations outside of the code that make the recommendation less relevant, or simply because the team agrees that they don't want to block deployments on recommendations unless they are critical. For these cases, developers can add a .codeguru-ignore.yml file to their repository, where they can specify a variety of criteria under which a recommendation should not be reported. The following example file explains all available criteria for filtering recommendations in its comments; developers can use any subset of those criteria in their own .codeguru-ignore.yml file.

version: 1.0 # The version number is mandatory. All other entries are optional.

# The CodeGuru Reviewer CLI produces a recommendations.json file which contains deterministic IDs for each
# recommendation. This ID can be excluded so that this recommendation will not be reported in future runs of the
# CLI.
ExcludeById:
  - '4d2c43618a2dac129818bef77093730e84a4e139eef3f0166334657503ecd88d'
# We can tell the CLI to exclude all recommendations below a certain severity. This can be useful in CI/CD integration.
ExcludeBelowSeverity: 'HIGH'
# We can exclude all recommendations that have a certain tag. Available tags can be found here:
# https://docs.aws.amazon.com/codeguru/detector-library/java/tags/
# https://docs.aws.amazon.com/codeguru/detector-library/python/tags/
ExcludeTags:
  - 'maintainability'
# We can also exclude recommendations by detector ID. Detector IDs can be found here:
# https://docs.aws.amazon.com/codeguru/detector-library
ExcludeRecommendations:
# Ignore all recommendations for a given detector ID.
  - detectorId: 'java/[email protected]'
# Ignore all recommendations for a given detector ID in a provided set of locations.
# Locations can be written as Unix GLOB expressions using wildcard symbols.
  - detectorId: 'java/[email protected]'
    Locations:
      - 'src/main/java/com/folder01/*.java'
# Excludes all recommendations in the provided files. Files can be provided as Unix GLOB expressions.
ExcludeFiles:
  - tst/**

The recommendations will still be reported in the CodeGuru Reviewer console, but not by the CodeGuru Reviewer CLI and thus they will not block the pipeline anymore.
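To make the filtering semantics of ExcludeById and ExcludeBelowSeverity concrete, the following Python sketch shows one way to read them: drop listed IDs, then drop anything strictly below the threshold severity. This is an illustration only, not the CLI's implementation. The "id" and "severity" field names are assumptions, and keeping recommendations with an unrecognized severity is a deliberate design choice of this sketch.

```python
# Illustration only (not the CLI's implementation) of how ExcludeById and
# ExcludeBelowSeverity narrow the reported recommendations. The "id" and "severity"
# field names are assumptions.
SEVERITY_ORDER = ("info", "low", "medium", "high", "critical")


def apply_ignore_rules(recommendations, exclude_ids=(), exclude_below_severity=None):
    excluded = set(exclude_ids)
    threshold = (SEVERITY_ORDER.index(exclude_below_severity.lower())
                 if exclude_below_severity else None)
    kept = []
    for rec in recommendations:
        if rec.get("id") in excluded:
            continue
        severity = str(rec.get("severity", "")).lower()
        # Unknown severities are kept rather than silently dropped.
        below_threshold = (threshold is not None and severity in SEVERITY_ORDER
                           and SEVERITY_ORDER.index(severity) < threshold)
        if below_threshold:
            continue
        kept.append(rec)
    return kept
```

With ExcludeBelowSeverity set to 'HIGH', High and Critical recommendations are still reported, while Info, Low, and Medium ones are filtered out.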

Conclusion

In this post, we outlined how you can set up a CI/CD pipeline using Bitbucket Pipelines and Amazon CodeGuru Reviewer, and how you can integrate the Amazon CodeGuru Reviewer CLI with Bitbucket Pipelines, the cloud-based continuous delivery system that allows developers to automate builds, tests, and security checks with just a few lines of code. We showed you how to create a Bitbucket pipeline and integrate the CodeGuru Reviewer CLI to detect issues in your Java and Python code, and how to access the recommendations for remediating these issues.

We presented an example where you can stop the build upon finding critical violations. Furthermore, we discussed how you can upload these artifacts to Bitbucket Downloads and store them for a much longer duration. The CodeGuru Reviewer CLI offers you a one-line command to scan any code on your machine and retrieve recommendations. You can use the CLI to integrate CodeGuru Reviewer into your favorite CI tool or as a pre-commit hook in your workflow. In turn, you can combine CodeGuru Reviewer with Dynamic Application Security Testing (DAST) and Software Composition Analysis (SCA) tools to achieve a hybrid application security testing method that helps you combine the inside-out and outside-in testing approaches, cross-reference results, and detect vulnerabilities that both exist and are exploitable.

If you need hands-on keyboard support, then AWS Professional Services can help implement this solution in your enterprise, and introduce you to our AWS DevOps services and offerings.

Building GitHub with Ruby and Rails

2023-04-06 Adam Hess

Post Syndicated from Adam Hess original https://github.blog/2023-04-06-building-github-with-ruby-and-rails/

Since the beginning, GitHub.com has been a Ruby on Rails monolith. Today, the application is nearly two million lines of code and more than 1,000 engineers collaborate on it daily. We deploy as often as 20 times a day, and nearly every week one of those deploys is a Rails upgrade.

Upgrading Rails weekly

Every Monday, a scheduled GitHub Actions workflow opens an automated pull request that bumps our Rails version to the latest commit on the Rails main branch for that day. All our builds run on this new version of Rails. Once all the builds pass, we review the changes and ship them the next day. So when an upgrade starts on Monday, there is already an open pull request linking the changes this Rails upgrade proposes, along with a completed build.
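The post does not show GitHub's tooling for the bump itself, but its core step can be sketched. This minimal Python sketch assumes a hypothetical Gemfile that pins Rails to a commit with `gem "rails", github: "rails/rails", ref: "<sha>"`; the function name `bump_rails_ref` is illustrative. Scheduling (for example, a GitHub Actions `schedule` trigger) and opening the pull request are left out.

```python
import re

# Hypothetical Gemfile line this sketch expects (an assumption, not GitHub's actual Gemfile):
#   gem "rails", github: "rails/rails", ref: "<commit sha>"
RAILS_PIN = re.compile(
    r'^(gem\s+"rails",\s*github:\s*"rails/rails",\s*ref:\s*")([0-9a-f]{7,40})(")',
    re.MULTILINE,
)
SHA = re.compile(r"[0-9a-f]{7,40}")


def bump_rails_ref(gemfile: str, new_sha: str) -> str:
    """Return the Gemfile text with the pinned Rails commit replaced by new_sha."""
    if not SHA.fullmatch(new_sha):
        raise ValueError(f"not a commit SHA: {new_sha!r}")
    # Replace only the SHA, leaving the rest of the line untouched.
    updated, count = RAILS_PIN.subn(
        lambda m: m.group(1) + new_sha + m.group(3), gemfile
    )
    if count != 1:
        raise ValueError(f"expected exactly one pinned Rails line, found {count}")
    return updated
```

After rewriting the Gemfile, something like `bundle update rails` would refresh Gemfile.lock before the pull request is opened, so that CI builds against the new commit.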

This process is a far cry from how we did Rails upgrades only a few years ago. In the past, we spent months migrating from our custom fork of Rails to a newer stable release, and then we maintained two Gemfiles to ensure we’d remain compatible with the upcoming release. Now, upgrades take under a week. You can read more about this process in this 2018 blog post. We work closely with the community to ensure that each Rails release is running in production before the release is officially cut.

There are real tangible benefits to running the latest version of Rails:

We give developers at GitHub the very best version of our tools by providing the latest version of Rails. This ensures users can take advantage of all the latest improvements including better database connection handling, faster view rendering, and all the amazing work happening in Rails every day.
We have removed nearly all of our Rails patches. Since we are running on the latest version of Rails, instead of patching Rails and waiting for a change, developers can suggest the patch to Rails itself.
Working on Rails is now easier than ever to share with your team! Instead of telling your team you found something in Rails that will be fixed in the next release, you can work on something in Rails and see it the following week!
Maintaining more up-to-date dependencies gives us a better security posture. Since we already do weekly upgrades, adding an upgrade when there is a security advisory is standard practice and doesn’t require any extra work.
There are no “big bang” migrations. Since each Rails upgrade incorporates only a small number of changes, it’s easier to understand and dig into if there are incompatibilities. The worst issues from a tough upgrade are unexpected changes from an unknown location. These issues can be mitigated by this upgrade strategy.
Catching bugs in the main branch and contributing back strengthens our engineering team and helps our developers deepen their expertise and understanding of our application and its dependencies.

Testing Ruby continuously

Naturally, we have a similar process for Ruby upgrades. In February 2022, shortly after upgrading to Ruby 3.1, we started building and testing Ruby shas from 3.2-alpha in a parallel build. When CI runs for the GitHub Rails application, two versions of the builds run: one build uses the Ruby version we are running in production and one uses the latest Ruby commit including the latest changes in Ruby, which we update weekly.

While we build Ruby with every change, GitHub only ships numbered Ruby versions to production. The builds help us maintain compatibility with the upcoming Ruby version and give us insight into what Ruby changes are coming.

In early December 2022, with CI giving us confidence that we were compatible ahead of the usual Christmas release of Ruby 3.2, we were able to test Ruby release candidates with a portion of production traffic and give the Ruby team insights into any changes we noticed. For example, we reproduced an increase in allocations due to keyword argument handling, which, thanks to this process, was fixed before the release of Ruby 3.2. We also identified a subtle change in when #to_str and #to_i are applied. Because we upgrade all the time, identifying and resolving these issues was standard practice.

This weekly upgrade process for Ruby allowed us to upgrade our monolith from Ruby 3.1 to Ruby 3.2 within a month of release. After all, we had already tested and run it in production! At this point, this was the fastest Ruby upgrade we had ever done. We broke this record with the release of Ruby 3.2.1, which we adopted on release day.

This upgrade process has proved to be invaluable for our collaboration with the Ruby core team. A nice side effect of having these builds is that we are able to easily test and profile our own Ruby changes before we suggest them upstream. This can make it easier for us to identify regressions in our own application and better understand the impact of changes on a production environment.

Should I do it, too?

Our ability to do frequent Ruby and Rails upgrades is due to some engineering maturity at GitHub. Doing weekly Rails upgrades requires a thorough test suite with many great engineers working to maintain and improve it. We also gain confidence from having great test environments along with progressive rollout deploys. Our test suite is likely to catch problems, and if it doesn’t, we are confident we will catch it during deploy before it reaches customers.

If you have these tools, you should also upgrade Rails weekly and test using the latest Ruby. GitHub is a better Rails app because of it and it has enabled work from my team that I am really proud of.

Ruby champion Eileen Uchitelle explains why investing in Rails is important in her RailsConf 2022 keynote:

Ultimately, if more companies treated the framework as an extension of the application, it would result in higher resilience and stability. Investment in Rails ensures your foundation will not crumble under the weight of your application. Treating it as an unimportant part of your application is a mistake and many, many leaders make this mistake.

Thanks to contributions from people around the world, using Ruby is better than ever. GitHub, along with hundreds of other companies, benefits from Ruby and Rails continuing to improve. Upgrading regularly and investing in our frameworks is a staple of the work we do on the Ruby Architecture team at GitHub. We are always grateful for the Ruby community and glad that we can give back in a way that improves our application and tools as much as it improves them for everyone else.

Publish Amazon DevOps Guru Insights to ServiceNow for Incident Management

2023-03-29 Abdullahi Olaoye

Post Syndicated from Abdullahi Olaoye original https://aws.amazon.com/blogs/devops/publish-amazon-devops-guru-insights-to-servicenow-for-incident-management/

Amazon DevOps Guru is a fully managed AIOps service that uses machine learning (ML) to quickly identify when applications are behaving outside of their normal operating patterns and generates insights from its findings. These insights can be used to alert on-call teams to react to anomalies in mission-critical workloads. Many customers already use incident management systems like ServiceNow to identify, analyze, and resolve critical incidents that could impact business operations. ServiceNow is an IT Service Management (ITSM) platform that enables enterprise organizations to improve operational efficiencies. Among its products is Incident Management, which provides customers with a single-pane view and allows them to restore services and resolve issues quickly.

This blog post will show you how to integrate Amazon DevOps Guru insights with ServiceNow to automatically create and manage Incidents. We will demonstrate how an insight generated by Amazon DevOps Guru for an anomaly can automatically create a ServiceNow Incident, update the incident when there are new anomalies or recommendations from Amazon DevOps Guru, and close the ServiceNow Incident once the insight is resolved by Amazon DevOps Guru.

Overview of solution

This solution uses a combination of event-driven architecture and serverless technologies to integrate DevOps Guru insights with ServiceNow. When an Amazon DevOps Guru insight is created, an Amazon EventBridge rule captures the insight as an event and routes it to an AWS Lambda function target. The Lambda function uses the ServiceNow REST API to create, update, and close an incident for the corresponding DevOps Guru events captured by EventBridge.
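To make the REST interaction more concrete, here is a minimal Python sketch that builds (but does not send) ServiceNow Table API requests for the incident table: `POST /api/now/table/incident` creates a record and `PATCH /api/now/table/incident/{sys_id}` updates one. The connector itself is written in Java; the helper name `build_incident_request` and the fields mentioned below are assumptions for illustration, and authentication is omitted.

```python
import json
import urllib.request
from typing import Optional

INCIDENT_PATH = "/api/now/table/incident"


def build_incident_request(
    host: str, method: str, fields: dict, sys_id: Optional[str] = None
) -> urllib.request.Request:
    """Build a Table API request: POST creates an incident, PATCH updates one by sys_id."""
    if method == "POST" and sys_id is None:
        url = f"https://{host}{INCIDENT_PATH}"
    elif method == "PATCH" and sys_id:
        url = f"https://{host}{INCIDENT_PATH}/{sys_id}"
    else:
        raise ValueError("use POST without sys_id, or PATCH with sys_id")
    # The body is the JSON object of incident fields to set.
    return urllib.request.Request(
        url,
        data=json.dumps(fields).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
```

In this sketch, a new insight would map to a `POST` with a field such as `short_description`, a new anomaly or recommendation to a `PATCH` that adds `work_notes`, and a closed insight to a `PATCH` that sets `state` (in a base ServiceNow instance, `"6"` is typically Resolved, but check your instance). The request could then be sent with `urllib.request.urlopen` once an `Authorization` header is added.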

The EventBridge rule can be customized to capture all DevOps Guru insights or narrowed down to specific insights. In this blog, we will be capturing all DevOps Guru insights and will be performing actions on ServiceNow for the below DevOps Guru events:

DevOps Guru New Insight Open
DevOps Guru New Anomaly Association
DevOps Guru Insight Severity Upgraded
DevOps Guru New Recommendation Created
DevOps Guru Insight Closed
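These five events map naturally onto three incident actions. The Python sketch below shows one way a handler could route them; it assumes the EventBridge rule matches on the `aws.devops-guru` event source and that each event's `detail-type` carries one of the names listed above. The mapping to `create`/`update`/`resolve` and the helper names are our illustration, not necessarily how the connector's Lambda function is written.

```python
import json
from typing import Optional

# Event pattern for an EventBridge rule that captures all DevOps Guru events.
EVENT_PATTERN = {"source": ["aws.devops-guru"]}

# Illustrative mapping from DevOps Guru detail-type to a ServiceNow incident action.
ACTIONS = {
    "DevOps Guru New Insight Open": "create",
    "DevOps Guru New Anomaly Association": "update",
    "DevOps Guru Insight Severity Upgraded": "update",
    "DevOps Guru New Recommendation Created": "update",
    "DevOps Guru Insight Closed": "resolve",
}


def route(event: dict) -> Optional[str]:
    """Return the incident action for an EventBridge event, or None to ignore it."""
    if event.get("source") not in EVENT_PATTERN["source"]:
        return None
    return ACTIONS.get(event.get("detail-type", ""))


def narrowed_pattern(detail_types: list) -> str:
    """Build a JSON event pattern that only matches the given detail-types."""
    return json.dumps({**EVENT_PATTERN, "detail-type": sorted(detail_types)})
```

Using `EVENT_PATTERN` as-is corresponds to capturing all insights; `narrowed_pattern` shows how the rule could be limited to specific insight events instead.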

Figure 1: Amazon DevOps Guru Integration with ServiceNow using Amazon EventBridge and AWS Lambda

Solution Implementation Steps

Prerequisites

Before you deploy the solution and proceed with this walkthrough, you should have the following prerequisites:

Gather the hostname for your ServiceNow cloud instance. If you do not have a ServiceNow instance, you can request a developer instance through the ServiceNow Developer page.
Gather the credentials of a ServiceNow user who has permissions to make REST API calls to ServiceNow, specifically to the Table API. If you don’t have a user provisioned, you can create one by following the steps in Getting started with the REST API in the ServiceNow documentation.
Create a secret in Secrets Manager to store the ServiceNow credentials created in previous step. You can choose any name for the secret but it should have two key/value pairs, one for username and other for password.
Enable DevOps Guru for your applications by following these steps or you can follow this blog to deploy a sample serverless application that can be used to generate DevOps Guru insights for anomalies detected in the application.
Install and set up SAM CLI – Install the SAM CLI
Download and set up Java. The version should match the runtime that you defined in the Serverless function configuration of the SAM template.yaml – Install the Java SE Development Kit 11
Maven – Install Maven
Docker – Install Docker community edition
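The two key/value pairs in the secret end up as the credentials for the REST calls. As a minimal Python sketch, assuming the keys are literally named `username` and `password` and that ServiceNow Basic authentication is used (both assumptions, so check what the connector expects), the secret string could be turned into an `Authorization` header as follows. In Lambda, the string would come from the `SecretString` field of the Secrets Manager `GetSecretValue` response.

```python
import base64
import json


def basic_auth_header(secret_string: str) -> str:
    """Turn a SecretString like {"username": ..., "password": ...}
    into an HTTP Basic Authorization header value."""
    creds = json.loads(secret_string)
    missing = [key for key in ("username", "password") if key not in creds]
    if missing:
        raise ValueError(f"secret is missing key(s): {', '.join(missing)}")
    raw = f"{creds['username']}:{creds['password']}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Inside a Lambda function (requires boto3 and AWS credentials), something like:
#   secret = boto3.client("secretsmanager").get_secret_value(SecretId=secret_name)
#   header = basic_auth_header(secret["SecretString"])
```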

You have two options to deploy this solution: one option is to deploy from the AWS Serverless Application Repository, and the other is to deploy from the Command Line Interface (CLI).

Option 1: Deploy sample ServiceNow Connector App from AWS Serverless Repository

The DevOps Guru ServiceNow Connector application is available in the AWS Serverless Application Repository, which is a managed repository for serverless applications. The application is packaged with an AWS Serverless Application Model (SAM) template, which defines the AWS resources used, and a link to the source code. Follow the steps below to quickly deploy this serverless application in your AWS account:

Log in to the AWS Management Console of the account to which you plan to deploy this solution.
Go to the DevOps Guru ServiceNow Connector application in the AWS Serverless Repository and click on “Deploy”.

Figure 2: Deploy solution through AWS Serverless Repository
The Lambda application deployment screen will be displayed where you can enter the ServiceNow hostname (do not include the https prefix) and the Secret Name you created in the prerequisite steps. Click on the ‘Deploy’ button.

Figure 3: AWS Lambda Application Settings
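Because the connector expects a bare host name, it can help to normalize whatever gets pasted into the ServiceNow hostname field. A small Python sketch (the function `normalize_servicenow_host` is hypothetical and not part of the application) that strips an accidental `https://` prefix and any trailing path:

```python
from urllib.parse import urlsplit


def normalize_servicenow_host(value: str) -> str:
    """Reduce input such as 'https://dev92031.service-now.com/' to a bare host name."""
    value = value.strip()
    if "://" in value:
        # With a scheme present, urlsplit puts the host (and any port) in netloc.
        value = urlsplit(value).netloc
    # Drop any path pasted without a scheme, then lowercase the host name.
    host = value.split("/", 1)[0].lower()
    if not host:
        raise ValueError("empty ServiceNow host")
    return host
```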

After successful deployment, the AWS Lambda Application page will display the “Create complete” status for the serverlessrepo-DevOps-Guru-ServiceNow-Connector application. The CloudFormation template creates four resources:
1. Lambda function, which contains the logic to integrate with ServiceNow
2. EventBridge rule for the DevOps Guru insights
3. Lambda permission
4. IAM role

Now you can skip Option 2 and follow the steps in the “Test the Solution” section to trigger some DevOps Guru insights and validate that the incidents are created and updated in ServiceNow.

Option 2: Build and Deploy sample ServiceNow Connector App using AWS SAM Command Line Interface

As you have seen above, you can directly deploy the sample serverless application from the Serverless Application Repository with a one-click deployment. Alternatively, you can clone the GitHub source repository and deploy using the SAM CLI from your terminal.

The Serverless Application Model Command Line Interface (SAM CLI) is an extension of the AWS CLI that adds functionality for building and testing serverless applications. The CLI provides commands that enable you to verify that AWS SAM template files are written according to the specification, invoke Lambda functions locally, step-through debug Lambda functions, package and deploy serverless applications to the AWS Cloud, and so on. For details about how to use the AWS SAM CLI, including the full AWS SAM CLI Command Reference, see AWS SAM reference – AWS Serverless Application Model.

Before you proceed, make sure you have completed the Prerequisites section in the beginning which should set up the AWS SAM CLI, Maven and Java on your local terminal. You also need to install and set up Docker to run your functions in an Amazon Linux environment that matches Lambda.

Follow the steps below to build and deploy this serverless application using AWS SAM CLI in your AWS account:

1. Clone the source code from the GitHub repo and change into the project directory

$ git clone https://github.com/aws-samples/amazon-devops-guru-connector-servicenow.git
$ cd amazon-devops-guru-connector-servicenow

2. Before you build the resources defined in the SAM template, you can use the validate command below, which runs cfn-lint validations on your SAM JSON/YAML template

$ sam validate --lint --template template.yaml

3. Build the application with SAM CLI

$ sam build

If everything is set up correctly, you should have a success message like shown below:

Build Succeeded

Built Artifacts : .aws-sam/build
Built Template : .aws-sam/build/template.yaml

Commands you can use next
=========================
[*] Validate SAM template: sam validate
[*] Invoke Function: sam local invoke
[*] Test Function in the Cloud: sam sync --stack-name {{stack-name}} --watch
[*] Deploy: sam deploy --guided

4. Deploy the application with SAM CLI

$ sam deploy --guided

This command will package and deploy your application to AWS, with a series of prompts that you should respond to as shown below:

Stack Name: The name of the stack to deploy to CloudFormation. This should be unique to your account and region, and a good starting point would be something matching your project name – amazon-devops-guru-connector-servicenow
AWS Region: The AWS region you want to deploy your application to.
Parameter ServiceNowHost []: The ServiceNow host name/instance URL you set up. Example: dev92031.service-now.com
Parameter SecretName []: The secret name that you set up for ServiceNow credentials in the Prerequisites.
Confirm changes before deploy: If set to yes, any change sets will be shown to you before execution for manual review. If set to no, the AWS SAM CLI will automatically deploy application changes.
Allow SAM CLI IAM role creation: Many AWS SAM templates, including this example, create AWS IAM roles required for the AWS Lambda function(s) included to access AWS services. By default, these are scoped down to minimum required permissions. To deploy an AWS CloudFormation stack which creates or modifies IAM roles, the CAPABILITY_IAM value for capabilities must be provided. If permission isn’t provided through this prompt, to deploy this example you must explicitly pass --capabilities CAPABILITY_IAM to the sam deploy command.
Disable rollback [y/N]: If set to Y, preserves the state of previously provisioned resources when an operation fails.
Save arguments to configuration file (samconfig.toml): If set to yes, your choices will be saved to a configuration file inside the project, so that in the future you can just re-run sam deploy without parameters to deploy changes to your application.
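For automation such as a CI job, the answers to these prompts can be passed as flags instead of using `--guided`. Here is a Python sketch that assembles such a command. The flags (`--stack-name`, `--region`, `--parameter-overrides`, `--capabilities`, `--resolve-s3`, `--no-confirm-changeset`) are standard `sam deploy` options and the parameter names come from the prompts above, while the function itself and the example values are illustrative.

```python
import shlex


def sam_deploy_command(stack_name: str, region: str,
                       servicenow_host: str, secret_name: str) -> list:
    """Build a non-interactive equivalent of the guided `sam deploy` answers."""
    overrides = f"ServiceNowHost={servicenow_host} SecretName={secret_name}"
    return [
        "sam", "deploy",
        "--stack-name", stack_name,
        "--region", region,
        "--parameter-overrides", overrides,  # one space-separated Key=Value string
        "--capabilities", "CAPABILITY_IAM",  # same as allowing IAM role creation
        "--resolve-s3",  # let SAM CLI create or reuse a managed artifacts bucket
        "--no-confirm-changeset",  # skip the "Deploy this changeset?" prompt
    ]


cmd = sam_deploy_command(
    "amazon-devops-guru-connector-servicenow", "us-east-1",
    "dev92031.service-now.com", "SNOW_CREDS",
)
# Print a shell-ready command, or run it with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```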

After you enter your parameters, you should see something like this if you have provided Y to view and confirm ChangeSets. Proceed here by providing ‘Y’ for deploying the resources.

Initiating deployment
=====================
Uploading to amazon-devops-guru-connector-servicenow/46bb4841f8f37fd41d3f40f86f31c4d7.template 1918 / 1918 (100.00%)

Waiting for changeset to be created..
CloudFormation stack changeset
-----------------------------------------------------------------------------------------------------------------------------------------------------
Operation LogicalResourceId ResourceType Replacement
-----------------------------------------------------------------------------------------------------------------------------------------------------
+ Add FunctionsDevOpsGuruPermission AWS::Lambda::Permission N/A
+ Add FunctionsDevOpsGuru AWS::Events::Rule N/A
+ Add FunctionsRole AWS::IAM::Role N/A
+ Add Functions AWS::Lambda::Function N/A
-----------------------------------------------------------------------------------------------------------------------------------------------------

Changeset created successfully. arn:aws:cloudformation:us-east-1:123456789012:changeSet/samcli-deploy1669232233/7c97b7f5-369d-400d-89cd-ebabefaa0b57

Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]:

Once the deployment succeeds, you should be able to see the successful creation of your resources

CloudFormation events from stack operations (refresh every 0.5 seconds)
-----------------------------------------------------------------------------------------------------------------------------------------------------
ResourceStatus ResourceType LogicalResourceId ResourceStatusReason
-----------------------------------------------------------------------------------------------------------------------------------------------------
CREATE_IN_PROGRESS AWS::CloudFormation::Stack amazon-devops-guru-connector- User Initiated
servicenow
CREATE_IN_PROGRESS AWS::IAM::Role FunctionsRole -
CREATE_IN_PROGRESS AWS::IAM::Role FunctionsRole Resource creation Initiated
CREATE_COMPLETE AWS::IAM::Role FunctionsRole -
CREATE_IN_PROGRESS AWS::Lambda::Function Functions -
CREATE_IN_PROGRESS AWS::Lambda::Function Functions Resource creation Initiated
CREATE_COMPLETE AWS::Lambda::Function Functions -
CREATE_IN_PROGRESS AWS::Events::Rule FunctionsDevOpsGuru -
CREATE_IN_PROGRESS AWS::Events::Rule FunctionsDevOpsGuru Resource creation Initiated
CREATE_COMPLETE AWS::Events::Rule FunctionsDevOpsGuru -
CREATE_IN_PROGRESS AWS::Lambda::Permission FunctionsDevOpsGuruPermission -
CREATE_IN_PROGRESS AWS::Lambda::Permission FunctionsDevOpsGuruPermission Resource creation Initiated
CREATE_COMPLETE AWS::Lambda::Permission FunctionsDevOpsGuruPermission -
CREATE_COMPLETE AWS::CloudFormation::Stack amazon-devops-guru-connector- -
servicenow
-----------------------------------------------------------------------------------------------------------------------------------------------------

Successfully created/updated stack - amazon-devops-guru-connector-servicenow in us-east-1

You can also use the below command to list the resources deployed by passing in the stack name.

$ sam list resources --stack-name amazon-devops-guru-connector-servicenow

You can also test and debug your function locally with sample events using the SAM CLI local functionality. Test a single function by invoking it directly with a test event. An event is a JSON document that represents the input that the function receives from the event source. For more details, refer to Invoking Lambda functions locally in the AWS Serverless Application Model documentation.
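To see the shape of such an event, here is a Python sketch that writes a test event file in the standard EventBridge envelope (`version`, `id`, `detail-type`, `source`, `account`, `time`, `region`, `resources`, `detail`). The contents of `detail` are placeholders we made up; the repository's own sample events, such as `Functions/src/test/Events/CreateIncident.json`, show the fields the connector actually reads.

```python
import json
import uuid
from datetime import datetime, timezone


def devops_guru_test_event(detail_type: str, detail: dict,
                           account: str = "123456789012",
                           region: str = "us-east-1") -> dict:
    """Wrap a detail payload in the standard EventBridge event envelope."""
    return {
        "version": "0",
        "id": str(uuid.uuid4()),
        "detail-type": detail_type,
        "source": "aws.devops-guru",
        "account": account,
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "region": region,
        "resources": [],
        "detail": detail,
    }


# Hypothetical detail payload; replace it with fields your function actually reads.
event = devops_guru_test_event("DevOps Guru New Insight Open",
                               {"insightId": "example-insight-id"})
with open("my-test-event.json", "w", encoding="utf-8") as f:
    json.dump(event, f, indent=2)
# Then, for example: sam local invoke Functions --event my-test-event.json --env-vars <path to env.json>
```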

Follow the steps below to test the Lambda function locally with the SAM CLI. You need an env.json file with the correct values for your ServiceNow host and the name of the Secrets Manager secret you created in the prerequisites.

Make sure you have created the AWS Secrets Manager secret described in the prerequisites; its name is the value to use for SECRET_NAME.
Create env.json as shown below, replacing the values for SERVICE_NOW_HOST and SECRET_NAME with your real values. These are set as environment variables in the local Lambda execution environment. The invoke command below reads the file from Functions/src/test/Events/env.json, so save it there or adjust the --env-vars path.

{"Parameters": {"SERVICE_NOW_HOST": "SNOW_HOST","SECRET_NAME": "SNOW_CREDS"}}

Run the command below to invoke the Lambda function locally with a sample DevOps Guru payload. For this to work, Docker must be running and the secret must already exist in your AWS account.

$ sam local invoke Functions --event Functions/src/test/Events/CreateIncident.json --env-vars Functions/src/test/Events/env.json

Once you are done with the above steps, move on to “Test the Solution” section below to trigger sample DevOps Guru insights and validate that the incidents are created and updated in ServiceNow.

Test the Solution

To test the solution, we will simulate a DevOps Guru insight. You can also simulate an insight by following the steps in this blog. After an anomaly is detected in the application, DevOps Guru creates an insight as seen below.

Sample DevOps Guru insights page with anomalous behavior of DynamoDB ThrottledRequests from the application deployed with the workshop link.

Figure 4: DevOps Guru Insight created for anomalous behavior

For the DevOps Guru insight shown above, a corresponding incident is automatically created in ServiceNow as shown below. In addition to the incident creation, any new anomalies and recommendations from DevOps Guru are also associated with the incident.

ServiceNow incident detail page with the DevOps Guru insight information.

Figure 5: Corresponding ServiceNow Incident is created for the DevOps Guru Insight

When the anomalous behavior that generated the DevOps Guru insight is resolved, DevOps Guru automatically closes the insight. The corresponding ServiceNow incident that was created for the insight is also closed, as seen below.

ServiceNow incident Notes section showing Incident as resolved due to the insight being closed in Amazon DevOps Guru.

Figure 6: ServiceNow Incident created for DevOps Guru Insight is resolved due to insight closure

Cleaning up

To avoid incurring future charges, delete the resources.

To delete the sample application that you created, use the AWS CLI command below and pass the stack name you provided in the sam deploy step.

$ aws cloudformation delete-stack --stack-name amazon-devops-guru-connector-servicenow
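The `delete-stack` call returns before the resources are actually gone. If you script the cleanup, a sketch like the following waits for the deletion to finish using boto3's `stack_delete_complete` waiter (the CLI equivalent is `aws cloudformation wait stack-delete-complete`). The function name is our own, and the CloudFormation client is passed in so the function can be exercised without AWS.

```python
def delete_stack_and_wait(cf_client, stack_name: str,
                          delay_seconds: int = 15, max_attempts: int = 40) -> None:
    """Delete a CloudFormation stack and block until deletion completes.

    cf_client is expected to be a boto3 CloudFormation client, for example
    boto3.client("cloudformation"); the waiter raises if deletion fails or times out.
    """
    cf_client.delete_stack(StackName=stack_name)
    waiter = cf_client.get_waiter("stack_delete_complete")
    waiter.wait(
        StackName=stack_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
```

With boto3 installed and credentials configured, you would call `delete_stack_and_wait(boto3.client("cloudformation"), "amazon-devops-guru-connector-servicenow")`.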

You could also use the AWS CloudFormation Console to delete the stack:

AWS CloudFormation console with Delete option to clean up the deployed stack.

Figure 7: AWS Stack Console with Delete action

Conclusion

This blog post showed how DevOps Guru continuously monitors resources in a particular Region in your AWS account, automatically detects operational issues, predicts impending resource exhaustion, details likely causes, and recommends remediation actions. It also described a custom solution that uses a serverless integration pattern with AWS Lambda and Amazon EventBridge to integrate DevOps Guru insights with ServiceNow, a popular ITSM and change management tool, streamlining Service Management governance and oversight of AWS services. This solution helps customers who use ServiceNow improve their operational efficiency and get customized insights, real-time incident alerts, and incident management directly from DevOps Guru, providing a single pane of glass for restoring services and systems quickly.

This solution was created to help customers who already use ServiceNow Incident Management. If you are already using Incident Manager from AWS Systems Manager, check out how that works with Amazon DevOps Guru here.

To learn more about Amazon DevOps Guru, join us for a free hands-on Immersion Day. Events are virtual and hosted at three global time zones. Register here: April 12th.

Enabling branch deployments through IssueOps with GitHub Actions

2023-02-02 Grant Birkinbine

Post Syndicated from Grant Birkinbine original https://github.blog/2023-02-02-enabling-branch-deployments-through-issueops-with-github-actions/

At GitHub, the branch deploy model is ubiquitous: it is the standard way we ship code to production, and it has been for years. We released details about how we perform branch deployments with ChatOps all the way back in 2015.

We are able to use ChatOps to perform branch deployments for most of our repositories, but there are a few situations where ChatOps simply won’t work for us. What if developers want to leverage branch deployments but don’t have a full ChatOps stack integrated with their repositories? We set out to find a way for all developers to take advantage of branch deployments with ease, right from their GitHub repository, and so the branch-deploy Action was born!

Gif demonstrating how to use the branch-deploy Action.

How Does GitHub use this Action?

GitHub primarily uses ChatOps with Hubot to facilitate branch deployments where we can. If ChatOps isn’t an option, we use this branch-deploy Action instead. The majority of our use cases include Infrastructure as Code (IaC) repositories where we use Terraform to deploy infrastructure changes. GitHub uses this Action in many internal repositories and so does npm. There are also many other public, open source, and corporate organizations adopting this Action, as well, to help ship their code to production!

Understanding the branch deploy model

Before we dive into the branch-deploy Action, let’s first understand what the branch deploy model is and why it is so useful.

To really understand the branch deploy model, let’s first take a look at a traditional merge → deploy model. It goes like this:

1. Create a branch.
2. Add commits to your branch.
3. Open a pull request.
4. Gather feedback plus peer reviews.
5. Merge your branch.
6. A deployment starts from the main branch.

Diagram outlining the steps of the traditional deploy model, enumerated in the numbered list above.

Now, let’s take a look at the branch deploy model:

1. Create a branch.
2. Add commits to your branch.
3. Open a pull request.
4. Gather feedback plus peer reviews.
5. Deploy your change.
6. Validate.
7. Merge your branch to the main / master branch.

Diagram outlining the steps of the branch deploy model, enumerated in the list above.

The merge deploy model is inherently riskier because the main branch is never truly a stable branch. If a deployment fails, or we need to roll back, we follow the entire process again to roll back our changes. However, in the branch deploy model, the main branch is always in a “good” state and we can deploy it at any time to revert the deployment from a branch deploy. In the branch deploy model, we only merge our changes into main once the branch has been successfully deployed and validated.

Note: this is sometimes referred to as the GitHub flow.

Key concepts

Key concepts of the branch deploy model:

The main branch is always considered to be a stable and deployable branch.
All changes are deployed to production before they are merged to the main branch.
To roll back a branch deployment, you deploy the main branch.

By now you may be sold on the branch deploy methodology. How do we implement it? Introducing IssueOps with GitHub Actions!

IssueOps

The best way to define IssueOps is to compare it to something similar, ChatOps. You may be familiar with the concept, ChatOps, already; if not, here is a quick definition:

ChatOps is the process of interacting with a chat bot to execute commands directly in a chat platform. For example, with ChatOps you might do something like .ping example.org to check the status of a website.

IssueOps adopts the same mindset but through a different medium. Rather than using a chat service (Discord, Slack, etc.) to invoke the commands we use comments on a GitHub Issue or pull request. GitHub Actions is the runtime that executes our desired logic when an IssueOps command is invoked.

GitHub Actions

How does it work? This section will go into detail about how this Action works and hopefully inspire you to leverage it in your own projects. The full source code and further documentation can be found on GitHub.

Let’s walk through the process using the demo configuration of a branch-deploy Action below.

1. Create this file under `.github/workflows/branch-deploy.yml` in your GitHub repository:

name: "branch deploy demo"

# The workflow will execute on new comments on pull requests - example: ".deploy" as a comment
on:
  issue_comment:
    types: [created]

jobs:
  demo:
    if: ${{ github.event.issue.pull_request }} # only run on pull request comments (no need to run on issue comments)
    runs-on: ubuntu-latest
    steps:
      # Execute IssueOps branch deployment logic, hooray!
      # This will be used to "gate" all future steps below and conditionally trigger steps/deployments
      - uses: github/branch-deploy@vX.X.X # replace X.X.X with the version you want to use
        id: branch-deploy # it is critical you have an id here so you can reference the outputs of this step
        with:
          trigger: ".deploy" # the trigger phrase to look for in the comment on the pull request

      # Run your deployment logic for your project here - examples seen below

      # Checkout your project repository based on the ref provided by the branch-deploy step
      - uses: actions/[email protected]
        if: ${{ steps.branch-deploy.outputs.continue == 'true' }} # skips if the trigger phrase is not found
        with:
          ref: ${{ steps.branch-deploy.outputs.ref }} # uses the detected branch from the branch-deploy step

      # Do some fake "noop" deployment logic here
      # conditionally run a noop deployment
      - name: fake noop deploy
        if: ${{ steps.branch-deploy.outputs.continue == 'true' && steps.branch-deploy.outputs.noop == 'true' }} # only run if the trigger phrase is found and the branch-deploy step detected a noop deployment
        run: echo "I am doing a fake noop deploy"

      # Do some fake "regular" deployment logic here
      # conditionally run a regular deployment
      - name: fake regular deploy
        if: ${{ steps.branch-deploy.outputs.continue == 'true' && steps.branch-deploy.outputs.noop != 'true' }} # only run if the trigger phrase is found and the branch-deploy step detected a regular deployment
        run: echo "I am doing a fake regular deploy"

2. Trigger a noop deploy by commenting `.deploy noop` on a pull request.

A noop deployment is detected, so the Action sets its noop output to true. If you have the correct permissions to run the IssueOps command, the Action also sets the continue output to true. The step named fake noop deploy runs, while the fake regular deploy step is skipped.

3. After your noop deploy completes, you would typically comment `.deploy` to run the actual deployment, which executes the `fake regular deploy` step.
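The `if:` conditions in the demo workflow boil down to a small decision table. The Python sketch below models it under simplifying assumptions: the trigger must be the first word of the comment, `noop` must be the second word, and a single `has_permission` flag stands in for the Action's real checks (permissions, branch protection, and so on). It mirrors the demo workflow's step names but is not the Action's implementation.

```python
from dataclasses import dataclass


@dataclass
class BranchDeployOutputs:
    should_continue: bool  # mirrors the Action's `continue` output
    noop: bool  # mirrors the Action's `noop` output


def parse_comment(body: str, has_permission: bool,
                  trigger: str = ".deploy", noop_word: str = "noop") -> BranchDeployOutputs:
    """Simplified model of how a pull request comment becomes the continue/noop outputs."""
    words = body.strip().split()
    if not words or words[0] != trigger or not has_permission:
        return BranchDeployOutputs(should_continue=False, noop=False)
    return BranchDeployOutputs(should_continue=True,
                               noop=len(words) > 1 and words[1] == noop_word)


def steps_to_run(out: BranchDeployOutputs) -> list:
    """Which steps of the demo workflow would run for the given outputs."""
    if not out.should_continue:
        return []
    deploy_step = "fake noop deploy" if out.noop else "fake regular deploy"
    return ["actions/checkout", deploy_step]
```

For example, `.deploy noop` yields the checkout plus `fake noop deploy`, while an unrelated comment such as "LGTM" runs nothing past the branch-deploy step.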

Features

The best part about the branch-deploy Action is that it is highly customizable for any deployment targets and use cases. Here are just a few of the features that this Action comes bundled with:

Detects when IssueOps commands are used on a pull request.
Configurable: choose your command syntax, environment, noop trigger, base branch, reaction, and more.
Respects your branch protection settings configured for the repository.
Comments and reacts to your IssueOps commands.
Triggers GitHub deployments for you with simple configuration.
Deploy locks to prevent multiple deployments from clashing.
Configurable environment targets.

The repository also comes with a usage guide, which can be referenced by you and your team to quickly get familiar with available IssueOps commands and how they work.

Examples

The branch-deploy Action is customizable and suited for a wide range of projects. Here are a few examples of how you can use the branch-deploy Action to deploy to different services:

Conclusion

If you are looking to enhance your DevOps experience, have better reliability in your deployments, or ship changes faster, then branch deployments are for you!

Hopefully, you now have a better understanding of why the branch deploy model is a great option for shipping your code to production.

By combining GitHub, GitHub Actions, and IssueOps, you can use the branch deploy model in any repository!

Source code: GitHub

AWS Local Zones and AWS Outposts, choosing the right technology for your edge workload

2022-12-01 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/aws-local-zones-and-aws-outposts-choosing-the-right-technology-for-your-edge-workload/

This blog post is written by Joe Sacco, Senior Technical Account Manager.

The AWS Global Cloud Infrastructure includes 30 launched Regions, 96 Availability Zones (AZs), 410+ Points of Presence with 400+ Edge Locations, and 13 Regional Edge Caches. With over 200 AWS services, most customer workloads can run in the AWS Regions. However, for some location-sensitive workloads with low-latency or data residency requirements, and when an AWS Region isn’t close enough, AWS offers two additional infrastructure options: AWS Local Zones and AWS Outposts. Because Local Zones and Outposts solve similar problems, we’ll review use cases as well as the available services and features to help you decide which offering best suits your needs.

Let’s start with an overview of Local Zones and Outposts.

What are Local Zones?

Local Zones are a new type of infrastructure deployment that places AWS compute, storage, database, and other select AWS services in large metropolitan areas closer to end users. This gives you access to single-digit millisecond latency with the use of AWS Direct Connect and the ability to meet data residency requirements. Local Zones are also connected to their parent Region via AWS’s redundant and high bandwidth private network. This gives applications running in Local Zones fast, secure, and seamless access to a complete list of services in the parent Region.

Unlike Outposts, which you deploy within your datacenter or a co-location of your choice, Local Zones are owned, managed, and operated by AWS. Local Zones eliminate the need for you to manage power, connectivity, and capacity. Furthermore, you can provision workloads on a Local Zone from your AWS Management Console just as you would for AZs and Regions today.
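Beyond the console, you can also discover the Local Zones attached to a Region programmatically. The following Python sketch assumes a boto3 EC2 client for the parent Region; it uses the EC2 `describe_availability_zones` operation with the `zone-type` filter, and the helper name `list_local_zones` is hypothetical.

```python
from typing import Any, List, Tuple


def list_local_zones(ec2_client: Any) -> List[Tuple[str, str, str]]:
    """Return (zone name, zone group, opt-in status) for each Local Zone.

    `ec2_client` is assumed to be a boto3 EC2 client for the parent Region,
    for example boto3.client("ec2", region_name="us-west-2").
    """
    response = ec2_client.describe_availability_zones(
        # Also return zones the account has not opted in to yet.
        AllAvailabilityZones=True,
        # Keep only Local Zones (not regular AZs or Wavelength Zones).
        Filters=[{"Name": "zone-type", "Values": ["local-zone"]}],
    )
    return [
        (zone["ZoneName"], zone.get("GroupName", ""), zone.get("OptInStatus", ""))
        for zone in response["AvailabilityZones"]
    ]
```

For example, passing `boto3.client("ec2", region_name="us-west-2")` lists the Local Zones of US West (Oregon). Zones reported as `not-opted-in` must be enabled for your account before you can launch resources in them.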

What is Outposts?

Outposts is a family of fully managed solutions delivering AWS infrastructure and services to virtually any on-premises or edge location for a truly consistent hybrid experience. Outposts lets you run some AWS services locally and connect to a broad range of services available in the local AWS Region. Outposts comes in two types of offerings: Outposts rack and Outposts servers, with which you can run applications and workloads on-premises using the same AWS infrastructure, services, tools, and APIs as in AWS Regions.

The Outposts rack is available as an industry standard 42U form factor. It provides the same AWS infrastructure, services, tools, and APIs to your data center or co-location space that you would find in an AWS Region.

The Outposts servers come in a 1U or 2U form factor and are designed for locations that have limited space or smaller capacity requirements. Both support different compute instances, as detailed in the Outposts servers feature page.

Customer use cases

Now that we have an overview of both Local Zones and Outposts, let’s dive into use cases, the differences between them, and how your business can use each to meet your workload requirements.

Low latency

Customers today require low latency computing for workloads, such as medical imaging, transaction processing for Enterprise Resource Planning (ERP) applications, enterprise migration with hybrid architecture, real-time multiplayer gaming, telco network function virtualization, and regulated gaming workloads.

Outposts can meet ultra-low latency requirements. This is accomplished by bringing AWS services on premises and to the edge at Outpost Sites. An Outpost site is the physical location where your Outpost operates, and it can be local within one of your data centers or at a co-location facility of your choice.

When accessed from within the same metro, Local Zones provide low, single-digit millisecond latency when communicating with your applications. Latency between Local Zones and AWS Regions, or between Local Zones and on-premises environments, varies depending on how close the nearest Local Zone is and on the connection type used (public internet, VPN, or AWS Direct Connect). You should always choose the closest Local Zone location to achieve the lowest possible latency. For use cases such as mobile gaming, you can deploy your applications to the Local Zone location nearest to your end users. Local Zones are generally available in 17 metros across the US and 4 outside the US, and we are continuing to launch Local Zones in 30 cities across 25 countries. Check for updates on the general availability of additional Local Zones.

Data residency

On occasion, data must remain in a specific geographic region for regulatory or information security reasons. Healthcare and other regulated industries, such as financial services or Oil & Gas, have specific data residency requirements.

Outposts helps meet a customer’s data residency requirements because it’s installed on premises and essentially brings AWS to where the data currently resides. This allows you to pick and control where your workloads run, and where your data will stay. Check out the full list of countries and territories where Outposts is available on the FAQs page of Outposts rack and the FAQs page of Outposts servers.

Local Zones bring AWS closer to, or within, a customer’s geographic boundary as infrastructure that is fully owned and operated by AWS. Although Local Zones can help meet data residency requirements in some scenarios, these requirements vary by jurisdiction. Therefore, you should work closely with your compliance and information security teams when choosing the Local Zone location in which to deploy your regulated workloads.

Migration and modernization

Some workloads are challenging to migrate to the cloud and modernize. Often, on-premises applications are difficult to move into Regions because their components have latency-sensitive dependencies on one another. As these dependencies arise, you may choose to split the migration into smaller pieces, which in turn requires low-latency connectivity between the parts of the application that have moved and the parts that have not.

Outposts and Local Zones both allow for a gradual migration and modernization of your stack. You can migrate parts of your workloads while maintaining latency-sensitive connectivity between components until the entire workload is ready to move.

Factors in selecting Local Zones or Outposts

Choosing between Local Zones and Outposts will depend on the following factors, and you should examine all of them together when selecting a service for your use case.

Latency requirements

Local Zones can achieve low, single-digit millisecond latency when accessed within the same metro. Outposts, on the other hand, can meet ultra-low latency requirements when deployed within your data center or at a co-location facility of your choice. When selecting one over the other, work backward from your goals and workload requirements.

If you’re conducting a migration and modernization strategy that requires ultra-low latency between a workload’s application and database tiers, and those tiers are difficult to migrate to the AWS Regions, then Outposts would be the right solution for you.

Alternatively, if your workload involves streaming live broadcasts to end users which requires low single millisecond latency, but your end users are located where an AWS Region isn’t available, then Local Zones distributed across various metros would work best to serve your content.

Availability of services needed to support your workload

Local Zones and Outposts differ with their list of supported AWS services, and you must review your workload’s service requirements when determining the best fit for you. For example, if a customer has a computer vision workload that requires storing and retrieving large volumes of images locally using Amazon Simple Storage Service (Amazon S3), then Outposts and certain Local Zones meet this requirement while other Local Zones don’t. Learn how you can use Amazon S3 on Outposts for computer vision workloads.

Outposts rack and servers support different sets of AWS services locally. You can view comparisons between them, or visit the Outposts servers and Outposts rack feature sites for more details.

Local Zones’ features vary depending on the location in which you choose to deploy. You can view more details and a full list of supported features and services per location on our Local Zones features page.

Investment and management of infrastructure on-premises

Infrastructure management and prerequisites are another factor when considering which AWS service best suits your needs.

Outposts is ordered through AWS, and it requires installation in a customer’s on-premises data center or co-location provider of their choice. Outposts rack installation is handled by AWS, while Outposts servers installation is done by the customer or a third party of their choosing. There are power and redundant networking requirements for the Outpost site, as well as a required subscription to AWS Enterprise Support or AWS Enterprise On-Ramp Support.

Local Zones infrastructure is fully managed by AWS, including the power, networking, and capacity. This reduces operational management and overhead costs for customers. An Enterprise Support agreement isn’t required to use Local Zones.

You should always choose Regions or Local Zones if your use case allows, and use Outposts when a Region or Local Zone isn’t a good fit. If both Outposts and Local Zones fit a customer’s use case and requirements, then Local Zones will be the preferred choice.

Regulations, compliance, and information security

If a Local Zone is either unavailable or unable to meet your residency requirements within your geographic boundary, consider Outposts, which can be deployed to a data center or co-location facility of your choice. Data residency requirements can be a factor depending on your industry and the regulations to which your workload must adhere. In either case, work closely with your compliance and information security teams when choosing between Local Zones and Outposts.

Conclusion

Whether you’re dealing with latency-sensitive applications, data residency requirements, or a migration and modernization strategy, Local Zones and Outposts give you the flexibility to extend the same AWS infrastructure, services, APIs, and tools to metro areas and on-premises locations.

The decision of which technology to use depends on the factors discussed above. Work across teams within your organization to make sure that the latency requirements (low, single-digit millisecond latency within a metro for Local Zones vs. the ultra-low latency of Outposts when deployed close to or within your data center), data residency needs, installation prerequisites, and availability of services to support your workload are met.

Once these factors are taken into account, and you have made a choice, visit our product pages for Outposts and Local Zones with information on how you can get started.

Analyze Amazon Cognito advanced security intelligence to improve visibility and protection

2022-10-17 Diana Alvarado

Post Syndicated from Diana Alvarado original https://aws.amazon.com/blogs/security/analyze-amazon-cognito-advanced-security-intelligence-to-improve-visibility-and-protection/

As your organization looks to improve your security posture and practices, early detection and prevention of unauthorized activity quickly becomes one of your main priorities. The behaviors associated with unauthorized activity commonly follow patterns that you can analyze in order to create specific mitigations or feed data into your security monitoring systems.

This post shows you how to analyze security intelligence from Amazon Cognito advanced security features logs by using AWS native services. You can use the intelligence data provided by the logs to increase your visibility into user sign-in and sign-up activities. This can help you with monitoring and decision making, and you can feed the data to other security services in your organization, such as a web application firewall or security information and event management (SIEM) tool. The data can also enrich available security feeds like fraud detection systems, increasing protection for the workloads that you run on AWS.

Amazon Cognito advanced security features overview

Amazon Cognito provides authentication, authorization, and user management for your web and mobile apps. Your users can sign in to apps directly with a user name and password, or through a third party such as social providers or standard enterprise providers through SAML 2.0/OpenID Connect (OIDC). Amazon Cognito includes additional protections for users that you manage in Amazon Cognito user pools. In particular, Amazon Cognito can add risk-based adaptive authentication and also flag the use of compromised credentials. For more information, see Checking for compromised credentials in the Amazon Cognito Developer Guide.

With adaptive authentication, Amazon Cognito examines each user pool sign-in attempt and generates a risk score for how likely it is that the sign-in request comes from an unauthorized user. Amazon Cognito examines a number of factors, including whether the user has used the same device before or has signed in from the same location or IP address. A detected risk is rated as low, medium, or high, and you can determine what actions should be taken at each risk level. You can choose to allow or block the request, require a second authentication factor, or notify the user of the risk by email. Security teams and administrators can also submit feedback on the risk through the API, and users can submit feedback by using a link sent to their email. This feedback can improve the risk calculation for future attempts.

To add advanced security features to your existing Amazon Cognito configuration, you can get started by using the steps for Adding advanced security to a user pool in the Amazon Cognito Developer Guide. Note that there is an additional charge for advanced security features, as described on our pricing page. These features are applicable only to native Amazon Cognito users; they aren’t applicable to federated users who sign in with an external provider.

Solution architecture

Figure 1: Solution architecture

Figure 1 shows the high-level architecture for the advanced security solution. When an Amazon Cognito sign-in event is recorded by AWS CloudTrail, the solution uses an Amazon EventBridge rule to send the event to an Amazon Simple Queue Service (Amazon SQS) queue and batch it, to then be processed by an AWS Lambda function. The Lambda function uses the event information to pull the sign-in security information and send it as logs to an Amazon Simple Storage Service (Amazon S3) bucket and Amazon CloudWatch Logs.

Prerequisites and considerations for this solution

This solution assumes that you are using Amazon Cognito with advanced security features already enabled; the solution does not create a user pool and does not activate the advanced security features on an existing one.

The following list describes some limitations that you should be aware of for this solution:

This solution does not apply to events in the hosted UI, but the same architecture can be adapted for that environment, with some changes to the events processor.
The Amazon Cognito advanced security features support only native users. This solution is not applicable to federated users.
The admin API used in this solution has a default rate limit of 30 requests per second (RPS). If you have a higher rate of authentication attempts, this API call might be throttled, and you will need to implement a retry pattern to confirm that your requests are processed.
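One common way to implement that retry pattern is exponential backoff with jitter. The following Python sketch is generic and rests on assumptions: the wrapped call is expected to raise botocore-style errors that carry a `response["Error"]["Code"]` value, `TooManyRequestsException` (the throttling error Amazon Cognito user pool APIs return) and `ThrottlingException` are treated as retryable, and the helper names are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error codes treated as throttling. Cognito user pool APIs report throttling as
# TooManyRequestsException; ThrottlingException is common across other AWS APIs.
THROTTLE_CODES = {"TooManyRequestsException", "ThrottlingException"}


def is_throttled(exc: Exception) -> bool:
    """True if `exc` looks like a botocore ClientError with a throttling code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying throttled calls with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not throttling, and the final failure.
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time between 0 and base_delay * 2**attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise AssertionError("not reached")
```

You would wrap the throttled call in a lambda, for example `call_with_backoff(lambda: client.admin_list_user_auth_events(UserPoolId=pool_id, Username=username, MaxResults=5))`. As an alternative, boto3 clients accept a retry configuration such as `botocore.config.Config(retries={"mode": "adaptive", "max_attempts": 10})`, which adds client-side retries and rate limiting without custom code.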

Implement the solution

You can deploy the solution automatically by using the following AWS CloudFormation template.

Choose the following Launch Stack button to launch a CloudFormation stack in your account and deploy the solution.

You’ll be redirected to the CloudFormation service in the US East (N. Virginia) Region, which is the default AWS Region, to deploy this solution. You can change the Region to align it to where your Cognito User Pool is running.

This template will create multiple cloud resources including, but not limited to, the following:

An EventBridge rule for sending the Amazon Cognito events
An Amazon SQS queue for sending the events to Lambda
A Lambda function for getting the advanced security information based on the authentication events from CloudTrail
An S3 bucket to store the logs

In the wizard, you’ll be asked to modify or provide one parameter, the existing Cognito user pool ID. You can get this value from the Amazon Cognito console or the Cognito API.

Now, let’s break down each component of the solution in detail.

Sending the authentication events from CloudTrail to Lambda

Cognito advanced security features support the following CloudTrail events: SignUp, ConfirmSignUp, ForgotPassword, ResendConfirmationCode, InitiateAuth, and RespondToAuthChallenge. This solution focuses on the sign-in event InitiateAuth as an example.

The solution creates an EventBridge rule that runs when the event is identified in CloudTrail and sends the event to an SQS queue. The queue decouples event delivery from processing and lets Lambda process the events in batches.
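For reference, a rule that forwards only CloudTrail-recorded InitiateAuth calls could use an event pattern like the one below. This is a sketch under assumptions: the pattern is written as a Python dict, the rule the CloudFormation template actually creates may differ, and `matches` is a hypothetical local approximation for unit tests, not EventBridge's matching engine.

```python
from typing import Any, Dict

# Assumed EventBridge event pattern for Cognito InitiateAuth calls recorded by CloudTrail.
INITIATE_AUTH_PATTERN: Dict[str, Any] = {
    "source": ["aws.cognito-idp"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["cognito-idp.amazonaws.com"],
        "eventName": ["InitiateAuth"],
    },
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Local approximation of EventBridge matching for exact-value lists only.

    Real EventBridge patterns also support prefix, numeric, exists, and other
    operators; this helper handles only nested dicts and lists of literal values.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

You can serialize the pattern with `json.dumps(INITIATE_AUTH_PATTERN)` when creating the rule. Filtering on `eventName` at the rule level keeps unrelated Cognito API calls out of the queue, so the Lambda function receives only events it can process.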

The EventBridge rule uses Amazon SQS as a target. The queue is created by the solution and uses the default settings, with the exception that Receive message wait time is set to 20 seconds for long polling. For more information about long polling and how to manually set up an SQS queue, see Consuming messages using long polling in the Amazon SQS Developer Guide.

When the SQS queue receives the messages from EventBridge, these are sent to Lambda for processing. Let’s now focus on understanding how this information is processed by the Lambda function.

Using Lambda to process Amazon Cognito advanced security features information

In order to get the advanced security features evaluation information, you need authentication details that can only be obtained by using the Amazon Cognito identity provider (IdP) API call admin_list_user_auth_events. This API call requires a username to fetch all the authentication event details for a specific user. For security reasons, the username is not logged in CloudTrail and must be obtained by using other event information.

You can use the Lambda function in the sample solution to get this information. It’s composed of three main sequential actions:

1. The Lambda function gets the sub identifiers from the authentication events recorded by CloudTrail.
2. Each sub identifier is used to get the user name through an API call to list_users.
3. The sample function retrieves the last five authentication event details from advanced security features for each of these users by using the admin_list_user_auth_events API call. You can modify the function to retrieve a different number of events, or use other criteria such as a timestamp or a specific time period.
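The three steps above can be sketched as a Python Lambda handler. This is not the solution's actual function code: it assumes the SQS message body is the EventBridge event (with the CloudTrail record under `detail`), uses a hypothetical placeholder user pool ID, and calls the boto3 `cognito-idp` operations `list_users` and `admin_list_user_auth_events`. The client can be injected, which makes the logic testable without AWS credentials.

```python
import json
from typing import Any, Dict, List, Optional

USER_POOL_ID = "us-east-1_EXAMPLE"  # hypothetical placeholder; the stack takes the real ID as a parameter
MAX_EVENTS = 5  # the sample solution retrieves the last five events per user


def username_for_sub(client: Any, user_pool_id: str, sub: str) -> Optional[str]:
    """Step 2: resolve a sub identifier to a user name with ListUsers."""
    response = client.list_users(UserPoolId=user_pool_id, Filter=f'sub = "{sub}"', Limit=1)
    users = response.get("Users", [])
    return users[0]["Username"] if users else None


def handler(event: Dict[str, Any], context: Any = None, client: Any = None) -> List[Dict[str, Any]]:
    """Process a batch of SQS messages whose bodies are EventBridge events."""
    if client is None:
        import boto3  # imported lazily so the logic can run against a stub client

        client = boto3.client("cognito-idp")
    results: List[Dict[str, Any]] = []
    for record in event.get("Records", []):
        # Step 1: the SQS body is the EventBridge event; "detail" holds the CloudTrail record.
        detail = json.loads(record["body"]).get("detail", {})
        sub = detail.get("additionalEventData", {}).get("sub")
        if not sub:
            continue
        username = username_for_sub(client, USER_POOL_ID, sub)
        if username is None:
            continue
        # Step 3: fetch the most recent advanced security evaluations for this user.
        response = client.admin_list_user_auth_events(
            UserPoolId=USER_POOL_ID, Username=username, MaxResults=MAX_EVENTS
        )
        auth_events = response.get("AuthEvents", [])
        # Log only the risk evaluation; default=str serializes datetime values.
        print(json.dumps({"AuthEvents": auth_events}, default=str))
        results.append({"Username": username, "AuthEvents": auth_events})
    return results
```

The `sub` lookup is needed because CloudTrail does not record the user name, while `admin_list_user_auth_events` requires it.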

Getting the user name information from a CloudTrail event

The following sample authentication event shows a sub identifier in the CloudTrail event information, shown as sub under additionalEventData. With this sub identifier, you can use the ListUsers API call from the Cognito IdP SDK to get the user name details.

{
"eventVersion": "1.XX",
"userIdentity": {
"type": "Unknown",
"principalId": "Anonymous"
},
"eventTime": "2022-01-01T11:11:11Z",
"eventSource": "cognito-idp.amazonaws.com",
"eventName": "InitiateAuth",
"awsRegion": "us-east-1",
"sourceIPAddress": "xx.xx.xx.xx",
"userAgent": "Mozilla/5.0 (xxxx)",
"requestParameters": {
"authFlow": "USER_SRP_AUTH",
"authParameters": "HIDDEN_DUE_TO_SECURITY_REASONS",
"clientMetadata": {},
"clientId": "iiiiiiiii"
},
"responseElements": {
"challengeName": "PASSWORD_VERIFIER",
"challengeParameters": {
"SALT": "HIDDEN_DUE_TO_SECURITY_REASONS",
"SECRET_BLOCK": "HIDDEN_DUE_TO_SECURITY_REASONS",
"USER_ID_FOR_SRP": "HIDDEN_DUE_TO_SECURITY_REASONS",
"USERNAME": "HIDDEN_DUE_TO_SECURITY_REASONS",
"SRP_B": "HIDDEN_DUE_TO_SECURITY_REASONS"
}
},
"additionalEventData": {
"sub": "11110b4c-1f4264cd111"
},
"requestID": "xxxxxxxx",
"eventID": "xxxxxxxxxx",
"readOnly": false,
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "xxxxxxxxxxxxx",
"eventCategory": "Management"
}

Listing authentication events information

After the Lambda function obtains the username, it can then use the Cognito IdP API call admin_list_user_auth_events to get the advanced security feature risk evaluation information for each of the authentication events for that user. Let’s look into the details of that evaluation.

The authentication event information from Amazon Cognito advanced security includes the result for each category evaluated. You can use those results to decide whether to notify your security team or take action on an authentication attempt. We recommend that you limit the number of events returned to keep performance optimized.

The following sample event shows some of the risk information provided by advanced security features; the options for the response syntax can be found in the CognitoIdentityProvider API documentation.

{
"AuthEvents": [
{
"EventId": "1111111",
"EventType": "SignIn",
"CreationDate": 111111.111,
"EventResponse": "Pass",
"EventRisk": {
"RiskDecision": "NoRisk",
"CompromisedCredentialsDetected": false
},
"ChallengeResponses": [
{
"ChallengeName": "Password",
"ChallengeResponse": "Success"
}
],
"EventContextData": {
"IpAddress": "72.xx.xx.xx",
"DeviceName": "Firefox xx",
"City": "Axxx",
"Country": "United States"
}
}
]
}

The EventRisk object in the returned event information contains fields such as CompromisedCredentialsDetected, RiskDecision, and RiskLevel (not shown in this sample), which you can evaluate to decide whether the information can be used to enrich other security monitoring services.

Logging the authentication events information

You can use a Lambda extensions layer to send logs to an S3 bucket. Lambda still sends logs to Amazon CloudWatch Logs, but you can disable this activity by removing the required permissions to CloudWatch on the Lambda execution role. For more details on how to set this up, see Using AWS Lambda extensions to send logs to custom destinations.

Figure 2 shows an example of a log sent by Lambda. It includes execution information that is logged by the extension, as well as the information returned from the authentication evaluation by advanced security features.

Figure 2: Sample log information sent to S3

Note that the detailed authentication information in the Lambda execution log is the same as the preceding sample event. You can further enhance the information provided by the Lambda function by modifying the function code and logging more information during the execution, or by filtering the logs and focusing only on high-risk or compromised login attempts.
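A minimal Python sketch of the filtering mentioned above keeps only events whose EventRisk indicates a high risk level, an account takeover decision, or compromised credentials. The field names follow the AuthEvents response shown earlier; which conditions count as "high risk" and the helper names are assumptions you would tune to your own policy.

```python
from typing import Any, Dict, Iterable, List


def is_high_risk(auth_event: Dict[str, Any]) -> bool:
    """Flag an AuthEvents entry worth forwarding to other security tooling."""
    risk = auth_event.get("EventRisk") or {}
    return (
        risk.get("RiskLevel") == "High"
        or risk.get("RiskDecision") == "AccountTakeover"
        or risk.get("CompromisedCredentialsDetected") is True
    )


def high_risk_events(auth_events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the events that is_high_risk flags, preserving order."""
    return [event for event in auth_events if is_high_risk(event)]
```

Applying a filter like this before logging reduces log volume, at the cost of losing the low-risk baseline that some anomaly detection tools rely on.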

After the logs are in the S3 bucket, different applications and tools can use this information to perform automated security actions and configuration updates or provide further visibility. You can query the data from Amazon S3 by using Amazon Athena, feed the data to other services such as Amazon Fraud Detector as described in this post, mine the data by using artificial intelligence/machine learning (AI/ML) managed tools like Amazon Lookout for Metrics, or enhance visibility with AWS WAF.

Sample scenarios

You can start to gain insights into the security information provided by this solution in an existing environment by querying and visualizing the log data directly by using CloudWatch Logs Insights. For detailed information about how you can use CloudWatch Logs Insights with Lambda logs, see the blog post Operating Lambda: Using CloudWatch Logs Insights.

The CloudFormation template deploys the CloudWatch Logs Insights queries. You can view the queries for the sample solution in the Amazon CloudWatch console, under Queries.

To access the queries in the CloudWatch console

In the CloudWatch console, under Logs, choose Insights.
Choose Select log group(s). In the drop-down list, select the Lambda log group.
The query box should show the pre-created query. Choose Run query. You should then see the query results in the bottom-right panel.
(Optional) Choose Add to dashboard to add the widget to a dashboard.

CloudWatch Logs Insights discovers the fields in the auth event log automatically. As shown in Figure 3, you can see the available fields in the right-hand side Discovered fields pane, which includes the Amazon Cognito information in the event.

Figure 3: The fields available in CloudWatch Logs Insights

The first query, shown in the following code snippet, will help you get a view of the number of requests per IP, where the advanced security features have determined the risk decision as Account Takeover and the CompromisedCredentialsDetected as true.

fields @message
| filter @message like /INFO/
| filter AuthEvents.0.EventType like 'SignIn'
| filter AuthEvents.0.EventRisk.RiskDecision like "AccountTakeover" and AuthEvents.0.EventRisk.CompromisedCredentialsDetected != "false"
| stats count(*) as RequestsperIP by AuthEvents.0.EventContextData.IpAddress as IP
| sort RequestsperIP desc
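If you want to run these queries outside the console, for example on a schedule, the CloudWatch Logs API exposes them through the `start_query` and `get_query_results` operations. The following Python sketch assumes a boto3 `logs` client, a log group name you supply, and start and end times as epoch seconds; the polling interval, poll limit, and helper name are hypothetical choices.

```python
import time
from typing import Any, Callable, Dict, List

TERMINAL_FAILURES = {"Failed", "Cancelled", "Timeout"}


def run_insights_query(
    client: Any,
    log_group: str,
    query: str,
    start_time: int,
    end_time: int,
    poll_seconds: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, str]]:
    """Run a Logs Insights query and return its rows as field -> value dicts."""
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start_time,  # epoch seconds
        endTime=end_time,
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"query {query_id} ended with status {status}")
        sleep(poll_seconds)  # still Scheduled or Running
    raise TimeoutError(f"query {query_id} did not complete after {max_polls} polls")
```

The returned rows could then feed a report or a downstream control, such as a review of the IP addresses with the most flagged requests.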

You can view the results of the query as a table or graph, as shown in Figure 4.

Figure 4: Sample query results for CompromisedCredentialsDetected

Using the same approach and the convenient access to fields, you can explore another use case with the following query, which shows the number of requests per IP address for each type of event (SignIn, SignUp, and ForgotPassword) where the risk level was high.

fields @message
| filter @message like /INFO/
| filter AuthEvents.0.EventRisk.RiskLevel like "High"
| stats count(*) as RequestsperIP by AuthEvents.0.EventContextData.IpAddress as IP, AuthEvents.0.EventType as EventType
| sort RequestsperIP desc
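To make the aggregation concrete, the following Python sketch computes roughly what this query returns, using log entries already parsed into dicts with the AuthEvents shape shown earlier. It is an approximation under assumptions: it uses an exact comparison where the query's `like` performs a substring match, and, like the query, it inspects only the first element of AuthEvents. The helper name is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def high_risk_requests(entries: Iterable[Dict[str, Any]]) -> "Counter[Tuple[str, str]]":
    """Count high-risk events per (IP address, event type), using AuthEvents[0] only."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for entry in entries:
        events = entry.get("AuthEvents") or []
        if not events:
            continue
        first = events[0]  # mirrors the AuthEvents.0.* fields in the query
        if (first.get("EventRisk") or {}).get("RiskLevel") != "High":
            continue
        ip = (first.get("EventContextData") or {}).get("IpAddress", "unknown")
        counts[(ip, first.get("EventType", "unknown"))] += 1
    return counts
```

Calling `.most_common()` on the result gives the same descending order as sorting by RequestsperIP.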

Figure 5 shows the results for this EventType query.

Figure 5: The sample results for the EventType query

In the final sample scenario, you can look at event context data and query for the source of the events for which the risk level was high.

fields @message
| filter @message like /INFO/
| filter AuthEvents.0.EventRisk.RiskLevel like 'High'
| stats count(*) as RequestsperCountry by AuthEvents.0.EventContextData.Country as Country
| sort RequestsperCountry desc

Figure 6 shows the results for this RiskLevel query.

Figure 6: Sample results for the RiskLevel query

As you can see, there are many ways to mix and match the filters to extract deep insights, depending on your specific needs. You can use these examples as a base to build your own queries.

Conclusion

In this post, you learned how to use the security intelligence that Amazon Cognito advanced security features provide to improve your security posture and practices. You used a solution that takes CloudTrail logs as a source, processes the authentication events with a Lambda function, and sends the resulting evaluation information as logs to CloudWatch Logs and Amazon S3, where it can serve as an additional security feed for wider organizational monitoring and visibility. Through a set of sample use cases, you explored how to use CloudWatch Logs Insights to quickly access this information, aggregate it, gain deep insights, and take action.

To learn more, see the blog post How to Use New Advanced Security Features for Amazon Cognito User Pools.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

How to automatically build forensic kernel modules for Amazon Linux EC2 instances

2022-09-26 Jonathan Nguyen

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/how-to-automatically-build-forensic-kernel-modules-for-amazon-linux-ec2-instances/

In this blog post, we walk you through the EC2 forensic module factory solution, which deploys automation that builds the forensic kernel modules required for Amazon Elastic Compute Cloud (Amazon EC2) incident response automation.

When an EC2 instance is suspected to have been compromised, it’s strongly recommended to investigate what happened to the instance. You should look for activities such as:

Open network connections
List of running processes
Processes that contain injected code
Memory-resident infections
Other forensic artifacts

When an EC2 instance is compromised, it’s important to take action as quickly as possible. Before you shut down the EC2 instance, you first need to capture the contents of its volatile memory (RAM) in a memory dump because it contains the instance’s in-progress operations. This is key in determining the root cause of compromise.

In order to capture volatile memory in Linux, you can use a tool like Linux Memory Extractor (LiME). This requires you to have the kernel modules that are specific to the kernel version of the instance for which you want to capture volatile memory. We also recommend that you limit the actions you take on the instance where you are trying to capture the volatile memory in order to minimize the set of artifacts created as part of the capture process, so you need a method to build the tools for capturing volatile memory outside the instance under investigation. After you capture the volatile memory, you can use a tool like Volatility2 to analyze it in a dedicated forensics environment. You can use tools like LiME and Volatility2 on EC2 instances that use x86, x64, and Graviton instance types.

Prerequisites

This solution has the following prerequisites:

The target EC2 instance uses the Amazon Linux 2 operating system.
An AWS Identity and Access Management (IAM) role with permissions to deploy the required resources in an AWS account. More details about these permissions follow in the next section.

Solution overview

The EC2 forensic module factory solution consists of the following resources:

One AWS Step Functions workflow
Two AWS Lambda functions
One AWS Systems Manager document (SSM document)

Important: The SSM document clones the LiME and Volatility2 GitHub repositories, and these tools use version 2.0 of the GNU General Public License. This SSM document can be updated to include your preferred tools, like fmem or Volatility3, for forensic analysis and capture.
One Amazon Simple Storage Service (Amazon S3) bucket
One Amazon Virtual Private Cloud (Amazon VPC)
One security group for the EC2 instance that is provisioned during the automation
The solution uses the following VPC endpoints for AWS services:
- ec2_endpoint
- ec2_msg_endpoint
- kms_endpoint
- ssm_endpoint
- ssm_msg_endpoint
- s3_endpoint

Figure 1 shows an overview of the EC2 forensic module factory solution workflow.

Figure 1: Automation to build forensic kernel modules for an Amazon Linux EC2 instance

The EC2 forensic module factory solution workflow in Figure 1 includes the following numbered steps:

1. A Step Functions workflow is started, which creates a Step Functions task token and invokes the first Lambda function, createEC2module, to create EC2 forensic modules.
   a. A Step Functions task token is used to allow long-running processes to complete and to avoid a Lambda timeout error. The createEC2module function runs for approximately 9 minutes. The run time for the function can vary depending on any customizations to the createEC2module function or the SSM document.
2. The createEC2module function launches an EC2 instance based on the Amazon Machine Image (AMI) provided.
3. Once the EC2 instance is running, an SSM document is run, which includes the following steps:
   a. If a specific kernel version is provided in step 1, this kernel version is installed on the EC2 instance. If no kernel version is provided, the default kernel version on the EC2 instance is used to create the modules.
   b. If a specific kernel version was selected and installed, the system is rebooted to use this kernel version.
   c. The prerequisite build tools are installed, as well as the LiME and Volatility2 packages.
   d. The LiME kernel module and the Volatility2 profile are built.
4. The LiME kernel module and the Volatility2 profile are put into the S3 bucket.
5. Upon completion, the Step Functions task token is sent back to the Step Functions workflow, which invokes the second Lambda function, cleanupEC2module, to terminate the EC2 instance that was launched in step 2.
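The kernel-selection branch of the SSM document (steps 3a and 3b) can be summarized as a small planning function. This is a hypothetical Python sketch, not the solution's SSM document: the function names and step labels are illustrative, and it assumes the requested version uses the `kernel-<release>` package naming shown in the example workflow input in this post, where the release part matches `uname -r` output.

```python
from typing import List, Optional


def kernel_release(package: str) -> str:
    """Strip a leading 'kernel-' package prefix to get the `uname -r` style release."""
    prefix = "kernel-"
    return package[len(prefix):] if package.startswith(prefix) else package


def plan_build_steps(requested: Optional[str], running_release: str) -> List[str]:
    """Return the ordered build steps (hypothetical labels) for one module build."""
    steps: List[str] = []
    target = running_release
    if requested:
        # A specific kernel was requested: install it, then reboot into it.
        target = kernel_release(requested)
        steps.append(f"install kernel-{target}")
        steps.append("reboot")
    # With no request, the instance's default (running) kernel is used as-is.
    steps.append("install build tools, LiME, and Volatility2")
    steps.append(f"build LiME module and Volatility2 profile for {target}")
    steps.append("upload artifacts to S3")
    return steps
```

The reboot only appears when a kernel is requested, which mirrors why the run time grows when you pin a kernel version.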

Solution deployment

You can deploy the EC2 forensic module factory solution by using either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

Option 1: Deploy the solution with AWS CloudFormation (console)

Sign in to your preferred security tooling account in the AWS Management Console, and choose the Launch Stack button to open the AWS CloudFormation console pre-loaded with the template for this solution. The CloudFormation stack takes approximately 10 minutes to complete.

Option 2: Deploy the solution by using the AWS CDK

You can find the latest code for the EC2 forensic module factory solution in the ec2-forensic-module-factory GitHub repository, where you can also contribute to the sample code. For instructions and more information on using the AWS CDK, see Get Started with AWS CDK.

To deploy the solution by using the AWS CDK

To build the app, navigate to the project's root folder and run the following commands.
npm install -g aws-cdk
npm install
Run the following commands in your terminal while authenticated in your preferred security tooling AWS account. Be sure to replace <INSERT_AWS_ACCOUNT> with your account number, and replace <INSERT_REGION> with the AWS Region that you want the solution deployed to.
cdk bootstrap aws://<INSERT_AWS_ACCOUNT>/<INSERT_REGION>
cdk deploy

Run the solution to build forensic kernel objects

Now that you’ve deployed the EC2 forensic module factory solution, you need to invoke the Step Functions workflow in order to create the forensic kernel objects. The following is an example of manually invoking the workflow, to help you understand what actions are being performed. These actions can also be integrated and automated with an EC2 incident response solution.

To manually invoke the workflow to create the forensic kernel objects (console)

In the AWS Management Console, sign in to the account where the solution was deployed.
In the AWS Step Functions console, select the state machine named create_ec2_volatile_memory_modules.
Choose Start execution.
At the input prompt, enter the following JSON values.
{ "AMI_ID": "ami-0022f774911c1d690", "kernelversion":"kernel-4.14.104-95.84.amzn2.x86_64" }
Choose Start execution to start the workflow, as shown in Figure 2.

Figure 2: Step Functions step input example to build custom kernel version using Amazon Linux 2 AMI ID
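If you start the workflow from your own automation instead of the console, you need to produce the same JSON input. The following Python sketch builds it; the AMI ID format check is an illustrative assumption, and so is the choice to omit `kernelversion` when no specific kernel is wanted (check the deployed state machine definition for how it treats a missing or empty value).

```python
import json
import re

# Illustrative format check: "ami-" followed by 8 or 17 lowercase hex characters.
AMI_PATTERN = re.compile(r"^ami-[0-9a-f]{8}([0-9a-f]{9})?$")


def build_execution_input(ami_id: str, kernel_version: str = "") -> str:
    """Return the JSON input for the create_ec2_volatile_memory_modules state machine."""
    if not AMI_PATTERN.match(ami_id):
        raise ValueError(f"unexpected AMI ID format: {ami_id}")
    payload = {"AMI_ID": ami_id}
    if kernel_version:
        # Assumption: omitting the key means "use the AMI's default kernel".
        payload["kernelversion"] = kernel_version
    return json.dumps(payload)
```

You could pass the returned string as the `input` argument of the boto3 Step Functions client's `start_execution` call, together with the state machine ARN of your deployment.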

Workflow progress

You can use the AWS Management Console to follow the progress of the Step Functions workflow. If the workflow is successful, the execution status should look similar to Figure 3.

Figure 3: Step Functions workflow success example

Note: The Step Functions workflow run time depends on the commands that are being run in the SSM document. The example SSM document included in this post runs for approximately 9 minutes. For information about possible Step Functions errors, see Error handling in Step Functions.

To verify that the artifacts are built

After the Step Functions workflow has successfully completed, go to the S3 bucket that was provisioned in the EC2 forensic module factory solution.
Look for two prefixes in the bucket, one for LiME and one for Volatility2, as shown in Figure 4.

Figure 4: S3 bucket prefix for forensic kernel modules
Open each tool name prefix in S3 to find the actual module, such as in the following examples:
- LiME example: lime-4.14.104-95.84.amzn2.x86_64.ko
- Volatility2 example: 4.14.104-95.84.amzn2.x86_64.zip
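The object names above follow a simple pattern derived from the kernel release, so downstream automation can compute the keys it expects. This Python sketch uses the file-name patterns from the examples; the prefix names `lime` and `volatility2` are hypothetical placeholders for the two tool prefixes in your bucket.

```python
from typing import Dict


def module_keys(kernel_release: str,
                lime_prefix: str = "lime",
                vol_prefix: str = "volatility2") -> Dict[str, str]:
    """Map a `uname -r` kernel release to the expected S3 object keys.

    File names follow the examples in this post; the prefixes are placeholders.
    """
    return {
        "lime": f"{lime_prefix}/lime-{kernel_release}.ko",
        "volatility2": f"{vol_prefix}/{kernel_release}.zip",
    }
```

On the target instance, the release string is the output of `uname -r`; in Python on Linux, `platform.release()` returns the same value.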

Now that the objects have been created, the solution has successfully completed.

Incorporate forensic module builds into an EC2 AMI pipeline

Each organization has specific requirements for allowing application teams to use various EC2 AMIs, and organizations commonly implement an EC2 image pipeline using tools like EC2 Image Builder. EC2 Image Builder uses recipes to install and configure required components in the AMI before application teams can launch EC2 instances in their environment.

The EC2 forensic module factory solution we implemented here uses an existing EC2 AMI. As mentioned, the solution uses an SSM document to create forensic modules. The logic in the SSM document could be incorporated into your EC2 image pipeline to create the forensic modules and store them in an S3 bucket. S3 also allows additional layers of protection, such as enforcing default bucket encryption with an AWS Key Management Service (AWS KMS) customer managed key, verifying S3 object integrity with checksums, S3 Object Lock, and restrictive S3 bucket policies. These protections can help you ensure that your forensic modules have not been modified and are only accessible by authorized entities.

Note that incorporating forensic module creation into an EC2 AMI pipeline builds forensic modules only for the kernel version used in that AMI. You still need this EC2 forensic module factory solution to build a specific forensic module version if it is missing from the S3 bucket where you create and store these modules. This can happen, for example, if the kernel on an EC2 instance is updated after the AMI was created.

Incorporate the solution into existing EC2 incident response automation

There are many existing solutions to automate incident response workflow for quarantining and capturing forensic evidence for EC2 instances, but the majority of EC2 incident response automation solutions have a single dependency in common, which is the use of specific forensic modules for the target EC2 instance kernel version. The EC2 forensic module factory solution in this post enables you to be both proactive and reactive when building forensic kernel modules for your EC2 instances.

You can use the EC2 forensic module factory solution in two different ways:

Ad-hoc – In this post, you walked through the solution by running the Step Functions workflow with specific parameters. You can do this to build a repository of kernel modules.
Automated – Alternatively, you can incorporate this solution into existing automation by invoking the Step Functions workflow and passing the AMI ID and kernel version. An example could be the following:
1. An existing EC2 incident response solution attempts to get the forensic modules to capture the volatile memory from an S3 bucket.
2. If the modules for the specific kernel version are missing from the S3 bucket, the solution calls StartExecution on the create_ec2_volatile_memory_modules state machine.
3. The Step Functions workflow builds the specific forensic modules.
4. After the Step Functions workflow is complete, the EC2 incident response solution restarts its workflow to get the forensic modules to capture the volatile memory on the EC2 instance.
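The check-then-build decision in steps 1 and 2 can be sketched as follows. This is a hypothetical Python illustration: `object_exists` and `start_build` are injected callables so the logic can be tested without AWS, and in practice they would wrap an S3 existence check and a Step Functions `StartExecution` call. The key template is an assumption based on the LiME file-name example in this post.

```python
from typing import Callable


def ensure_modules(kernel_release: str,
                   object_exists: Callable[[str], bool],
                   start_build: Callable[[str], None],
                   lime_key_template: str = "lime/lime-{release}.ko") -> bool:
    """Return True if the LiME module is already in S3; otherwise start a build.

    A False return tells the caller to retry after the build workflow completes.
    """
    key = lime_key_template.format(release=kernel_release)
    if object_exists(key):
        return True
    # Module missing: ask the module factory to build it for this kernel.
    start_build(kernel_release)
    return False
```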

Now that you have the kernel modules, you can capture the volatile memory by using LiME and then analyze the memory dump by using a Volatility2 profile.

To capture and analyze volatile memory on the target EC2 instance (high-level steps)

Copy the LiME module from the S3 bucket holding the module repository to the target EC2 instance.
Capture the volatile memory by using the LiME module.
Stream the volatile memory dump to an S3 bucket.
Launch an EC2 forensic workstation instance, with Volatility2 installed.
Copy the Volatility2 profile from the S3 bucket to the appropriate location.
Copy the volatile memory dump to the EC2 forensic workstation.
Run analysis on the volatile memory with Volatility2 by using the specific Volatility2 profile created for the target EC2 instance.
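The capture step loads the LiME module with `insmod`, passing LiME's `path` and `format` parameters as module options. The following Python sketch builds that command as an argument list; the module and output paths are illustrative, and `path` can be either a file or `tcp:<port>`, in which case LiME serves the dump to a client that connects to that port.

```python
import shlex
from typing import List

# Output formats accepted by LiME's `format` parameter.
VALID_FORMATS = {"raw", "padded", "lime"}


def lime_insmod_command(module_path: str, output: str, fmt: str = "lime") -> List[str]:
    """Build the argv that loads LiME and writes a memory dump to `output`."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unsupported LiME format: {fmt}")
    # LiME reads its options as module parameters, passed after the module path.
    return ["insmod", module_path, f"path={output} format={fmt}"]


if __name__ == "__main__":
    cmd = lime_insmod_command("/tmp/lime-4.14.104-95.84.amzn2.x86_64.ko", "/tmp/memory.lime")
    print(shlex.join(cmd))  # shell-safe form, for example for an SSM Run Command
```

Keeping the command as a list lets you pass it to `subprocess.run` without invoking a shell.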

Automated self-service AWS solution

AWS has also released the Automated Forensics Orchestrator for Amazon EC2 solution that you can use to quickly set up and configure a dedicated forensics orchestration automation solution for your security teams. The Automated Forensics Orchestrator for Amazon EC2 allows you to capture and examine the data from EC2 instances and attached Amazon Elastic Block Store (Amazon EBS) volumes in your AWS environment. This data is collected as forensic evidence for analysis by the security team.

The Automated Forensics Orchestrator for Amazon EC2 creates the foundational components to enable the EC2 forensic module factory solution's memory forensic acquisition workflow and forensic investigation and reporting service. The Automated Forensics Orchestrator for Amazon EC2 and the EC2 forensic module factory are hosted in different GitHub projects, so you will need to reconcile the expected S3 bucket locations for the associated modules:

Automated Forensics Orchestrator for Amazon EC2 modules: S3 bucket location for LiME and S3 bucket location for Volatility2
EC2 forensic module factory modules: S3 bucket location for LiME and S3 bucket location for Volatility2

Customize the EC2 forensic module factory solution

The SSM document pulls open-source packages to build tools for the specific Linux kernel version. You can update the SSM document to your specific requirements for forensic analysis, including expanding support for other operating systems, versions, and tools.

You can also update the S3 object naming convention and object tagging, to allow external solutions to reference and copy the appropriate kernel module versions to enable the forensic workflow.
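S3 object tags are supplied to `PutObject` as a URL-encoded query string (for example, through the `Tagging` parameter of boto3's `put_object`). The following Python sketch encodes a tag set for a module; the tag keys are hypothetical examples of metadata that external solutions might filter on.

```python
from urllib.parse import urlencode


def module_tagging(kernel_release: str, tool: str, arch: str) -> str:
    """Encode hypothetical module tags as the query string S3 expects for Tagging."""
    return urlencode({"tool": tool, "kernel-release": kernel_release, "arch": arch})
```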

Clean up

If you deployed the EC2 forensic module factory solution by using the Launch Stack button in the AWS Management Console or the CloudFormation template ec2_module_factory_cfn, do the following to clean up:

In the AWS CloudFormation console for the account and Region where you deployed the solution, choose the Ec2VolModules stack.
Choose the option to Delete the stack.

If you deployed the solution by using the AWS CDK, run the following command.

cdk destroy

Conclusion

In this blog post, we walked you through the deployment and use of the EC2 forensic module factory solution to use AWS Step Functions, AWS Lambda, AWS Systems Manager, and Amazon EC2 to create specific versions of forensic kernel modules for Amazon Linux EC2 instances.

The solution provides a framework to create the foundational components required in an EC2 incident response automation solution. You can customize the solution to your needs to fit into an existing EC2 automation, or you can deploy this solution in tandem with the Automated Forensics Orchestrator for Amazon EC2.

If you have feedback about this post, submit comments in the Comments section below. If you have any questions about this post, start a thread on re:Post.

Want more AWS Security news? Follow us on Twitter.

Automatic rule backtesting with large quantities of data

2022-09-08 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/automatic-rule-backtesting

Introduction

Analysts need to analyse and simulate a rule on historical data to check the performance and accuracy of the rule. Backtesting enables analysts to run simulations of the rules and manage the results from the rule engine UI.

Backtesting helps analysts to:

Define the desired impact of the rule for our business and users.
Evaluate the accuracy of the rule based on historical data.
Compare and analyse results with data points, such as known false positives, user segments, risk profile of a user or transaction, and so on.

Currently, the analytics process to test performance of a rule is not standardised, and is inaccurate and inefficient. Analysts from different teams have different approaches:

Offline process using Presto tables. This process is lengthy and inaccurate.
Offline process based on the rule engine payload. The setup takes time, and the process is not streamlined.
Running rules in shadow mode. This process takes days to get the desired result.
A team in Grab uses different rule engines to manage rules and do backtesting. This doubles the effort for analysts and engineers.

In our vision for backtesting, it should allow analysts to:

Efficiently run and manage their jobs.
Create custom metrics, reports and dimensions for backtesting.
Add external data points and metrics to do a deep dive.

For the purpose of establishing a minimum viable product (MVP), backtesting will support basic capabilities and enable analysts to access required metrics and data points. Thus, analysts can:

Run backtesting jobs from the rule engine UI.
Get fixed reports and dimensions for every checkpoint.
Get access to relevant data to analyse backtesting results.

Background

Assume a simple use case: a rule that detects transaction risk.

Each transaction has a transaction_id, user_id, currency, amount, and timestamp. The rule engine also provides a treatment (Approve or Decline) for the transaction based on the rule logic.

In this specific use case, we want to see the total number of transactions, the number of distinct users, and the total amount, grouped by the dimensions date, treatment, and currency, over the last couple of weeks.

The result may look like the following data:

Date (dimension)	Treatment (dimension)	Currency (dimension)	Total tx (metric)	Distinct users (metric)	Total amount (metric)
2020-05-1	Approve	SGD	100	80	10020
2020-05-1	Decline	SGD	50	40	450
2020-05-1	Approve	MYR	110	100	1200
2020-05-1	Decline	MYR	30	15	400

* This data does not reflect actual Grab data and is for illustrative purposes only.
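The aggregation behind a report like this is a group-by over three dimensions with three metrics. The following plain-Python sketch shows the computation on a small in-memory list; in the real system it runs as a Spark job, and the `Txn` fields are an assumption that uses a day-truncated `date` in place of the raw timestamp.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str, str]  # (date, treatment, currency)


@dataclass(frozen=True)
class Txn:
    transaction_id: str
    user_id: str
    currency: str
    amount: float
    date: str       # timestamp already truncated to a day, e.g. "2020-05-01"
    treatment: str  # "Approve" or "Decline", as returned by the rule


def aggregate(txns: Iterable[Txn]) -> Dict[Key, Tuple[int, int, float]]:
    """Group by (date, treatment, currency) -> (total tx, distinct users, total amount)."""
    counts: Dict[Key, int] = defaultdict(int)
    users: Dict[Key, Set[str]] = defaultdict(set)
    totals: Dict[Key, float] = defaultdict(float)
    for t in txns:
        key = (t.date, t.treatment, t.currency)
        counts[key] += 1
        users[key].add(t.user_id)
        totals[key] += t.amount
    return {k: (counts[k], len(users[k]), totals[k]) for k in counts}
```

In PySpark, the equivalent would be a `groupBy` on the three dimension columns followed by `agg` with `count`, `countDistinct`, and `sum`.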

Solution

Use a cloud-agnostic Spark-based data pipeline to replay any existing or proposed rule to check performance.
Use a Web Portal to:
- Create or select a rule to replay, with replay time range.
- Display and download the result, such as total events and hit counts.
Replay any existing or proposed rule for checking performance.
Allow users to create or select a rule to replay in the rule engine UI, with provided replay time range.
Display the replay result in the rule engine UI, such as total events and hit counts.
Provide a way to download all testing results in the rule engine UI (for example, all rule responses).
Remove dependency on the specific cloud provider stack, so other teams in Grab can use it instead of Google Cloud Platform (GCP).

Architecture details

The rule editor UI reacts to user input. Its engine sends a job command to an Amazon Simple Queue Service (SQS) queue to initialise the job. After that, the following processes run in the background:

Lambda listens to the request SQS queue and invokes a job via the Spark jobs API.
The job fetches the executable artifacts and the data source. After the job is completed, the job script saves the result sheet to S3 as required.
The Spark script pushes the job's final status (success, failure, or timeout) to the response SQS queue through a shutdown hook.
The rule editor engine listens to response callback messages, and processes the job metadata to the database, or sends notifications.
The rule editor displays the job metadata on the UI.
The package pipeline builds and deploys the executable artifacts to S3 as a manageable structure.
The Spark script takes the filter logic as its input parameters.
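The Lambda step that listens to the request queue can be sketched as a handler that reads job commands from the SQS event's `Records` and submits one job per record. This is a hypothetical Python illustration: the job command fields are assumptions, and `submit_job` stands in for the call to the Spark jobs API, which the post does not describe.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical job command fields; the real message schema is not published.
REQUIRED = ("job_id", "rule_id", "start_time", "end_time")


def handle_sqs_event(event: Dict[str, Any],
                     submit_job: Callable[[Dict[str, str]], str]) -> List[str]:
    """Read job commands from SQS records, submit each one, and return the run IDs."""
    run_ids: List[str] = []
    for record in event.get("Records", []):
        command = json.loads(record["body"])
        missing = [f for f in REQUIRED if f not in command]
        if missing:
            raise ValueError(f"job command missing fields: {missing}")
        # The rule's filter logic travels as job parameters, so the Spark script stays generic.
        run_ids.append(submit_job({f: str(command[f]) for f in REQUIRED}))
    return run_ids
```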

Workflow

Historical data preparation

The rule engine publishes historical events through Kafka, and they are stored in the S3 bucket, partitioned by time. The Backtesting system then fetches this data based on the requested time range.

By using a Kubernetes stream pipeline, we also save the trust inference stream to the Trust AWS sub-account. Using a custom bucket and file format improves the efficiency of data processing and avoids any delay from the data lake.

Engineering specifications

Target location:

    s3a://stg-trust-inference-event/<engine-name>/<predict-name>/<YYYY>/MM/DD/hh/mm/ss/<000001>.snappy.parquet
    s3a://prd-trust-inference-event/<engine-name>/<predict-name>/<YYYY>/MM/DD/hh/mm/ss/<000001>.snappy.parquet

Description: Following the fields of the stream definition, the engine name would be ruleengine or catwalk. The predict-name would be preride (checkpoint name) or cnpu (model name).

File Format: avro
File Compression: Snappy
There is no automatic retention policy on the sub-account S3 bucket. We will implement the archive process in the future.
The default pipeline and the new pipeline will run in parallel until the Data Engineering team is ready to retire the default pipeline.
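To fetch events for a requested time range, the job only needs the prefixes for the hours that the range covers. The following Python sketch derives hour-level prefixes from the target location template; it assumes zero-padded path segments, which the `MM/DD/hh` placeholders suggest, and treats the range as start-inclusive and end-exclusive.

```python
from datetime import datetime, timedelta
from typing import List


def hourly_prefixes(env: str, engine: str, predict: str,
                    start: datetime, end: datetime) -> List[str]:
    """List the hour-level prefixes that cover [start, end)."""
    if env not in ("stg", "prd"):
        raise ValueError("env must be 'stg' or 'prd'")
    base = f"s3a://{env}-trust-inference-event/{engine}/{predict}"
    # Round down to the start of the hour so a partial first hour is included.
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes: List[str] = []
    while current < end:
        prefixes.append(f"{base}/{current:%Y/%m/%d/%H}/")
        current += timedelta(hours=1)
    return prefixes
```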

Backtesting

Upon scheduling, the Backtesting Portal sends a message to SQS, which is then captured by the listening Lambda.
Lambda invokes a Spark job on Amazon EMR.
The EMR engine fetches the executable artifacts containing the rule script and historical data from S3, and starts a Spark job to apply the rule script over historical data. Depending on the size of data, the Spark cluster will scale automatically to ensure timely completion.
Once completed, a report file is generated and available on Backtesting UI.
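One way for a Lambda function to start such a job on an existing EMR cluster is to add a step that runs `spark-submit` through `command-runner.jar`. The post does not say which EMR API Grab uses, so treat this Python sketch as one possible approach: it builds a step definition in the shape accepted by boto3's `add_job_flow_steps`, and the script arguments (`--rule`, `--input`, `--output`) are hypothetical.

```python
from typing import Any, Dict, List


def spark_step(job_id: str, script_uri: str, rule_uri: str,
               input_prefixes: List[str], output_uri: str) -> Dict[str, Any]:
    """Build one EMR step that runs a backtesting Spark script via spark-submit."""
    args = ["spark-submit", "--deploy-mode", "cluster", script_uri,
            "--rule", rule_uri, "--output", output_uri]
    for prefix in input_prefixes:
        args += ["--input", prefix]  # one flag per historical-data prefix
    return {
        "Name": f"backtest-{job_id}",
        "ActionOnFailure": "CONTINUE",  # a failed backtest should not stop the cluster
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }
```

The resulting dictionary would be passed as an element of the `Steps` list, together with the cluster's `JobFlowId`.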

UI

Learnings and conclusions

After the release, here's what our data analysts had to say:

For trust analysts, testing a rule on historical data happens outside the rule engine UI and is not user-friendly, leading to analysts wasting significant time.
For financial analysts, as analysts migrate to the rule engine UI, the existing solution will be deprecated with no other solution.
An alternative way to simulate a rule: we no longer need to run a rule in shadow mode because we can use historical data to determine the outcome. This new approach saves us weeks of effort in the rule onboarding process.

What’s next?

The underlying Spark jobs in this tool were developed by experienced data engineers, so modifying the analytics requires a high level of expertise. To mitigate this restriction, we are looking into using a domain-specific language (DSL) to allow users to input desired attributes and dimensions, and into providing a job release pipeline for self-service jobs.

Thanks to Jia Long Loh for the support on the offline infrastructure engineering.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How to automate updates for your domain list in Route 53 Resolver DNS Firewall

2022-09-01 Guillaume Neau

Post Syndicated from Guillaume Neau original https://aws.amazon.com/blogs/security/how-to-automate-updates-for-your-domain-list-in-route-53-resolver-dns-firewall/

Note: This post includes links to third-party websites. AWS is not responsible for the content on those websites.

Following the release of Amazon Route 53 Resolver DNS Firewall, Amazon Web Services (AWS) published several blog posts to help you protect your Amazon Virtual Private Cloud (Amazon VPC) DNS resolution, including How to Get Started with Amazon Route 53 Resolver DNS Firewall for Amazon VPC and Secure your Amazon VPC DNS resolution with Amazon Route 53 Resolver DNS Firewall. Route 53 Resolver DNS Firewall provides managed domain lists that are fully maintained and kept up-to-date by AWS and that directly benefit from the threat intelligence that we gather, but you might want to create or import your own list to have full control over the DNS filtering.

In this blog post, you will find a solution to automate the management of your domain list by using AWS Lambda, Amazon EventBridge, and Amazon Simple Storage Service (Amazon S3). The solution in this post uses, as an example, the URLhaus open Response Policy Zone (RPZ) list, which generates a new file every five minutes.

Architecture overview

The solution consists of the following four components, as shown in Figure 1.

An EventBridge scheduled rule to invoke the Lambda function on a schedule.
A Lambda function that uses the AWS SDK to perform the automation logic.
An S3 bucket to temporarily store the list of domains retrieved.
Amazon Route 53 Resolver DNS Firewall.

Figure 1: Architecture overview

After the solution is deployed, it works as follows:

The scheduled rule invokes the Lambda function every 5 minutes to fetch the latest domain list available.
The Lambda function fetches the list from URLhaus, parses and formats the retrieved data, uploads the list of domains into the S3 bucket, and calls the Route 53 Resolver DNS Firewall ImportFirewallDomains API action.
The domain list is then updated.
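The parse-and-format step turns the RPZ zone file into the plain text file that ImportFirewallDomains expects, with one domain per line. The following Python sketch assumes each blocked entry is a line of the form `<domain> CNAME .`, optionally followed by a `;` comment, and that directives such as `$TTL` and the `@` SOA record can be skipped; records that include an explicit TTL or class would need a more complete parser.

```python
from typing import List


def parse_rpz(text: str) -> List[str]:
    """Extract unique domain names from RPZ lines of the form '<domain> CNAME .'."""
    domains: List[str] = []
    seen = set()
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith(("$", "@")):
            continue  # skip blank lines, $TTL directives, and the SOA record
        parts = line.split()
        if len(parts) >= 3 and parts[1].upper() == "CNAME" and parts[2] == ".":
            domain = parts[0].rstrip(".").lower()
            if domain not in seen:
                seen.add(domain)
                domains.append(domain)
    return domains


def to_import_file(domains: List[str]) -> str:
    """Render the domain list as a text file with one domain per line."""
    return "\n".join(domains) + "\n" if domains else ""
```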

Implementation steps

As a first step, create your own domain list on the Route 53 Resolver DNS Firewall. Having your own domain list allows you to have full control of the list of domains to which you want to apply actions, as defined within rule groups.

To create your own domain list

In the Route 53 console, in the left menu, choose Domain lists in the DNS firewall section.
Choose the Add domain list button, enter a name for your domain list, and then enter a placeholder domain to initialize the domain list.
Choose Add domain list to finalize the creation of the domain list.

Figure 2: Expected view of the console

The list from URLhaus contains more than a thousand records. You will use the ImportFirewallDomains API action to upload this list to DNS Firewall. ImportFirewallDomains requires that you first upload the list of domains to an S3 bucket that is located in the same AWS Region as the domain list that you just created.
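The import call itself takes three arguments: the domain list ID, the operation, and the S3 URL of the uploaded file. The following Python sketch builds them in the shape used by boto3's `route53resolver` client `import_firewall_domains`; the timestamped object key layout is a hypothetical choice, and `REPLACE` replaces the list's current contents with the domains in the file.

```python
from datetime import datetime, timezone
from typing import Dict


def import_request(domain_list_id: str, bucket: str, when: datetime) -> Dict[str, str]:
    """Build the arguments for an ImportFirewallDomains call (hypothetical key layout)."""
    key = f"rpz/{when.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}.txt"
    return {
        "FirewallDomainListId": domain_list_id,
        "Operation": "REPLACE",  # the file's domains replace the list's contents
        "DomainFileUrl": f"s3://{bucket}/{key}",
    }
```

After uploading the file to that key, you would call `import_firewall_domains(**import_request(...))` with a client created in the same Region as the bucket and the domain list.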

To create the S3 bucket

In the S3 console, choose Create bucket.
Under General configuration, configure the AWS Region option to be the same as the Region in which you created your domain list.
Finalize the configuration of your S3 bucket, and then choose Create bucket.

Because a new file is created every five minutes, we recommend setting a lifecycle rule to automatically expire and delete files after 24 hours to optimize for cost and only save the most recent lists.
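A 24-hour expiration corresponds to a lifecycle rule with `Days` set to 1. The following Python sketch builds that configuration in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the rule ID and the empty prefix (which applies the rule to the whole bucket) are illustrative choices. S3 rounds expiration times to the next midnight UTC, so objects are removed between one and two days after creation.

```python
from typing import Any, Dict


def expiration_lifecycle(days: int = 1, prefix: str = "") -> Dict[str, Any]:
    """Lifecycle configuration that expires objects `days` after creation."""
    if days < 1:
        raise ValueError("S3 expiration must be at least 1 day")
    return {
        "Rules": [{
            "ID": f"expire-domain-lists-after-{days}d",
            "Filter": {"Prefix": prefix},  # empty prefix: every object in the bucket
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }]
    }
```

You would pass the result as the `LifecycleConfiguration` argument, together with the bucket name.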

To create the Lambda function

Follow the steps in the topic Creating an execution role in the IAM console to create an execution role. After step 4, when you configure permissions, choose Create Policy, and then create and add an IAM policy similar to the following example. This policy needs to:

Allow the Lambda function to put logs in Amazon CloudWatch.
Allow the Lambda function to have read and write access to objects placed in the created S3 bucket.
Allow the Lambda function to update the firewall domain list.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:<region>:<accountId>:*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::<DNSFW-BUCKET-NAME>/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "route53resolver:ImportFirewallDomains"
            ],
            "Resource": "arn:aws:route53resolver:<region>:<accountId>:firewall-domain-list/<domain-list-id>",
            "Effect": "Allow"
        }
    ]
}

(Optional) If you decide to use the example provided by AWS:
- After cloning the repository, build the layer by following the instructions in the readme.md file and using the provided script.
- Zip the Lambda function code.
- In the Lambda console, in the left menu, choose Layers, and then choose Create layer. Enter a name for the layer, select Upload a .zip file, and upload the layer (node-axios-layer.zip).
- For Compatible runtimes, select Node.js 16.x.
- Choose Create.
In the Lambda console, in the same Region as your domain list, choose Create function, and then do the following:
- Choose your desired runtime and architecture.
- (Optional) To use the code provided by AWS: Select Node.js 16.x as the runtime.
- Choose Change the default execution role.
- Choose Use an existing role, and then pick the role that you just created.
After the Lambda function is created, in the left menu of the Lambda console, choose Functions, and then select the function you created.
- For Code source, you can either enter the code of the Lambda function or choose the Upload from button and then choose the source for the code. AWS provides an example of functioning code on GitHub under an MIT-0 license.
(optional) To use the code provided by AWS:
- Choose the Upload from button and upload the zipped code example.
- After the code is uploaded, edit the default Runtime settings: choose Edit and set the handler to LambdaRpz.handler.
- Edit the default Layers configuration: choose Add a layer, select Specify an ARN, and enter the ARN of the layer that you created in the optional layer steps earlier.
- Edit the environment variables of the function: choose Edit and define the following three variables:
  1. Key : FirewallDomainListId | Value : <domain-list-id>
  2. Key : region | Value : <region>
  3. Key : s3Prefix | Value : <DNSFW-BUCKET-NAME>

The code that you place in the function will be able to fetch the list from URLhaus, upload the list as a file to S3, and start the import of domains.

For the Lambda function to be invoked every 5 minutes, next you will create a scheduled rule with Amazon EventBridge.

To automate the invoking of the Lambda function

In the EventBridge console, in the same AWS Region as your domain list, choose Create rule.
For Rule type, choose Schedule.
For Schedule pattern, select the option A schedule that runs at a regular rate, such as every 10 minutes, and under Rate expression set a rate of 5 minutes.

Figure 3: Console view when configuring a schedule
To select the target, choose AWS service, choose Lambda function, and then select the function that you previously created.
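If you create the schedule programmatically instead of in the console, the 5-minute rate becomes a rate expression such as `rate(5 minutes)`, which you would pass as the `ScheduleExpression` of the rule. The following Python sketch builds one; it encodes the EventBridge convention that a value of 1 takes the singular unit and larger values take the plural.

```python
def rate_expression(value: int, unit: str) -> str:
    """Build an EventBridge rate expression such as 'rate(5 minutes)'."""
    base = unit.rstrip("s")  # accept "minute" or "minutes", and so on
    if base not in {"minute", "hour", "day"} or value < 1:
        raise ValueError("value must be >= 1 and unit one of minute/hour/day")
    return f"rate({value} {base if value == 1 else base + 's'})"
```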

After the solution is deployed, your domain list will be updated every 5 minutes and look like the view in Figure 4.

Figure 4: Console view of the created domain list after it has been updated by the Lambda function

Code samples

You can use the samples in the amazon-route-53-resolver-firewall-automation-examples-2 GitHub repository to ease the automation of your domain list, and the associated updates. The repository contains script files to help you with the deployment process of the AWS CloudFormation template. Note that you need to have the AWS Command Line Interface (AWS CLI) installed and properly configured in order to use the files.

To deploy the CloudFormation stack

If you haven't done so already, create an S3 bucket to store the artifacts in the Region where you want to deploy. The name of this bucket is referenced as ParamS3ArtifactBucket, with a value of <DOC-EXAMPLE-BUCKET-ARTIFACT>.
Clone the repository locally.
git clone https://github.com/aws-samples/amazon-route-53-resolver-firewall-automation-examples-2
Build the Lambda function layer. From the /layer folder, use the provided script.
. ./build-layer.sh
Zip and upload the artifact to the bucket created in step 1. From the root folder, use the provided script.
. ./zipupload.sh <ParamS3ArtifactBucket>
Deploy the AWS CloudFormation stack by using either the AWS CLI or the CloudFormation console.
- To deploy by using the AWS CLI, from the root folder, run the following command, making sure to replace <region>, <DOC-EXAMPLE-BUCKET-ARTIFACT>, <DNSFW-BUCKET-NAME>, and <DomainListName> with your own values.
```
aws --region <region> cloudformation create-stack --stack-name DNSFWStack --capabilities CAPABILITY_NAMED_IAM --template-body file://./DNSFWStack.cfn.yaml --parameters ParameterKey=ParamS3ArtifactBucket,ParameterValue=<DOC-EXAMPLE-BUCKET-ARTIFACT> ParameterKey=ParamS3RpzBucket,ParameterValue=<DNSFW-BUCKET-NAME> ParameterKey=ParamFirewallDomainListName,ParameterValue=<DomainListName>
```
- To deploy by using the console, do the following:
  1. In the CloudFormation console, choose Create stack, and then choose With new resources (standard).
  2. On the creation screen, choose Template is ready, and upload the provided DNSFWStack.cfn.yaml file.
  3. Enter a stack name and configure the requested parameters with your desired configuration and outcomes. These parameters include the following:
    - The name of your firewall domain list.
    - The name of the S3 bucket that contains Lambda artifacts.
    - The name of the S3 bucket that will be created to contain the files with the domain information from URLhaus.
  4. Acknowledge that the template requires IAM permission because it will create the role for the Lambda function and manage its IAM policy, and then choose Create stack.

After a few minutes, all the resources should be created and the CloudFormation stack deployed, as shown in Figure 5. About 5 minutes later, your domain list should be updated.

Figure 5: Console view of CloudFormation after the stack has been deployed

Conclusions and cost

In this blog post, you learned about creating and automating the update of a domain list that you fully control. To go further, you can extend and replicate the architecture pattern to fetch domain names from other sources by editing the source code of the Lambda function.

After the solution is in place, in order for the filtering to be effective, you need to create a rule group referencing the domain list and associate the rule group with some of your VPCs.

For cost information, see the AWS Pricing Calculator. This solution will be invoked 60 (minutes) * 24 (hours) * 30 (days) / 5 (minutes) = 8,640 times per month, invoking the Lambda function that will run for an average of 400 minutes, storing an average of 0.5 GB in Amazon S3, and creating a domain list that averages 1,500 domains. According to our public pricing, and without factoring in the AWS Free Tier, this will incur the estimated total cost of $1.43 per month for the filtering of 1 million DNS requests.
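The invocation count in this estimate follows directly from the schedule: the minutes in a 30-day month divided by the 5-minute interval. The following Python sketch reproduces that arithmetic, which is useful if you change the schedule; it assumes the interval divides the month evenly, as it does here.

```python
def monthly_invocations(interval_minutes: int, days: int = 30) -> int:
    """Number of scheduled invocations in a month for a fixed-rate schedule."""
    minutes_per_month = 60 * 24 * days  # 43,200 minutes in a 30-day month
    if minutes_per_month % interval_minutes:
        raise ValueError("interval must divide the month evenly for this estimate")
    return minutes_per_month // interval_minutes
```

For the 5-minute schedule used here, `monthly_invocations(5)` returns 8640, matching the figure above.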

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.


How we automated FAQ responses at Grab

2022-07-13 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/automated-faq

Overview and initial analysis

Knowledge management is often one of the biggest internal challenges companies face. Teams spend several working hours either searching inefficiently for information or repeatedly asking colleagues about information that is already documented somewhere. A lot of time is spent on internal employee communication channels (in our case, Slack) simply trying to figure out answers to repetitive questions. On our journey to automate the responses to these repetitive questions, we first needed to figure out exactly how much time and effort on-call engineers spend answering them.

We soon identified that many of the on-call activities for internal engineering tools involve answering internal users’ questions on various Slack channels. Many of these questions have already been asked before or are documented on the wiki. These inquiries hinder on-call engineers’ productivity and affect their ability to focus on operational tasks. Once we figured out that on-call employees spend a lot of time answering Slack queries, we set out to determine the top questions.

We considered smaller groups of teams for this study and found out that:

The most common user queries are “How do I do ABC?” or “Is XYZ broken?”.
The second most commonly asked questions revolve around access requests, approvals, or other permissions. The answer to such questions is often URLs to existing documentation.

These findings told us that we needed more than an artificial intelligence (AI) based autoresponder for repetitive questions: we also had to leverage these channels’ chat histories to identify patterns.

Gathering user votes for shortlisted vendors

To save costs and time, and considering the quality of solutions already available in the market, we decided not to reinvent the wheel and to purchase an existing product instead. To figure out which product to purchase, we needed to do a comparative analysis. And thus began our vendor comparison journey!

While comparing the feature sets offered by different vendors, we understood that our users need to play a part in this decision-making process. However, sharing our vendor analysis with our users and allowing them to choose the bot of their choice posed several challenges:

Users could be biased towards known bots (from previous experiences).
Users could be biased towards big brands with a preconceived notion that big brands mean better features and better user support.
Users may likely pick the most expensive vendor, assuming that a higher cost means higher efficiency.

To ensure that we received unbiased feedback, we opened up voting to users as follows: we highlighted the top features of each vendor’s bot compared to the other shortlisted bots, and we hid the bots’ names to avoid brand attraction. At a high level, here’s what the categorisation looked like:

Features	Vendor 1 (name hidden)	Vendor 2 (name hidden)	Vendor 3 (name hidden)
Enables crowdsourcing, everyone is incentivised to participate. Participants/SME names are visible. Everyone can access the web UI and see how the responses configured on the bot.		–	–
Lowers discussions on channels by providing easy ways to raise tickets to the team instead of discussing on Slack.	–
Only a specific set of admins (or oncall engineers) feed and maintain the bot thus ensuring information authenticity and reliability.
Easy bot feeding mechanism/web UI to update FAQs.		–
Superior natural language processing capabilities.			–
Please vote	Vendor 1	Vendor 2	Vendor 3

Although none of the options had all the features our users wanted, about 60% chose Vendor 1 (OneBar). From this, we discovered the core features that our users needed while keeping them involved in the decision-making process.

Matching our requirements with available vendors’ feature sets

Although our users made their preferences clear, we still needed to ensure that the feature sets available in the market suited our internal requirements in terms of the setup and the features available in portals that we envisioned replacing. As part of our requirements gathering process, here are some of the critical conditions that became more and more prominent:

An ability to crowdsource Slack discussions/conclusions and save them directly from Slack (preferably with a single command).
An ability to auto-respond to Slack queries without calling the bot manually.
The bot must be able to respond to queries only on the preconfigured Slack channel (not a Slack-wide auto-responder that is already available).
An ability to auto-detect frequently asked questions on the channels, which would mean less work for platform engineers, who would otherwise have to feed the bot manually and periodically.
A trusted and secured data storage setup and a responsive customer support team.

Proof of concept

We considered several tools (including some of the tools used by our HR for auto-answering employee questions). We then decided to do a complete proof of concept (POC) with OneBar to check if it fulfils our internal requirements.

These were the phases in which we conducted the POC for the shortlisted vendor (OneBar):

Phase 1: Study the traffic, see what insights OneBar shows and what it could or should potentially show. Then think about how an ideal on-call engineer or support person should behave in such an environment. For example, we could identify specific messages in the history and describe what should have happened to each one of them.

Phase 2: Create required records in OneBar and configure it to match the desired behaviour as closely as possible.

Phase 3: Let the tool run for a couple of weeks and then evaluate how well it responds to questions, how often people search directly, how much information they add, etc. OneBar displays all these metrics in the app, making it easier to monitor activity.

In addition to the OneBar POC, we investigated other solutions and did a thorough vendor comparison and analysis. After running the POC and investigating other vendors, we decided to use OneBar because its features best met our needs.

Prioritising Slack channels

Although there were many Slack channels on which we would have liked to enable the shortlisted bot, our initial contract limited its use to 20 channels, so we could not use OneBar to auto-scan more than 20 Slack channels.

Users could still chat directly with the bot to get answers to FAQs based on what had been fed into the bot’s knowledge base (KB). They could also log in to the web UI, which displays the KB along with other valuable features, including additional features for admins and experts.

Slack channels that we enabled the licensed features on were prioritised based on:

Most messages sent on the channel per month, i.e. most active channels.
Most members impacted, i.e. channels with a large member count.

To do this, we used Slack analytics reports and identified the channels that fit our prioritisation criteria.
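The prioritisation above can be sketched in TypeScript. This is a hypothetical helper: the ChannelStats fields are assumptions rather than Slack’s actual analytics export format, and it treats messages per month as the primary criterion with member count as the tie-breaker, since the post does not say how the two criteria were weighed. The default limit of 20 reflects the contract limit.

```typescript
// Hypothetical shape of one row of channel analytics; the field names are
// assumptions, not Slack's export schema.
interface ChannelStats {
  name: string;
  messagesPerMonth: number;
  memberCount: number;
}

// Rank channels by activity (messages per month), break ties by member
// count, and keep the top `limit` channels (20 under our initial contract).
function prioritiseChannels(channels: ChannelStats[], limit: number = 20): string[] {
  return [...channels]
    .sort(
      (a, b) =>
        b.messagesPerMonth - a.messagesPerMonth || b.memberCount - a.memberCount
    )
    .slice(0, limit)
    .map((c) => c.name);
}

const ranked = prioritiseChannels(
  [
    { name: "#channel-a", messagesPerMonth: 500, memberCount: 40 },
    { name: "#channel-b", messagesPerMonth: 900, memberCount: 10 },
    { name: "#channel-c", messagesPerMonth: 500, memberCount: 300 },
  ],
  2
);
console.log(ranked.join(", ")); // #channel-b, #channel-c
```

Copying the array before sorting keeps the caller’s data unchanged; a weighted score would be a reasonable alternative if both criteria matter equally.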

Change is difficult but often essential

Once we’d onboarded the vendor, we began training and educating employees on using this new knowledge management system for all their FAQs. This was a challenge: change is always difficult, but it is essential for growth.

A series of tech talks and training sessions, conducted company-wide and on smaller scales, also helped guide users about the bot’s features and capabilities.

At the start, a lack of data resulted in incorrect responses from the bot. But as the team became increasingly aware of its features and learned more about its capabilities, the number of KB items grew, resulting in a much more efficient experience. It took around one quarter of consistently feeding the bot before we saw accurate and frequent responses from it.

Crowdsourcing our internal glossary

With new acronyms and company-specific terms emerging each year, the number of acronyms and abbreviations that new joiners face is immense.

We solved this issue by using the bot’s channel-specific KB feature. We created a specific Slack channel dedicated to storing and retrieving definitions of acronyms and other words. This solution turned out to be a big hit with our users.

And who fed the bot with the terms and glossary items? Who better than our onboarding employees to train the bot to help other onboarders? A targeted campaign dedicated to feeding the bot excited many of our onboarders. They began to play around with the bot’s features and fed it as many glossary items as possible, winning swag in the process!

In a matter of weeks, the user base grew from a couple of hundred to around 3000. This effort was also called out in one of our company-wide All Hands meetings, a big win for our team!


Automating detection of security vulnerabilities and bugs in CI/CD pipelines using Amazon CodeGuru Reviewer CLI

2022-06-01 Akash Verma

Post Syndicated from Akash Verma original https://aws.amazon.com/blogs/devops/automating-detection-of-security-vulnerabilities-and-bugs-in-ci-cd-pipelines-using-amazon-codeguru-reviewer-cli/

Watts S. Humphrey, the father of software quality, famously quipped, “Every business is a software business.” Software is indeed integral to every industry. The engineers who create software are also responsible for making sure that the underlying code adheres to industry and organizational standards, performs well, and is free of security vulnerabilities that could make it susceptible to attack.

Traditionally, security testing has been the forte of a specialized security testing team, who would conduct their tests toward the end of the software development lifecycle (SDLC). The adoption of DevSecOps practices made security a shared responsibility between the development and security teams. Now, development teams can, on their own or as advised by their security team, set up and configure various code scanning tools to detect security vulnerabilities much earlier in the software delivery process (aka “Shift Left”). Meanwhile, static application security testing (SAST) has become a standard part of the SDLC. Development teams therefore need SAST tools that are easy to set up, fit seamlessly into their DevOps infrastructure, and can be configured without assistance from security or DevOps experts.

In this post, we’ll demonstrate how you can leverage the Amazon CodeGuru Reviewer Command Line Interface (CLI) to integrate CodeGuru Reviewer into your Jenkins Continuous Integration & Continuous Delivery (CI/CD) pipeline. Note that the solution isn’t limited to Jenkins; it is equally useful with other build automation tools. Moreover, it can be integrated at any stage of your SDLC as part of white-box testing. For example, you can run the CodeGuru Reviewer CLI as part of your software development process, as well as on your dev machine before committing the code.

Launched in 2020, CodeGuru Reviewer utilizes machine learning (ML) and automated reasoning to identify security vulnerabilities, inefficient uses of AWS APIs and SDKs, as well as other common coding errors. CodeGuru Reviewer employs a growing set of detectors for Java and Python to provide recommendations via the AWS Console. Customers that leverage the CodeGuru Reviewer CLI within a CI/CD pipeline also receive recommendations in a machine-readable JSON format, as well as HTML.

CodeGuru Reviewer offers native integration with Source Code Management (SCM) systems, such as GitHub, Bitbucket, and AWS CodeCommit. However, it can be used with any SCM via its CLI. The CodeGuru Reviewer CLI is a shim layer on top of the AWS Command Line Interface (AWS CLI) that simplifies the interaction with the tool by handling the uploading of artifacts, triggering of the analysis, and fetching of the results, all in a single command.

Many customers, including Mastercard, are benefiting from this new CodeGuru Reviewer CLI.

“During one of our technical retrospectives, we noticed the need to integrate Amazon CodeGuru recommendations in our build pipelines hosted on Jenkins. Not all our developers can run or check CodeGuru recommendations through the AWS console. Incorporating CodeGuru CLI in our build pipelines acts as an important quality gate and ensures that our developers can immediately fix critical issues.”
Claudio Frattari, Lead DevOps at Mastercard

Solution overview

The application deployment workflow starts by placing the application code in a GitHub repository. To automate the scenario, we have added GitHub to the Jenkins project under the “Source Code” section. We chose the GitHub option, which clones the chosen GitHub repository into the Jenkins local workspace directory.

In the build stage of the pipeline (see Figure 1), we configure the appropriate build tool to perform the code build and security analysis. In this example, we will be using Maven as the build tool.

Figure 1: Jenkins pipeline with Amazon CodeGuru Reviewer

In the post-build stage, we configure the CodeGuru Reviewer CLI to generate the recommendations based on the review.

Lastly, in the concluding stage of the pipeline, we’ll be analyzing the JSON results using jq – a lightweight and flexible command-line JSON processor, and then failing the Jenkins job if we encounter observations that are of a “Critical” severity.

Jenkins triggers the CodeGuru Reviewer review (see Figure 1) in the post-build stage, that is, after the build finishes. You can configure other stages, such as automated testing or deployment, after this stage. Passing the location of the build artifacts to the CLI also lets CodeGuru Reviewer perform a more in-depth security analysis. Build artifacts are either directories containing jar files (for example, build/libs for Gradle or target for Maven) or directories containing class hierarchies (for example, build/classes/java/main for Gradle).

Walkthrough

Now that we have an overview of the workflow, let’s dive deep and walk you through the following steps in detail:

Installing the CodeGuru Reviewer CLI
Creating a Jenkins pipeline job
Reviewing the CodeGuru Reviewer recommendations
Configuring CodeGuru Reviewer CLI’s additional options

1. Installing the CodeGuru Reviewer CLI

a. Prerequisites

To run the CLI, we must have Git, Java, Maven, and the AWS CLI installed. Verify that they’re installed on our machine by running the following commands:

java -version 
mvn --version 
aws --version 
git --version

If they aren’t installed, then download and install Java here (Amazon Corretto is a no-cost, multiplatform, production-ready distribution of the Open Java Development Kit), Maven from here, and Git from here. Instructions for installing AWS CLI are available here.

You need to create an Amazon Simple Storage Service (Amazon S3) bucket whose name begins with the prefix codeguru-reviewer-. The name must begin with this prefix because CodeGuru Reviewer expects it and because the AWS Identity and Access Management (IAM) permissions below use this name pattern. Refer to section 4(a), “Specifying Amazon S3 bucket name and policy”, for more details.
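A bucket name without the prefix doesn’t match what CodeGuru Reviewer expects, and the IAM policy in the next section, which grants S3 access only on arn:aws:s3:::codeguru-reviewer-*, would not cover it. The following TypeScript sketch checks a proposed name before you create the bucket. The bucketNameProblems helper is hypothetical, and it covers only the common S3 naming rules (length, allowed characters, first and last character, adjacent periods), not every restriction S3 enforces.

```typescript
const REQUIRED_PREFIX = "codeguru-reviewer-";

// Returns a list of problems with a proposed bucket name; an empty list means
// the name passes these checks. Only the common S3 naming rules are covered.
function bucketNameProblems(name: string): string[] {
  const problems: string[] = [];
  if (!name.startsWith(REQUIRED_PREFIX)) {
    problems.push(`must start with "${REQUIRED_PREFIX}"`);
  }
  if (name.length < 3 || name.length > 63) {
    problems.push("must be between 3 and 63 characters long");
  }
  if (!/^[a-z0-9.-]+$/.test(name)) {
    problems.push("may contain only lowercase letters, digits, periods, and hyphens");
  }
  if (!/^[a-z0-9].*[a-z0-9]$/.test(name)) {
    problems.push("must begin and end with a letter or digit");
  }
  if (name.includes("..")) {
    problems.push("must not contain two adjacent periods");
  }
  return problems;
}

console.log(bucketNameProblems("codeguru-reviewer-myteam-us-east-1")); // []
console.log(bucketNameProblems("MyBucket").length); // 3
```

Returning every problem at once, rather than failing on the first, makes the helper convenient as a pre-flight check in a setup script.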

Furthermore, we’ll need working credentials on our machine to interact with our AWS account. Learn more about setting up credentials for AWS here. The minimal permissions required to run the CodeGuru Reviewer CLI are listed in the next section.

b. Required Permissions

To use the CodeGuru Reviewer CLI, we need at least the following AWS IAM permissions, attached to an AWS IAM User or an AWS IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeguru-reviewer:ListRepositoryAssociations",
                "codeguru-reviewer:AssociateRepository",
                "codeguru-reviewer:DescribeRepositoryAssociation",
                "codeguru-reviewer:CreateCodeReview",
                "codeguru-reviewer:DescribeCodeReview",
                "codeguru-reviewer:ListRecommendations",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucket*",
                "s3:List*",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::codeguru-reviewer-*",
                "arn:aws:s3:::codeguru-reviewer-*/*"
            ],
            "Effect": "Allow"
        }
    ]
}

c. CLI installation

Download the latest version of the CodeGuru Reviewer CLI from GitHub. The following commands fetch release 0.0.1; replace the version in the URL if a newer release is available:

curl -OL https://github.com/aws/aws-codeguru-cli/releases/download/0.0.1/aws-codeguru-cli.zip
unzip aws-codeguru-cli.zip
export PATH="$PATH:$(pwd)/aws-codeguru-cli/bin"

d. Using the CLI

The CodeGuru Reviewer CLI has only one required parameter, --root-dir (or just -r), which specifies the local directory to analyze. The --src option can be used to specify one or more directories within it that contain the source code to analyze. For Java applications, the --build option can be used to specify one or more build directories.

For a demonstration, we’ll analyze the demo application. This will make sure that we’re all set for when we leverage the CLI in Jenkins. To proceed, first we clone and build the sample application, as follows:

git clone https://github.com/aws-samples/amazon-codeguru-reviewer-sample-app
cd amazon-codeguru-reviewer-sample-app
mvn clean compile

Now that we have built our demo application, we can use the aws-codeguru-cli CLI command that we added to the path to trigger the code scan:

aws-codeguru-cli --root-dir ./ --build target/classes --src src --output ./output

For additional assistance on the CLI command, reference the readme here.
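When the CLI is invoked from a script rather than typed by hand, it can help to assemble the arguments programmatically. The following TypeScript sketch builds the argument list from the options described above, plus the -c, --no-prompt, and --bucket-name options covered in section 4. The CodeGuruCliOptions interface and buildCodeGuruArgs function are hypothetical helpers, not part of the CLI, and the sketch passes a single value to --src and --build; check the CLI readme for how to supply several directories.

```typescript
// Hypothetical options object; each field maps to a CLI flag used in this post.
interface CodeGuruCliOptions {
  rootDir: string;      // --root-dir (required)
  buildDir?: string;    // --build, for example target/classes
  srcDir?: string;      // --src, for example src
  outputDir?: string;   // --output, where JSON and HTML reports are written
  commitRange?: string; // -c, for example HEAD^:HEAD
  bucketName?: string;  // --bucket-name, must begin with codeguru-reviewer-
  noPrompt?: boolean;   // --no-prompt, for unattended CI runs
}

// Build the argument vector in a fixed order, omitting unset options.
function buildCodeGuruArgs(opts: CodeGuruCliOptions): string[] {
  const args: string[] = ["--root-dir", opts.rootDir];
  if (opts.buildDir) args.push("--build", opts.buildDir);
  if (opts.srcDir) args.push("--src", opts.srcDir);
  if (opts.outputDir) args.push("--output", opts.outputDir);
  if (opts.commitRange) args.push("-c", opts.commitRange);
  if (opts.bucketName) args.push("--bucket-name", opts.bucketName);
  if (opts.noPrompt) args.push("--no-prompt");
  return args;
}

// Reproduces the command used on the demo application above.
const demoArgs = buildCodeGuruArgs({
  rootDir: "./",
  buildDir: "target/classes",
  srcDir: "src",
  outputDir: "./output",
});
console.log("aws-codeguru-cli " + demoArgs.join(" "));
// aws-codeguru-cli --root-dir ./ --build target/classes --src src --output ./output
```

Passing the resulting array to Node’s child_process.spawnSync("aws-codeguru-cli", args) avoids shell-quoting issues with paths that contain spaces.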

2. Creating a Jenkins Pipeline job

CodeGuru Reviewer can be integrated in a Jenkins Pipeline as well as a Freestyle project. In this example, we’re leveraging a Pipeline.

a. Pipeline Job Configuration

Log in to Jenkins, choose “New Item”, and then select the “Pipeline” option.
Enter a name for the project (for example, “CodeGuruPipeline”), and choose OK.

Figure 2: Creating a new Jenkins pipeline

On the “Project configuration” page, scroll down to the Pipeline section. In the pipeline script field, paste the following script (or use your own Jenkinsfile). The following example is a valid Jenkinsfile to integrate CodeGuru Reviewer with a project built using Maven.

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                // Get code from a GitHub repository. The old clone is removed
                // first so that repeated builds don't fail on an existing directory.
                sh '''
                    rm -rf amazon-codeguru-reviewer-java-detectors
                    git clone https://github.com/aws-samples/amazon-codeguru-reviewer-java-detectors.git
                '''

                // Run Maven on a Unix agent, inside the cloned project
                dir('amazon-codeguru-reviewer-java-detectors') {
                    sh 'mvn clean compile'

                    // To run Maven on a Windows agent, use the following instead
                    // bat "mvn -Dmaven.test.failure.ignore=true clean package"
                }
            }
        }
        stage('CodeGuru Reviewer') {
            steps {
                // Print the workspace layout for troubleshooting
                sh 'ls -lsa *'
                sh 'pwd'
                // WORKSPACE is set by Jenkins (in our setup it resolves to
                // /var/jenkins_home/workspace/CodeGuruPipeline); an absolute
                // path works as well.
                // -c HEAD^:HEAD analyzes only the latest commit. GIT_PREVIOUS_COMMIT
                // and GIT_COMMIT are populated only when Jenkins performs the
                // checkout itself (for example, "Pipeline script from SCM").
                sh '''
                    export BASE=${WORKSPACE}/amazon-codeguru-reviewer-java-detectors
                    export SRC=${BASE}/src
                    export OUTPUT=./output
                    /home/codeguru/aws-codeguru-cli/bin/aws-codeguru-cli --root-dir $BASE --build $BASE/target/classes --src $SRC --output $OUTPUT -c HEAD^:HEAD --no-prompt
                '''
            }
        }
        stage('Checking findings') {
            steps {
                // In this example we stop the pipeline when Critical findings
                // are detected. jq counts the recommendations whose severity is
                // Critical; POSIX test syntax is used because sh may not be bash.
                sh '''
                    CNT=$(jq '[.[] | select(.severity=="Critical")] | length' ./output/recommendations.json)
                    if [ "$CNT" -gt 0 ]; then
                      echo "Critical findings discovered. Failing."
                      exit 1
                    fi
                '''
            }
        }
    }
}

Save the configuration and select “Build now” on the side bar to trigger the build process (see Figure 3).

Figure 3: Jenkins pipeline in triggered state

3. Reviewing the CodeGuru Reviewer recommendations

Once the build process is finished, you can view the review results from CodeGuru Reviewer by selecting the most recent build in the Jenkins build history and then browsing to the output folder in its workspace. The output is available in JSON and HTML formats (Figure 4).

Figure 4: CodeGuru CLI Output

Snippets from the HTML and JSON reports are displayed in Figures 5 and 6, respectively.

In this example, our pipeline uses jq to filter the JSON results for findings with Critical severity and fails the job if there are any. Note that the output path is set with the --output option. For instance, the pipeline fails on the Critical finding at line 67 of the EventHandler.java class (Figure 5), which is flagged because it uses insecure code. Until the code is remediated, the pipeline prevents it from being deployed. Without the tool, this vulnerability could have gone to production undetected.
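For teams that prefer a script over jq, the same quality gate can be sketched in TypeScript, extended with a configurable threshold. This is an illustrative alternative, not part of the CodeGuru Reviewer CLI: it assumes recommendations.json is an array of objects with a severity field (the shape the jq filter implies) and assumes the severity ordering Info < Low < Medium < High < Critical.

```typescript
// Severity levels from lowest to highest. The jq filter relies only on
// "Critical"; the full ordering here is an assumption about the report.
const SEVERITY_ORDER = ["Info", "Low", "Medium", "High", "Critical"] as const;
type Severity = (typeof SEVERITY_ORDER)[number];

// Minimal view of one entry in recommendations.json.
interface Recommendation {
  severity: string;
}

// Count recommendations whose severity is at or above `threshold`.
// Unknown severity strings get index -1 and are never counted.
function countFindings(reportJson: string, threshold: Severity = "Critical"): number {
  const recommendations = JSON.parse(reportJson) as Recommendation[];
  const minimum = SEVERITY_ORDER.indexOf(threshold);
  return recommendations.filter(
    (r) => SEVERITY_ORDER.indexOf(r.severity as Severity) >= minimum
  ).length;
}

const sample = JSON.stringify([
  { severity: "Critical" },
  { severity: "High" },
  { severity: "Info" },
]);
console.log(countFindings(sample)); // 1
console.log(countFindings(sample, "High")); // 2
```

In a pipeline step, a Node script would read ./output/recommendations.json, call countFindings, and exit with a non-zero status when the result is greater than zero. Lowering the threshold to "High" tightens the gate without changing the rest of the pipeline.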

Figure 5: CodeGuru HTML Report

Figure 6: CodeGuru JSON recommendations

4. Configuring CodeGuru Reviewer CLI’s additional options

a. Specifying Amazon S3 bucket name and policy

CodeGuru Reviewer needs one Amazon S3 bucket in which the CLI stores artifacts while the analysis is running. The artifacts are deleted after the analysis completes. The same bucket is reused for all repositories analyzed in the same account and Region, unless you specify otherwise. CodeGuru Reviewer expects the bucket name to begin with codeguru-reviewer-, and at this time you can’t use a different naming pattern. If you want to use a specific bucket rather than the default one, pass its name with the --bucket-name option; the name must still begin with codeguru-reviewer-.

Select the Permissions tab of your S3 bucket. Update the Block public access settings, and then add the following S3 bucket policy.

Figure 7: S3 bucket settings

S3 bucket policy:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"PublicRead",
         "Effect":"Allow",
         "Principal":"*",
         "Action":"s3:GetObject",
         "Resource":"[Change to ARN for your S3 bucket]/*"
      }
   ]
}

Note that if you need to change the bucket name, first disassociate the existing bucket: in the AWS console, go to CodeGuru → CI workflows, select the workflow, and choose Disassociate Workflow.

b. Analyzing a single commit

The CLI also lets us specify a specific commit range to analyze. This can lead to faster and more cost-effective scans for the incremental code changes, instead of a full repository scan. For example, if we just want to analyze the last commit, we can run:

aws-codeguru-cli -r ./ -s src/main/java -b build/libs -c HEAD^:HEAD --no-prompt

Here, we use the -c option to specify that we only want to analyze the commits between HEAD^ (the previous commit) and HEAD (the current commit). Moreover, we add the --no-prompt option to automatically answer yes to the CLI’s prompts. This option is useful if we plan to use the CLI in an automated way, such as in our CI/CD workflow.

c. Encrypting artifacts

CodeGuru Reviewer lets us use a customer managed key to encrypt the content of the S3 bucket that is used to store the source and build artifacts. To achieve this, create a customer managed key in AWS Key Management Service (AWS KMS) (see Figure 8).

Figure 8: KMS settings

We must grant CodeGuru Reviewer permission to decrypt artifacts with this key by adding the following statement to our key policy:

{
   "Sid":"Allow CodeGuru to use the key to decrypt artifact",
   "Effect":"Allow",
   "Principal":{
      "AWS":"*"
   },
   "Action":[
      "kms:Decrypt",
      "kms:DescribeKey"
   ],
   "Resource":"*",
   "Condition":{
      "StringEquals":{
         "kms:ViaService":"codeguru-reviewer.amazonaws.com",
         "kms:CallerAccount":[
            "YOUR AWS ACCOUNT ID"
         ]
      }
   }
}

Then, enable server-side encryption for the S3 bucket that we’re using with CodeGuru Reviewer (Figure 9).

S3 bucket settings:

Figure 9: S3 bucket encryption settings

After we enable encryption on the bucket, we must delete all the CodeGuru repository associations that use this bucket, and then recreate them by analyzing the repositories while providing the key (as in the following example, Figure 10):

Figure 10: CodeGuru CI Workflow

Note that the first analysis of a repository after it is associated always triggers a full repository scan. For later runs, consider setting the -c option to analyze only a commit range.

Cleaning Up

At this stage, you may choose to delete the resources created while following this post to avoid incurring unwanted costs:

Delete Amazon S3 bucket.
Delete AWS KMS key.
Delete the Jenkins installation, if not required further.

Conclusion

In this post, we outlined how you can integrate Amazon CodeGuru Reviewer CLI with the Jenkins open-source build automation tool to perform code analysis as part of your code build pipeline and act as a quality gate. We showed you how to create a Jenkins pipeline job and integrate the CodeGuru Reviewer CLI to detect issues in your Java and Python code, as well as access the recommendations for remediating these issues. We presented an example where you can stop the build upon finding critical violations. Furthermore, we discussed how you can specify a commit range to avoid a full repo scan, and how the S3 bucket used by CodeGuru Reviewer to store artifacts can be encrypted using customer managed keys.

The CodeGuru Reviewer CLI offers you a one-line command to scan any code on your machine and retrieve recommendations. You can run the CLI anywhere you can run AWS commands. In other words, you can use the CLI to integrate CodeGuru Reviewer into your favourite CI tool, as a pre-commit hook, or anywhere else in your workflow. You can also combine CodeGuru Reviewer with Dynamic Application Security Testing (DAST) and Software Composition Analysis (SCA) tools to achieve a hybrid application security testing method that combines the inside-out and outside-in testing approaches, cross-references results, and detects vulnerabilities that both exist and are exploitable.

Hopefully, you have found this post informative and the proposed solution useful. If you need help, AWS Professional Services can help implement this solution in your enterprise and introduce you to our AWS DevOps services and offerings.


Manage application security and compliance with the AWS Cloud Development Kit and cdk-nag

2022-05-25 Rodney Bozo

Post Syndicated from Rodney Bozo original https://aws.amazon.com/blogs/devops/manage-application-security-and-compliance-with-the-aws-cloud-development-kit-and-cdk-nag/

Infrastructure as Code (IaC) is an important part of cloud applications. Developers rely on various Static Application Security Testing (SAST) tools to identify security and compliance issues and mitigate them early, before releasing their applications to production. Additionally, SAST tools often provide reporting mechanisms that can help developers verify compliance during security reviews.

cdk-nag integrates directly into AWS Cloud Development Kit (AWS CDK) applications to provide identification and reporting mechanisms similar to SAST tooling.

This post demonstrates how to integrate cdk-nag into an AWS CDK application to provide continual feedback and help align your applications with best practices.

Overview of cdk-nag

cdk-nag (inspired by cfn_nag) validates that the state of constructs within a given scope complies with a given set of rules. Additionally, cdk-nag provides a rule suppression and compliance reporting system. cdk-nag validates constructs by extending AWS CDK Aspects. If you’re interested in learning more about the AWS CDK Aspect system, then you should check out this post.
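To make the Aspect mechanism concrete, here is a dependency-free TypeScript model of how an aspect walks a construct tree and how a rule can inspect each node. The ConstructNode, Aspect, applyAspect, and BucketEncryptionRule names are illustrative only; in aws-cdk-lib the corresponding roles are played by IConstruct, the visit method of the IAspect interface, and Aspects.of(scope).add(...), and the real AwsSolutions rules are more thorough than this toy check.

```typescript
// Illustrative stand-in for a construct in the tree.
interface ConstructNode {
  path: string;                   // construct path, e.g. "/CdkTestStack/Bucket/Resource"
  type: string;                   // CloudFormation type for L1 resources, otherwise a label
  props: Record<string, unknown>; // CloudFormation properties of the resource
  children: ConstructNode[];
}

interface Aspect {
  visit(node: ConstructNode): void;
}

// Visit the node and then every descendant, depth first.
function applyAspect(node: ConstructNode, aspect: Aspect): void {
  aspect.visit(node);
  for (const child of node.children) {
    applyAspect(child, aspect);
  }
}

// A toy rule in the spirit of AwsSolutions-S3: record every bucket resource
// that has no BucketEncryption property.
class BucketEncryptionRule implements Aspect {
  readonly violations: string[] = [];

  visit(node: ConstructNode): void {
    if (node.type === "AWS::S3::Bucket" && node.props["BucketEncryption"] === undefined) {
      this.violations.push(node.path);
    }
  }
}

const stack: ConstructNode = {
  path: "/CdkTestStack",
  type: "Stack",
  props: {},
  children: [
    {
      path: "/CdkTestStack/Bucket",
      type: "Bucket",
      props: {},
      children: [
        { path: "/CdkTestStack/Bucket/Resource", type: "AWS::S3::Bucket", props: {}, children: [] },
      ],
    },
  ],
};

const rule = new BucketEncryptionRule();
applyAspect(stack, rule);
console.log(rule.violations.join(", ")); // /CdkTestStack/Bucket/Resource
```

A NagPack applies many such checks during the same traversal, which is why adding one aspect at the app level covers every construct in the app.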

cdk-nag includes several rule sets (NagPacks) to validate your application against. As of this post, cdk-nag includes the AWS Solutions, HIPAA Security, NIST 800-53 rev 4, NIST 800-53 rev 5, and PCI DSS 3.2.1 NagPacks. You can pick and choose different NagPacks and apply as many as you wish to a given scope.

cdk-nag rules can either be warnings or errors. Both warnings and errors will be displayed in the console and compliance reports. Only unsuppressed errors will prevent applications from deploying with the cdk deploy command.
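The decision rule described above, where every finding is reported but only unsuppressed errors block deployment, can be modeled in a few lines of TypeScript. This is a conceptual sketch: NagFinding, Suppressions, and blocksDeployment are illustrative names, not cdk-nag’s API, and the warning rule ID is hypothetical.

```typescript
type RuleLevel = "Warning" | "Error";

// Illustrative finding record; not cdk-nag's internal representation.
interface NagFinding {
  ruleId: string;       // e.g. "AwsSolutions-S1"
  resourcePath: string; // e.g. "/CdkTestStack/Bucket/Resource"
  level: RuleLevel;
}

// Suppressed rule IDs, keyed by resource path.
type Suppressions = Map<string, Set<string>>;

function isSuppressed(finding: NagFinding, suppressions: Suppressions): boolean {
  return suppressions.get(finding.resourcePath)?.has(finding.ruleId) ?? false;
}

// Every finding is reported, but only an unsuppressed error blocks deployment.
function blocksDeployment(findings: NagFinding[], suppressions: Suppressions): boolean {
  return findings.some((f) => f.level === "Error" && !isSuppressed(f, suppressions));
}

const findings: NagFinding[] = [
  { ruleId: "AwsSolutions-S1", resourcePath: "/CdkTestStack/Bucket/Resource", level: "Error" },
  { ruleId: "HypotheticalWarningRule", resourcePath: "/CdkTestStack/Bucket/Resource", level: "Warning" },
];

console.log(blocksDeployment(findings, new Map())); // true
const suppressed: Suppressions = new Map([
  ["/CdkTestStack/Bucket/Resource", new Set(["AwsSolutions-S1"])],
]);
console.log(blocksDeployment(findings, suppressed)); // false: only a warning remains
```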

You can see which rules are implemented in each of the NagPacks in the Rules Documentation in the GitHub repository.

Walkthrough

This walkthrough sets up a minimal AWS CDK v2 application and demonstrates how to apply a NagPack to the application, how to suppress rules, and how to view a report of the findings. Although cdk-nag supports Python, TypeScript, Java, and .NET AWS CDK applications, we’ll use TypeScript for this walkthrough.

Prerequisites

For this walkthrough, you should have the following prerequisites:

A local installation of and experience using the AWS CDK.

Create a baseline AWS CDK application

In this section you will create and synthesize a small AWS CDK v2 application with an Amazon Simple Storage Service (Amazon S3) bucket. If you are unfamiliar with the AWS CDK, learn how to install and set up the AWS CDK in its open source GitHub repository.

Run the following commands to create the AWS CDK application:

mkdir CdkTest
cd CdkTest
cdk init app --language typescript

Replace the contents of the lib/cdk_test-stack.ts with the following:

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    const bucket = new Bucket(this, 'Bucket')
  }
}

Run the following commands to install dependencies and synthesize our sample app:

npm install
npx cdk synth

You should see an AWS CloudFormation template with an S3 bucket both in your terminal and in cdk.out/CdkTestStack.template.json.

Apply a NagPack in your application

In this section, you’ll install cdk-nag, include the AwsSolutions NagPack in your application, and view the results.

Run the following command to install cdk-nag:

npm install cdk-nag

Replace the contents of the bin/cdk_test.ts with the following:

#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { CdkTestStack } from '../lib/cdk_test-stack';
import { AwsSolutionsChecks } from 'cdk-nag'
import { Aspects } from 'aws-cdk-lib';

const app = new cdk.App();
// Add the cdk-nag AwsSolutions Pack with extra verbose logging enabled.
Aspects.of(app).add(new AwsSolutionsChecks({ verbose: true }))
new CdkTestStack(app, 'CdkTestStack', {});

Run the following command to view the output and generate the compliance report:

npx cdk synth

The output should look similar to the following (Note: SSE stands for Server-side encryption):

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S1: The S3 Bucket has server access logs disabled. The bucket should have server access logging enabled to provide detailed records for the requests that are made to the bucket.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S2: The S3 Bucket does not have public access restricted and blocked. The bucket should have public access restricted and blocked to prevent unauthorized access.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S10: The S3 Bucket does not require requests to use SSL. You can use HTTPS (TLS) to help prevent potential attackers from eavesdropping on or manipulating network traffic using person-in-the-middle or similar attacks. You should allow only encrypted connections over HTTPS (TLS) using the aws:SecureTransport condition on Amazon S3 bucket policies.

Found errors

Applying the AwsSolutions NagPack to the application produced several errors in the console (AwsSolutions-S1, AwsSolutions-S2, AwsSolutions-S3, and AwsSolutions-S10). The compliance report, cdk.out/AwsSolutions-CdkTestStack-NagReport.csv, records the same errors along with the rules that passed:

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Remediating and suppressing errors

In this section, you’ll remediate the AwsSolutions-S10 error, suppress the AwsSolutions-S1 error at the stack level, suppress the AwsSolutions-S2 error at the resource level, leave the AwsSolutions-S3 error unremediated, and view the results.

Replace the contents of the lib/cdk_test-stack.ts with the following:

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { NagSuppressions } from 'cdk-nag'

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // The local scope 'this' is the Stack. 
    NagSuppressions.addStackSuppressions(this, [
      {
        id: 'AwsSolutions-S1',
        reason: 'Demonstrate a stack level suppression.'
      },
    ])
    // Remediating AwsSolutions-S10 by enforcing SSL on the bucket.
    const bucket = new Bucket(this, 'Bucket', { enforceSSL: true })
    NagSuppressions.addResourceSuppressions(bucket, [
      {
        id: 'AwsSolutions-S2',
        reason: 'Demonstrate a resource level suppression.'
      },
    ])
  }
}

Run the cdk synth command again:

npx cdk synth

The output should look similar to the following:

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

Found errors

The cdk.out/AwsSolutions-CdkTestStack-NagReport.csv contains more details about rule compliance, non-compliance, and suppressions.

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a stack level suppression.","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a resource level suppression.","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Note also that the resulting cdk.out/CdkTestStack.template.json template contains the cdk-nag suppression data. Because the suppressions are embedded in the template’s metadata (for the stack-level suppression) and in the resource’s metadata (for the resource-level suppression), the template shows transparently which rules weren’t applied to the application, and why.

{
  "Metadata": {
    "cdk_nag": {
      "rules_to_suppress": [
        {
          "id": "AwsSolutions-S1",
          "reason": "Demonstrate a stack level suppression."
        }
      ]
    }
  },
  "Resources": {
    "BucketDEB6E181": {
      "Type": "AWS::S3::Bucket",
      "UpdateReplacePolicy": "Retain",
      "DeletionPolicy": "Retain",
      "Metadata": {
        "aws:cdk:path": "CdkTestStack/Bucket/Resource",
        "cdk_nag": {
          "rules_to_suppress": [
            {
              "id": "AwsSolutions-S2",
              "reason": "Demonstrate a resource level suppression."
            }
          ]
        }
      }
    },
  ...
  },
  ...
}
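Because the suppressions travel with the template, a reviewer or a pipeline step could read them directly from cdk.out/CdkTestStack.template.json. The TypeScript below is a minimal sketch that assumes the cdk_nag metadata layout shown above; the types and the effectiveSuppressions helper are hypothetical. It merges the stack-level suppressions, which cdk-nag applied to the bucket in the report above, with each resource’s own suppressions.

```typescript
// Hypothetical types covering only the parts of the template this sketch reads.
interface NagSuppression {
  id: string;
  reason: string;
}

interface NagMetadata {
  cdk_nag?: { rules_to_suppress?: NagSuppression[] };
  [key: string]: unknown; // other metadata, e.g. "aws:cdk:path"
}

interface CfnTemplate {
  Metadata?: NagMetadata;
  Resources?: Record<string, { Type: string; Metadata?: NagMetadata }>;
}

// Map each resource's logical ID to the cdk-nag suppressions in effect for it:
// the stack-level list from the template Metadata plus the resource's own list.
function effectiveSuppressions(template: CfnTemplate): Record<string, NagSuppression[]> {
  const stackLevel: NagSuppression[] = template.Metadata?.cdk_nag?.rules_to_suppress ?? [];
  const resources: Record<string, { Type: string; Metadata?: NagMetadata }> =
    template.Resources ?? {};
  const result: Record<string, NagSuppression[]> = {};
  for (const logicalId of Object.keys(resources)) {
    const resourceLevel: NagSuppression[] =
      resources[logicalId].Metadata?.cdk_nag?.rules_to_suppress ?? [];
    result[logicalId] = stackLevel.concat(resourceLevel);
  }
  return result;
}

// Example: the relevant parts of the template shown above.
const synthesized: CfnTemplate = {
  Metadata: {
    cdk_nag: {
      rules_to_suppress: [{ id: "AwsSolutions-S1", reason: "Demonstrate a stack level suppression." }],
    },
  },
  Resources: {
    BucketDEB6E181: {
      Type: "AWS::S3::Bucket",
      Metadata: {
        "aws:cdk:path": "CdkTestStack/Bucket/Resource",
        cdk_nag: {
          rules_to_suppress: [{ id: "AwsSolutions-S2", reason: "Demonstrate a resource level suppression." }],
        },
      },
    },
  },
};

// Logs the suppressed rule IDs for the bucket (AwsSolutions-S1 and AwsSolutions-S2).
console.log(effectiveSuppressions(synthesized)["BucketDEB6E181"].map((s) => s.id));
```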

Reflecting on the Walkthrough

In this section, you learned how to apply a NagPack to your application, remediate or suppress warnings and errors, and review the compliance reports. The reporting and suppression systems give development and security teams within an organization a way to work together to identify and mitigate potential security and compliance issues. Security teams can choose which NagPacks developers should apply to their applications. Developers can then use the feedback to quickly remediate issues, and security teams can use the reports to validate compliance. Furthermore, developers and security teams can use suppressions to transparently document exceptions to rules that they’ve decided not to follow.
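To make the compliance reports easier to consume, for example in a CI job that summarizes findings, you could parse the CSV report that cdk-nag writes to cdk.out. The following TypeScript is a minimal sketch that assumes the column layout shown in the reports earlier in this walkthrough; parseCsvLine, parseNagReport, and unsuppressedErrors are hypothetical helper names, not part of cdk-nag. It handles quoted fields (including commas and doubled quotes inside them) but not line breaks inside fields.

```typescript
// One row of the cdk-nag CSV compliance report (column order as in the reports above).
interface NagReportRow {
  ruleId: string;
  resourceId: string;
  compliance: string;
  exceptionReason: string;
  ruleLevel: string;
  ruleInfo: string;
}

// Split one CSV line into fields, honoring double-quoted fields and
// doubled quotes ("") inside them. Embedded line breaks are not supported.
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current); // field separator
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Parse the whole report, skipping the header row and blank lines.
function parseNagReport(csv: string): NagReportRow[] {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== "");
  return lines.slice(1).map((line) => {
    const [ruleId, resourceId, compliance, exceptionReason, ruleLevel, ruleInfo] =
      parseCsvLine(line);
    return { ruleId, resourceId, compliance, exceptionReason, ruleLevel, ruleInfo };
  });
}

// Rows still non-compliant at the Error level (neither remediated nor
// suppressed), formatted as "<rule> on <resource>".
function unsuppressedErrors(rows: NagReportRow[]): string[] {
  return rows
    .filter((row) => row.compliance === "Non-Compliant" && row.ruleLevel === "Error")
    .map((row) => `${row.ruleId} on ${row.resourceId}`);
}

// Example using two rows from the report produced after adding suppressions.
const sampleReport = [
  "Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info",
  '"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a stack level suppression.","Error","The S3 Bucket has server access logs disabled."',
  '"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."',
].join("\n");

console.log(unsuppressedErrors(parseNagReport(sampleReport)));
```

Suppressed rows such as AwsSolutions-S1 are excluded, so the output lists only AwsSolutions-S3, matching the single error that cdk synth printed after the suppressions were added.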

Advanced usage and further reading

This section briefly covers some advanced options for using cdk-nag.

Unit Testing with the AWS CDK Assertions Library

The Annotations submodule of the AWS CDK assertions library lets you check for cdk-nag warnings and errors without AWS credentials by integrating a NagPack into your application unit tests. Read this post for further information about the AWS CDK assertions module. The following is an example of using assertions with a TypeScript AWS CDK application and Jest for unit testing.

import { Annotations, Match } from 'aws-cdk-lib/assertions';
import { App, Aspects, Stack } from 'aws-cdk-lib';
import { AwsSolutionsChecks } from 'cdk-nag';
import { CdkTestStack } from '../lib/cdk_test-stack';

describe('cdk-nag AwsSolutions Pack', () => {
  let stack: Stack;
  let app: App;
  // In this case we can use beforeAll() over beforeEach() since our tests 
  // do not modify the state of the application 
  beforeAll(() => {
    // GIVEN
    app = new App();
    stack = new CdkTestStack(app, 'test');

    // WHEN
    Aspects.of(stack).add(new AwsSolutionsChecks());
  });

  // THEN
  test('No unsuppressed Warnings', () => {
    const warnings = Annotations.fromStack(stack).findWarning(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(warnings).toHaveLength(0);
  });

  test('No unsuppressed Errors', () => {
    const errors = Annotations.fromStack(stack).findError(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(errors).toHaveLength(0);
  });
});

Additionally, many testing frameworks include watch functionality: a background process that reruns tests whenever files in your project change, giving you fast feedback. For example, when using the AWS CDK in JavaScript/TypeScript, you can use the Jest CLI watch commands. When Jest watch detects a file change, it attempts to run the unit tests related to the changed file. You can use this to automatically run cdk-nag-related tests as you make changes to your AWS CDK application.

CDK Watch

When developing in non-production environments, consider using AWS CDK Watch with a NagPack for fast feedback. AWS CDK Watch attempts to synthesize and then deploy your changes whenever you save your files. Because Aspects run during synthesis, any NagPacks applied to your application also run on save. As in the walkthrough, unsuppressed errors will prevent deployments, all messages will be output to the console, and the compliance reports will be generated. Read this post for further information about AWS CDK Watch.

Conclusion

In this post, you learned how to use cdk-nag in your AWS CDK applications. To learn more about using cdk-nag in your applications, check out the README in the GitHub Repository. If you would like to learn how to create your own rules and NagPacks, then check out the developer documentation. The repository is open source and welcomes community contributions and feedback.


Streamlining evidence collection with AWS Audit Manager

2022-03-03 Nicholas Parks

Post Syndicated from Nicholas Parks original https://aws.amazon.com/blogs/security/streamlining-evidence-collection-with-aws-audit-manager/

In this post, we will show you how to deploy a solution into your Amazon Web Services (AWS) account that enables you to attach manual evidence to controls using AWS Audit Manager. Making evidence collection as seamless as possible minimizes audit fatigue and helps you maintain a strong compliance posture.

As an AWS customer, you can use APIs to deliver high-quality software at a rapid pace. If you have compliance-focused teams that rely on manual, ticket-based processes, you might find it difficult to document audit changes as those changes increase in velocity and volume.

As your organization works to meet audit and regulatory obligations, you can save time by incorporating audit compliance processes into a DevOps model. You can use modern services like Audit Manager to make this easier. Audit Manager automates evidence collection and generates reports, which helps reduce manual auditing efforts and enables you to scale your cloud auditing capabilities along with your business.

AWS Audit Manager uses services such as AWS Security Hub, AWS Config, and AWS CloudTrail to automatically collect and organize evidence, such as resource configuration snapshots, user activity, and compliance check results. However, for controls represented in your software or processes without an AWS service-specific metric to gather, you need to manually create and provide documentation as evidence to demonstrate that you have established organizational processes to maintain compliance. The solution in this blog post streamlines these types of activities.

Solution architecture

This solution creates an HTTPS API endpoint that you can integrate with other software development lifecycle (SDLC) solutions, IT service management (ITSM) products, and clinical trial management system (CTMS) solutions that capture trial process change amendment documentation (for example, at pharmaceutical companies that use AWS to build robust pharmacovigilance solutions). The endpoint can also serve as a backend microservice for an application that lets contract research organization (CRO) investigators add their compliance supporting documentation.

In this solution’s current form, you can submit an evidence file payload along with the assessment and control details to the API, and the solution ties all of the information together for the audit report. This post and solution are intended for engineering teams that are looking for a way to accelerate evidence collection. To maximize the effectiveness of this solution, your engineering team will also need to collaborate with cross-functional groups, such as audit and business stakeholders, to design a process and service that constructs and sends messages to the API, and to scale out usage across the organization.

To download the code for this solution, and the configuration that enables you to set up auto-ingestion of manual evidence, see the aws-audit-manager-manual-evidence-automation GitHub repository.

Architecture overview

In this solution, you use AWS Serverless Application Model (AWS SAM) templates to build the solution and deploy it to your AWS account. See Figure 1 for an illustration of the high-level architecture.

Figure 1. The architecture of the AWS Audit Manager automation solution

The SAM template creates resources that support the following workflow:

1. A client calls an Amazon API Gateway endpoint, sending a payload that includes the assessment details and the evidence file.
2. An AWS Lambda function implements the API and handles the request.
3. The Lambda function uploads the evidence to an Amazon Simple Storage Service (Amazon S3) bucket (3a) and uses AWS Key Management Service (AWS KMS) to encrypt the data (3b).
4. The Lambda function also starts the AWS Step Functions workflow.
5. Within the Step Functions workflow, a Standard Workflow calls two Lambda functions. The first looks for a matching control within an assessment, and the second updates that control within the assessment with the evidence.
6. When the Step Functions workflow concludes, it sends a success or failure notification to subscribers of an Amazon Simple Notification Service (Amazon SNS) topic.
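The first Lambda function in the Step Functions workflow essentially performs a lookup: it resolves the assessment, control set, and control names sent by the client to the identifiers needed to attach evidence. The TypeScript below sketches that matching logic over a hypothetical, simplified in-memory model; the Assessment, ControlSet, and Control types and the findControl helper are illustrative assumptions, not the AWS Audit Manager API shapes or the solution’s actual Lambda code.

```typescript
// Hypothetical, simplified model used only for illustration; these are NOT
// the AWS Audit Manager API types.
interface Control {
  id: string;
  name: string;
}

interface ControlSet {
  name: string;
  controls: Control[];
}

interface Assessment {
  id: string;
  name: string;
  controlSets: ControlSet[];
}

interface ControlMatch {
  assessmentId: string;
  controlSetName: string;
  controlId: string;
}

// Resolve the three names sent by the client (assessment, control set, control)
// to the identifiers needed to attach evidence. Returns undefined if nothing matches.
function findControl(
  assessments: Assessment[],
  assessmentName: string,
  controlSetName: string,
  controlName: string
): ControlMatch | undefined {
  for (const assessment of assessments) {
    if (assessment.name !== assessmentName) continue;
    for (const controlSet of assessment.controlSets) {
      if (controlSet.name !== controlSetName) continue;
      for (const control of controlSet.controls) {
        if (control.name === controlName) {
          return { assessmentId: assessment.id, controlSetName: controlSet.name, controlId: control.id };
        }
      }
    }
  }
  return undefined;
}

// Example data using the names from the curl request in the testing section;
// the IDs are made up.
const assessments: Assessment[] = [
  {
    id: "example-assessment-id",
    name: "GxP21cfr11",
    controlSets: [
      { name: "General requirements", controls: [{ id: "example-control-id", name: "11.100(a)" }] },
    ],
  },
];

console.log(findControl(assessments, "GxP21cfr11", "General requirements", "11.100(a)"));
```

When no control matches, an implementation along these lines would need to fail the workflow explicitly, so that subscribers receive the failure notification rather than a silent success.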

Deploy the solution

The project available in the aws-audit-manager-manual-evidence-automation GitHub repository contains source code and supporting files for a serverless application you can deploy with the AWS SAM command line interface (CLI). It includes the following files and folders:

src: Code for the application’s Lambda implementation of the Step Functions workflow. It also includes a Step Functions definition file.
template.yml: A template that defines the application’s AWS resources.

Resources for this project are defined in the template.yml file. You can update the template to add AWS resources through the same deployment process that updates your application code.

Prerequisites

This solution assumes the following:

AWS Audit Manager is enabled.
You have already created an assessment in AWS Audit Manager.
You have the necessary tools to use the AWS SAM CLI (see details in the table that follows).

For more information about setting up Audit Manager and selecting a framework, see Getting started with Audit Manager in the blog post AWS Audit Manager Simplifies Audit Preparation.

The AWS SAM CLI is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. The AWS SAM CLI uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application’s build environment and API.

To use the AWS SAM CLI, you need the following tools:

AWS SAM CLI: Install the AWS SAM CLI
Node.js: Install Node.js 14, including the npm package management tool
Docker: Install Docker community edition

To deploy the solution

Open your terminal and use the following command to create a folder to clone the project into, then navigate to that folder. Be sure to replace <FolderName> with your own value.
mkdir Desktop/<FolderName> && cd $_
Clone the project into the folder you just created by using the following command.
git clone https://github.com/aws-samples/aws-audit-manager-manual-evidence-automation.git
Navigate into the newly created project folder by using the following command.
cd aws-audit-manager-manual-evidence-automation
In your terminal, use the following AWS SAM CLI command to build the source of your application.
sam build
Use the following command to package and deploy your application to AWS. Be sure to replace <DOC-EXAMPLE-BUCKET> with your own unique S3 bucket name.
sam deploy --guided --parameter-overrides paramBucketName=<DOC-EXAMPLE-BUCKET>
When prompted, enter the AWS Region where AWS Audit Manager is configured. For the remaining prompts, keep the default values.
(Optional) To activate IAM authentication for API Gateway, override the default value of the paramUseIAMwithGateway parameter by adding paramUseIAMwithGateway=AWS_IAM to the parameter overrides, for example:
sam deploy --guided --parameter-overrides paramBucketName=<DOC-EXAMPLE-BUCKET> paramUseIAMwithGateway=AWS_IAM

To test the deployed solution

After you deploy the solution, send a request like the following to an assessment by using curl. Be sure to replace <YOURAPIENDPOINT>, <AWS REGION>, and <PATH TO FILE> with your own values.

curl --location --request POST \
'https://<YOURAPIENDPOINT>.execute-api.<AWS REGION>.amazonaws.com/Prod' \
--header 'x-api-key: ' \
--form 'payload=@"<PATH TO FILE>"' \
--form 'AssessmentName="GxP21cfr11"' \
--form 'ControlSetName="General requirements"' \
--form 'ControlIdName="11.100(a)"'

Then check that your file is attached as evidence to the control in your assessment.

Form-data interface parameters

The API implements a multipart form-data interface that expects four parameters:

AssessmentName: The name of the assessment in Audit Manager. In this example, the AssessmentName is GxP21cfr11.
ControlSetName: The display name of a control set within the assessment. In this example, the ControlSetName is General requirements.
ControlIdName: A particular control within the control set. In this example, the ControlIdName is 11.100(a).
payload: The file that represents the evidence to be uploaded.
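Before calling the API, a client can check that all four fields are present and build the endpoint URL. The TypeScript below is a minimal sketch; REQUIRED_FIELDS, missingFields, and invokeUrl are hypothetical helpers, payload is represented here by the local file path (as with the curl @ syntax), and the default stage name Prod is taken from the sample curl URL. The URL pattern https://{api-id}.execute-api.{region}.amazonaws.com/{stage} is the standard invoke URL format for an API Gateway REST API stage.

```typescript
// The four form-data fields the evidence API expects (names from the list above).
const REQUIRED_FIELDS = ["AssessmentName", "ControlSetName", "ControlIdName", "payload"] as const;

type EvidenceField = (typeof REQUIRED_FIELDS)[number];

// Hypothetical client-side form: payload holds the local file path to upload.
type EvidenceForm = Partial<Record<EvidenceField, string>>;

// Return the required fields that are missing or blank, in declaration order.
function missingFields(form: EvidenceForm): EvidenceField[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = form[field];
    return value === undefined || value.trim() === "";
  });
}

// Build the invoke URL of an API Gateway REST API stage.
function invokeUrl(restApiId: string, region: string, stage: string = "Prod"): string {
  return `https://${restApiId}.execute-api.${region}.amazonaws.com/${stage}`;
}

const form: EvidenceForm = {
  AssessmentName: "GxP21cfr11",
  ControlSetName: "General requirements",
  ControlIdName: "11.100(a)",
};

// Reports that "payload" is still missing.
console.log(missingFields(form));
```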

As a refresher on Audit Manager concepts: evidence is collected for a particular control. Controls are grouped into control sets. Control sets can be grouped into a particular framework. The assessment is considered an implementation, or an instance, of the framework. For more information, see AWS Audit Manager concepts and terminology.

To clean up the deployed solution

To clean up the solution, use the following commands to delete the AWS CloudFormation stack and your S3 bucket. Be sure to replace <YourStackId> and <DOC-EXAMPLE-BUCKET> with your own values.

aws cloudformation delete-stack --stack-name <YourStackId>
aws s3 rb s3://<DOC-EXAMPLE-BUCKET> --force

Conclusion

This solution enables better coordination between your software delivery organization and your compliance professionals, so that your organization can continuously deliver new updates without overwhelming your security professionals with manual audit review tasks.

Next steps

There are various ways to extend this solution.

Update the API Lambda implementation to be a webhook for your favorite software development lifecycle (SDLC) or IT service management (ITSM) solution.
Modify the steps within the Step Functions state machine to more closely match your unique compliance processes.
Use AWS CodePipeline to start Step Functions state machines natively, or integrate a variation of this solution with any continuous compliance workflow that you have.

Learn more about AWS Audit Manager, DevOps, and AWS for Health, and start building!

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
