Tag Archives: automation

How to automatically build forensic kernel modules for Amazon Linux EC2 instances

2022-09-26 Jonathan Nguyen

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/how-to-automatically-build-forensic-kernel-modules-for-amazon-linux-ec2-instances/

In this blog post, we will walk you through the EC2 forensic module factory solution to deploy automation to build forensic kernel modules that are required for Amazon Elastic Compute Cloud (Amazon EC2) incident response automation.

When an EC2 instance is suspected to have been compromised, it’s strongly recommended to investigate what happened to the instance. You should look for activities such as:

Open network connections
List of running processes
Processes that contain injected code
Memory-resident infections
Other forensic artifacts

When an EC2 instance is compromised, it’s important to take action as quickly as possible. Before you shut down the EC2 instance, you first need to capture the contents of its volatile memory (RAM) in a memory dump because it contains the instance’s in-progress operations. This is key in determining the root cause of compromise.

In order to capture volatile memory in Linux, you can use a tool like Linux Memory Extractor (LiME). This requires you to have the kernel modules that are specific to the kernel version of the instance for which you want to capture volatile memory. We also recommend that you limit the actions you take on the instance where you are trying to capture the volatile memory in order to minimize the set of artifacts created as part of the capture process, so you need a method to build the tools for capturing volatile memory outside the instance under investigation. After you capture the volatile memory, you can use a tool like Volatility2 to analyze it in a dedicated forensics environment. You can use tools like LiME and Volatility2 on EC2 instances that use x86, x64, and Graviton instance types.

Prerequisites

This solution has the following prerequisites:

The Target EC2 instance is using Amazon Linux 2 operating system.
An AWS Identity and Access Management (IAM) role with permissions to deploy the required resources in an AWS account. More details about these permissions follow in the next section.

Solution overview

The EC2 forensic module factory solution consists of the following resources:

One AWS Step Functions workflow
Two AWS Lambda functions
One AWS Systems Manager document (SSM document)

Important: The SSM document clones the LiME and Volatility2 GitHub repositories, and these tools use version 2.0 of the GNU General Public License. This SSM document can be updated to include your preferred tools, like fmem or Volatility3, for forensic analysis and capture.
One Amazon Simple Storage Service (Amazon S3) bucket
One Amazon Virtual Private Cloud (Amazon VPC)
One security group for the EC2 instance that is provisioned during the automation
The solution uses the following VPC endpoints for AWS services:
- ec2_endpoint
- ec2_msg_endpoint
- kms_endpoint
- ssm_endpoint
- ssm_msg_endpoint
- s3_endpoint

Figure 1 shows an overview of the EC2 forensic module factory solution workflow.

Figure 1: Automation to build forensic kernel modules for an Amazon Linux EC2 instance

The EC2 forensic module factory solution workflow in Figure 1 includes the following numbered steps:

A Step Functions workflow is started, which creates a Step Functions task token and invokes the first Lambda function, createEC2module, to create EC2 forensic modules.
1. A Step Functions task token is used to allow long-running processes to complete and to avoid a Lambda timeout error. The createEC2module function runs for approximately 9 minutes. The run time for the function can vary depending on any customizations to the createEC2module function or the SSM document.
The createEC2module function launches an EC2 instance based on the Amazon Machine Image (AMI) provided.
Once the EC2 instance is running, an SSM document is run, which includes the following steps:
1. If a specific kernel version is provided in step 1, this kernel version will be installed on the EC2 instance. If no kernel version is provided, the default kernel version on the EC2 instance will be used to create the modules.
2. If a specific kernel version was selected and installed, the system is rebooted to use this kernel version.
3. The prerequisite build tools are installed, as well as the LiME and Volatility2 packages.
4. The LiME kernel module and the Volatility2 profile are built.
The kernel modules for LiME and Volatility2 are put into the S3 bucket.
Upon completion, the Step Functions task token is sent to the Step Functions workflow to invoke the second cleanupEC2module Lambda function to terminate the EC2 instance that was launched in step 2.

Solution deployment

You can deploy the EC2 forensic module factory solution by using either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

Option 1: Deploy the solution with AWS CloudFormation (console)

Sign in to your preferred security tooling account in the AWS Management Console, and choose the following Launch Stack button to open the AWS CloudFormation console pre-loaded with the template for this solution. It will take approximately 10 minutes for the CloudFormation stack to complete.

Option 2: Deploy the solution by using the AWS CDK

You can find the latest code for the EC2 forensic module factory solution in the ec2-forensic-module-factory GitHub repository, where you can also contribute to the sample code. For instructions and more information on using the AWS CDK, see Get Started with AWS CDK.

To deploy the solution by using the AWS CDK

To build the app when navigating to the project’s root folder, use the following commands.
npm install -g aws-cdk npm install
Run the following commands in your terminal while authenticated in your preferred security tooling AWS account. Be sure to replace <INSERT_AWS_ACCOUNT> with your account number, and replace <INSERT_REGION> with the AWS Region that you want the solution deployed to.
cdk bootstrap aws://<INSERT_AWS_ACCOUNT>/<INSERT_REGION> cdk deploy

Run the solution to build forensic kernel objects

Now that you’ve deployed the EC2 forensic module factory solution, you need to invoke the Step Functions workflow in order to create the forensic kernel objects. The following is an example of manually invoking the workflow, to help you understand what actions are being performed. These actions can also be integrated and automated with an EC2 incident response solution.

To manually invoke the workflow to create the forensic kernel objects (console)

In the AWS Management Console, sign in to the account where the solution was deployed.
In the AWS Step Functions console, select the state machine named create_ec2_volatile_memory_modules.
Choose Start execution.
At the input prompt, enter the following JSON values.
{ "AMI_ID": "ami-0022f774911c1d690", "kernelversion":"kernel-4.14.104-95.84.amzn2.x86_64" }
Choose Start execution to start the workflow, as shown in Figure 2.

Figure 2: Step Functions step input example to build custom kernel version using Amazon Linux 2 AMI ID

Workflow progress

You can use the AWS Management Console to follow the progress of the Step Functions workflow. If the workflow is successful, you should see the image when you view the status of the Step Functions workflow, as shown in Figure 3.

Figure 3: Step Functions workflow success example

Note: The Step Functions workflow run time depends on the commands that are being run in the SSM document. The example SSM document included in this post runs for approximately 9 minutes. For information about possible Step Functions errors, see Error handling in Step Functions.

To verify that the artifacts are built

After the Step Functions workflow has successfully completed, go to the S3 bucket that was provisioned in the EC2 forensic module factory solution.
Look for two prefixes in the bucket for LiME and Volatility2, as shown in Figure 4.

Figure 4: S3 bucket prefix for forensic kernel modules
Open each tool name prefix in S3 to find the actual module, such as in the following examples:
- LiME example: lime-4.14.104-95.84.amzn2.x86_64.ko
- Volatility2 example: 4.14.104-95.84.amzn2.x86_64.zip

Now that the objects have been created, the solution has successfully completed.

Incorporate forensic module builds into an EC2 AMI pipeline

Each organization has specific requirements for allowing application teams to use various EC2 AMIs, and organizations commonly implement an EC2 image pipeline using tools like EC2 Image Builder. EC2 Image Builder uses recipes to install and configure required components in the AMI before application teams can launch EC2 instances in their environment.

The EC2 forensic module factory solution we implemented here makes use of an existing EC2 instance AMI. As mentioned, the solution uses an SSM document to create forensic modules. The logic in the SSM document could be incorporated into your EC2 image pipeline to create the forensic modules and store them in an S3 bucket. S3 also allows additional layers of protection such as enforcing default bucket encryption with an AWS Key Management Service Customer Managed Key (CMK), verifying S3 object integrity with checksum, S3 Object Lock, and restrictive S3 bucket policies. These protections can help you to ensure that your forensic modules have not been modified and are only accessible by authorized entities.

It is important to note that incorporating forensic module creation into an EC2 AMI pipeline will build forensic modules for the specific kernel version used in that AMI. You would still need to employ this EC2 forensic module solution to build a specific forensic module version if it is missing from the S3 bucket where you are creating and storing these forensic modules. The need to do this can arise if the EC2 instance is updated after the initial creation of the AMI.

Incorporate the solution into existing EC2 incident response automation

There are many existing solutions to automate incident response workflow for quarantining and capturing forensic evidence for EC2 instances, but the majority of EC2 incident response automation solutions have a single dependency in common, which is the use of specific forensic modules for the target EC2 instance kernel version. The EC2 forensic module factory solution in this post enables you to be both proactive and reactive when building forensic kernel modules for your EC2 instances.

You can use the EC2 forensic module factory solution in two different ways:

Ad-hoc – In this post, you walked through the solution by running the Step Functions workflow with specific parameters. You can do this to build a repository of kernel modules.
Automated – Alternatively, you can incorporate this solution into existing automation by invoking the Step Functions workflow and passing the AMI ID and kernel version. An example could be the following:
1. An existing EC2 incident response solution attempts to get the forensic modules to capture the volatile memory from an S3 bucket.
2. If the specific kernel version is missing in the S3 bucket, the solution updates the automation to StartExecution on the create_ec2_volatile_memory_modules state machine.
3. The Step Functions workflow builds the specific forensic modules.
4. After the Step Functions workflow is complete, the EC2 incident response solution restarts its workflow to get the forensic modules to capture the volatile memory on the EC2 instance.

Now that you have the kernel modules, you can both capture the volatile memory by using LiME, and then conduct analysis on the memory dump by using a Volatility2 profile.

To capture and analyze volatile memory on the target EC2 instance (high-level steps)

Copy the LiME module from the S3 bucket holding the module repository to the target EC2 instance.
Capture the volatile memory by using the LiME module.
Stream the volatile memory dump to a S3 bucket.
Launch an EC2 forensic workstation instance, with Volatility2 installed.
Copy the Volatility2 profile from the S3 bucket to the appropriate location.
Copy the volatile memory dump to the EC2 forensic workstation.
Run analysis on the volatile memory with Volatility2 by using the specific Volatility2 profile created for the target EC2 instance.

Automated self-service AWS solution

AWS has also released the Automated Forensics Orchestrator for Amazon EC2 solution that you can use to quickly set up and configure a dedicated forensics orchestration automation solution for your security teams. The Automated Forensics Orchestrator for Amazon EC2 allows you to capture and examine the data from EC2 instances and attached Amazon Elastic Block Store (Amazon EBS) volumes in your AWS environment. This data is collected as forensic evidence for analysis by the security team.

The Automated Forensics Orchestrator for Amazon EC2 creates the foundational components to enable the EC2 forensic module factory solution’s memory forensic acquisition workflow and forensic investigation and reporting service. Both the Automated Forensics Orchestrator for Amazon EC2, and the EC2 forensic module factory, are hosted in different GitHub projects. And you will need to reconcile the expected S3 bucket locations for the associated modules:

Automated Forensics Orchestrator for Amazon EC2 modules: S3 bucket location for LiME and S3 bucket location for Volatility2
EC2 forensic module factory modules: S3 bucket location for LiME and S3 bucket location for Volatility2

Customize the EC2 forensic module factory solution

The SSM document pulls open-source packages to build tools for the specific Linux kernel version. You can update the SSM document to your specific requirements for forensic analysis, including expanding support for other operating systems, versions, and tools.

You can also update the S3 object naming convention and object tagging, to allow external solutions to reference and copy the appropriate kernel module versions to enable the forensic workflow.

Clean up

If you deployed the EC2 forensic module factory solution by using the Launch Stack button in the AWS Management Console or the CloudFormation template ec2_module_factory_cfn, do the following to clean up:

In the AWS CloudFormation console for the account and Region where you deployed the solution, choose the Ec2VolModules stack.
Choose the option to Delete the stack.

If you deployed the solution by using the AWS CDK, run the following command.

cdk destroy

Conclusion

In this blog post, we walked you through the deployment and use of the EC2 forensic module factory solution to use AWS Step Functions, AWS Lambda, AWS Systems Manager, and Amazon EC2 to create specific versions of forensic kernel modules for Amazon Linux EC2 instances.

The solution provides a framework to create the foundational components required in an EC2 incident response automation solution. You can customize the solution to your needs to fit into an existing EC2 automation, or you can deploy this solution in tandem with the Automated Forensics Orchestrator for Amazon EC2.

If you have feedback about this post, submit comments in the Comments section below. If you have any questions about this post, start a thread on re:Post.

Want more AWS Security news? Follow us on Twitter.

Automatic rule backtesting with large quantities of data

2022-09-08 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/automatic-rule-backtesting

Introduction

Analysts need to analyse and simulate a rule on historical data to check the performance and accuracy of the rule. Backtesting enables analysts to run simulations of the rules and manage the results from the rule engine UI.

Backtesting helps analysts to:

Define the desired impact of the rule for our business and users.
Evaluate the accuracy of the rule based on historical data.
Compare and analyse results with data points, such as known false positives, user segments, risk profile of a user or transaction, and so on.

Currently, the analytics process to test performance of a rule is not standardised, and is inaccurate and inefficient. Analysts from different teams have different approaches:

Offline process using Presto tables. This process is lengthy and inaccurate.
Offline process based on the rule engine payload. The setup takes time, and the process is not streamlined.
Running rules in shadow mode. This process takes days to get the desired result.
A team in Grab uses different rule engines to manage rules and do backtesting. This doubles the effort for analysts and engineers.

In our vision for backtesting, it should allow analysts to:

Efficiently run and manage their jobs.
Create custom metrics, reports and dimensions for backtesting.
Add external data points and metrics to do a deep dive.

For the purpose of establishing a minimum viable product (MVP), backtesting will support basic capabilities and enable analysts to access required metrics and data points. Thus, analysts can:

Run backtesting jobs from the rule engine UI.
Get fixed reports and dimensions for every checkpoint.
Get access to relevant data to analyse backtesting results.

Background

Assume a simple use case: A rule to detect the transaction risk.

Each transaction has a transaction_id, user_id, currency, amount, timestamp. The rule engine also provides a treatment (Approve or Decline) based on the rule logic for the transaction.

In this specific use case, we would like to see what will be the aggregation number of the total transactions, total distinct users, and the sum of the amount, based on the dimensions of date, treatment, and currency in the last couple of weeks.

The result may look like the following data:

Dimension	Dimension	Dimension	metric	metric	metric
Date	Treatment	Currency	Total tx	Distinct user	Total amount
2020-05-1	Approve	SGD	100	80	10020
2020-05-1	Decline	SGD	50	40	450
2020-05-1	Approve	MYR	110	100	1200
2020-05-1	Decline	MYR	30	15	400

* This data does not reflect actual Grab data and is for illustrative purposes only.

Solution

Use a cloud-agnostic Spark-based data pipeline to replay any existing or proposed rule to check performance.
Use a Web Portal to:
- Create or select a rule to replay, with replay time range.
- Display and download the result, such as total events and hit counts.
Replay any existing or proposed rule for checking performance.
Allow users to create or select a rule to replay in the rule engine UI, with provided replay time range.
Display the replay result in the rule engine UI, such as total events and hit counts.
Provide a way to download all testing results in the rule engine UI (for example, all rule responses).
Remove dependency on the specific cloud provider stack, so other teams in Grab can use it instead of Google Cloud Platform (GCP).

Architecture details

The rule editor UI reacts to the user input. Its engine sends a job command to the Amazon Simple Queue Service (SQS) to initialise the job. After that, the rule editor also performs the following processes in the background:

Lambda listens to the request SQS queue and invokes a job via the Spark jobs API.
The job fetches the executable artifacts, data source. After the job is completed, the job script saves the result sheet as required to S3.
The Spark script pushes the job final status (success, failure, timeout) through the shutdown hook to respond to the SQS queue.
The rule editor engine listens to response callback messages, and processes the job metadata to the database, or sends notifications.
The rule editor displays the job metadata on the UI.
The package pipeline builds and deploys the executable artifacts to S3 as a manageable structure.
The Spark script takes the filter logic as its input parameters.

Workflow

Historical data preparation

The historical events are published by the rule engine through Kafka, and stored into the S3 bucket based on time. The Backtesting system then fetches these data for testing based on the time range requested.

By using a Kubernetes stream pipeline, we also save the trust inference stream to Trust AWS subaccount. With the customer bucket and file format, we can improve the efficiency of the data processing, and also avoid any delay from the data lake.

Engineering specifications

Target location:

    s3a://stg-trust-inference-event/<engine-name>/<predict-name>/<YYYY>/MM/DD/hh/mm/ss/<000001>.snappy.parquet
    s3a://prd-trust-inference-event/<engine-name>/<predict-name>/<YYYY>/MM/DD/hh/mm/ss/<000001>.snappy.parquet

Description: Following the fields of steam definition, the engine name would be ruleengine, or catwalk. The predict-name would be preride (checkpoint name), or cnpu (model name).

File Format: avro
File Compression: Snappy
There is no auto retention on sub-account S3. We will implement the archive process in the future.
The default pipeline and the new pipeline will run in parallel until the Data Engineering team is ready to retire the default pipeline.

Backtesting

Upon scheduling, the Backtesting Portal sends a message to SQS, which is then captured by the listening Lambda.
Lambda invokes a Spark job over the AWS elastic mapreduce engine (EMR).
The EMR engine fetches the executable artifacts containing the rule script and historical data from S3, and starts a Spark job to apply the rule script over historical data. Depending on the size of data, the Spark cluster will scale automatically to ensure timely completion.
Once completed, a report file is generated and available on Backtesting UI.

UI

Learnings and conclusions

After the release, here’s what our data analysers had to say:

For trust analysts, testing a rule on historical data happens outside the rule engine UI and is not user-friendly, leading to analysts wasting significant time.
For financial analysts, as analysts migrate to the rule engine UI, the existing solution will be deprecated with no other solution.
An alternative to simulate a rule; we no longer need to run a rule in shadow mode because we can use historical data to determine the outcome. This new approach saves us weeks of effort on the rule onboarding process.

What’s next?

The underlying Spark jobs in this tool were developed by knowledgeable data engineers, which is a disadvantage because it requires a high level of expertise to modify the analytics. To mitigate this restriction, we are looking into using domain-specific language (DSL) to allow users to input desired attributes and dimensions, and provide the job release pipeline for self-serving jobs.

Thanks to Jia Long Loh for the support on the offline infrastructure engineering.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How to automate updates for your domain list in Route 53 Resolver DNS Firewall

2022-09-01 Guillaume Neau

Post Syndicated from Guillaume Neau original https://aws.amazon.com/blogs/security/how-to-automate-updates-for-your-domain-list-in-route-53-resolver-dns-firewall/

Note: This post includes links to third-party websites. AWS is not responsible for the content on those websites.

Following the release of Amazon Route 53 Resolver DNS Firewall, Amazon Web Services (AWS) published several blog posts to help you protect your Amazon Virtual Private Cloud (Amazon VPC) DNS resolution, including How to Get Started with Amazon Route 53 Resolver DNS Firewall for Amazon VPC and Secure your Amazon VPC DNS resolution with Amazon Route 53 Resolver DNS Firewall. Route 53 Resolver DNS Firewall provides managed domain lists that are fully maintained and kept up-to-date by AWS and that directly benefit from the threat intelligence that we gather, but you might want to create or import your own list to have full control over the DNS filtering.

In this blog post, you will find a solution to automate the management of your domain list by using AWS Lambda, Amazon EventBridge, and Amazon Simple Storage Service (Amazon S3). The solution in this post uses, as an example, the URLhaus open Response Policy Zone (RPZ) list, which generates a new file every five minutes.

Architecture overview

The solution is made of the following four components, as shown in Figure 1.

An EventBridge scheduled rule to invoke the Lambda function on a schedule.
A Lambda function that uses the AWS SDK to perform the automation logic.
An S3 bucket to temporarily store the list of domains retrieved.
Amazon Route 53 Resolver DNS Firewall.

Figure 1: Architecture overview

After the solution is deployed, it works as follows:

The scheduled rule invokes the Lambda function every 5 minutes to fetch the latest domain list available.
The Lambda function fetches the list from URLhaus, parses the data retrieved, formats the data, uploads the list of domains into the S3 bucket, and invokes the Route 53 Resolver DNS Firewall importFirewallDomains API action.
The domain list is then updated.

Implementation steps

As a first step, create your own domain list on the Route 53 Resolver DNS Firewall. Having your own domain list allows you to have full control of the list of domains to which you want to apply actions, as defined within rule groups.

To create your own domain list

In the Route 53 console, in the left menu, choose Domain lists in the DNS firewall section.
Choose the Add domain list button, enter a name for your owned domain list, and then enter a placeholder domain to initialize the domain list.
Choose Add domain list to finalize the creation of the domain list.

Figure 2: Expected view of the console

The list from URLhaus contains more than a thousand records. You will use the ImportFirewallDomains endpoint to upload this list to DNS Firewall. The use of the ImportFirewallDomains endpoint requires that you first upload the list of domains and make the list available in an S3 bucket that is located in the same AWS Region as the owned domain list that you just created.

To create the S3 bucket

In the S3 console, choose Create bucket.
Under General configuration, configure the AWS Region option to be the same as the Region in which you created your domain list.
Finalize the configuration of your S3 bucket, and then choose Create bucket.

Because a new file is created every five minutes, we recommend setting a lifecycle rule to automatically expire and delete files after 24 hours to optimize for cost and only save the most recent lists.

To create the Lambda function

Follow the steps in the topic Creating an execution role in the IAM console to create an execution role. After step 4, when you configure permissions, choose Create Policy, and then create and add an IAM policy similar to the following example. This policy needs to:

Allow the Lambda function to put logs in Amazon CloudWatch.
Allow the Lambda function to have read and write access to objects placed in the created S3 bucket.
Allow the Lambda function to update the firewall domain list.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:<region>:<accountId>:*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::<DNSFW-BUCKET-NAME>/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "route53resolver:ImportFirewallDomains"
            ],
            "Resource": "arn:aws:route53resolver:<region>:<accountId>:firewall-domain-list/<domain-list-id>",
            "Effect": "Allow"
        }
    ]
}

(Optional) If you decide to use the example provided by AWS:
- After cloning the repository: Build the layer following the instruction included in the readme.md and the provided script.
- Zip the lambda.
- In the left menu, select Layers then Create Layer. Enter a name for the layer, then select Upload a .zip file. Choose to upload the layer (node-axios-layer.zip).
- As a compatible runtime, select: Node.js 16.x.
- Select Create
In the Lambda console, in the same Region as your domain list, choose Create function, and then do the following:
- Choose your desired runtime and architecture.
- (Optional) To use the code provided by AWS: Select Node.js 16.x as the runtime.
- Choose Change the default execution role.
- Choose Use an existing role, and then pick the role that you just created.
After the Lambda function is created, in the left menu of the Lambda console, choose Functions, and then select the function you created.
- For Code source, you can either enter the code of the Lambda function or choose the Upload from button and then choose the source for the code. AWS provides an example of functioning code on GitHub under a MIT-0 license.
(optional) To use the code provided by AWS:
- Choose the Upload from button and upload the zipped code example.
- After the code is uploaded, edit the default Runtime settings: Choose the Edit button and set the handler to be equal to: LambdaRpz.handler
- Edit the default Layers configuration, choose the Add a layer button, select Specify an ARN and enter the ARN of the layer created during the optional step 2.
- Edit the environment variables of the function: Select the Edit button and define the three following variables:
  1. Key : FirewallDomainListId | Value : <domain-list-id>
  2. Key : region | Value : <region>
  3. Key : s3Prefix | Value : <DNSFW-BUCKET-NAME>

The code that you place in the function will be able to fetch the list from URLhaus, upload the list as a file to S3, and start the import of domains.

For the Lambda function to be invoked every 5 minutes, next you will create a scheduled rule with Amazon EventBridge.

To automate the invoking of the Lambda function

In the EventBridge console, in the same AWS Region as your domain list, choose Create rule.
For Rule type, choose Schedule.
For Schedule pattern, select the option A schedule that runs at a regular rate, such as every 10 minutes, and under Rate expression set a rate of 5 minutes.

Figure 3: Console view when configuring a schedule
To select the target, choose AWS service, choose Lambda function, and then select the function that you previously created.

After the solution is deployed, your domain list will be updated every 5 minutes and look like the view in Figure 4.

Figure 4: Console view of the created domain list after it has been updated by the Lambda function

Code samples

You can use the samples in the amazon-route-53-resolver-firewall-automation-examples-2 GitHub repository to ease the automation of your domain list, and the associated updates. The repository contains script files to help you with the deployment process of the AWS CloudFormation template. Note that you need to have the AWS Command Line Interface (AWS CLI) installed and properly configured in order to use the files.

To deploy the CloudFormation stack

If you haven’t done so already, create an S3 bucket to store the artifacts in the Region where you wish to deploy. This name of this bucket will then be referenced as ParamS3ArtifactBucket with a value of <DOC-EXAMPLE-BUCKET-ARTIFACT>
Clone the repository locally.
git clone https://github.com/aws-samples/amazon-route-53-resolver-firewall-automation-examples-2
Build the Lambda function layer. From the /layer folder, use the provided script.
. ./build-layer.sh
Zip and upload the artifact to the bucket created in step 1. From the root folder, use the provided script.
. ./zipupload.sh <ParamS3ArtifactBucket>
Deploy the AWS CloudFormation stack by using either the AWS CLI or the CloudFormation console.
- To deploy by using the AWS CLI, from the root folder, type the following command, making sure to replace <region>, <DOC-EXAMPLE-BUCKET-ARTIFACT>, <DNSFW-BUCKET-NAME>, and <DomainListName>with your own values.
```
aws --region <region> cloudformation create-stack --stack-name DNSFWStack --capabilities CAPABILITY_NAMED_IAM --template-body file://./DNSFWStack.cfn.yaml --parameters ParameterKey=ParamS3ArtifactBucket,ParameterValue=<DOC-EXAMPLE-BUCKET-ARTIFACT> ParameterKey=ParamS3RpzBucket,ParameterValue=<DNSFW-BUCKET-NAME> ParameterKey=ParamFirewallDomainListName,ParameterValue=<DomainListName>
```
- To deploy by using the console, do the following:
  1. In the CloudFormation console, choose Create stack, and then choose With new resources (standard).
  2. On the creation screen, choose Template is ready, and upload the provided DNSFWStack.cfn.yaml file.
  3. Enter a stack name and configure the requested parameters with your desired configuration and outcomes. These parameters include the following:
    - The name of your firewall domain list.
    - The name of the S3 bucket that contains Lambda artifacts.
    - The name of the S3 bucket that will be created to contain the files with the domain information from URLhaus.
  4. Acknowledge that the template requires IAM permission because it will create the role for the Lambda function and manage its IAM policy, and then choose Create stack.

After a few minutes, all the resources should be created and the CloudFormation stack is now deployed. After 5 minutes, your domain list should be updated, as shown in Figure 5.

Figure 5: Console view of CloudFormation after the stack has been deployed

Conclusions and cost

In this blog post, you learned about creating and automating the update of a domain list that you fully control. To go further, you can extend and replicate the architecture pattern to fetch domain names from other sources by editing the source code of the Lambda function.

After the solution is in place, in order for the filtering to be effective, you need to create a rule group referencing the domain list and associate the rule group with some of your VPCs.

For cost information, see the AWS Pricing Calculator. This solution will be invoked 60 (minutes) * 24 (hours) * 30 (days) / 5 (minutes) = 8,640 times per month, invoking the Lambda function that will run for an average of 400 minutes, storing an average of 0.5 GB in Amazon S3, and creating a domain list that averages 1,500 domains. According to our public pricing, and without factoring in the AWS Free Tier, this will incur the estimated total cost of $1.43 per month for the filtering of 1 million DNS requests.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

How we automated FAQ responses at Grab

2022-07-13 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/automated-faq

Overview and initial analysis

Knowledge management is often one of the biggest challenges most companies face internally. Teams spend several working hours trying to either inefficiently look for information or constantly asking colleagues about information already documented somewhere. A lot of time is spent on the internal employee communication channels (in our case, Slack) simply trying to figure out answers to repetitive questions. On our journey to automate the responses to these repetitive questions, we needed first to figure out exactly how much time and effort is spent by on-call engineers answering such repetitive questions.

We soon identified that many of the internal engineering tools’ on-call activities involve answering users’ (internal users) questions on various Slack channels. Many of these questions have already been asked or documented on the wiki. These inquiries hinder on-call engineers’ productivity and affect their ability to focus on operational tasks. Once we figured out that on-call employees spend a lot of time answering Slack queries, we decided on a journey to determine the top questions.

We considered smaller groups of teams for this study and found out that:

The topmost user queries are “How do I do ABC?” or “Is XYZ broken?”.
The second most commonly asked questions revolve around access requests, approvals, or other permissions. The answer to such questions is often URLs to existing documentation.

These findings informed us that we didn’t just need an artificial intelligence (AI) based autoresponder to repetitive questions. We must, in fact, also leverage these channels’ chat histories to identify patterns.

Gathering user votes for shortlisted vendors

In light of saving costs and time and considering the quality of existing solutions already available in the market, we decided not to reinvent the wheel and instead purchase an existing product. And to figure out which product to purchase, we needed to do a comparative analysis. And thus began our vendor comparison journey!

While comparing the feature sets offered by different vendors, we understood that our users need to play a part in this decision-making process. However, sharing our vendor analysis with our users and allowing them to choose the bot of their choice posed several challenges:

Users could be biased towards known bots (from previous experiences).
Users could be biased towards big brands with a preconceived notion that big brands mean better features and better user support.
Users may likely pick the most expensive vendor, assuming that a higher cost means higher efficiency.

To ensure that we receive unbiased feedback, here’s how we opened users up to voting. We highlighted the top features of each vendor’s bot compared to other shortlisted bots. We hid the names of the bots to avoid brand attraction. At a high level, here’s what the categorisation looked like:

Features	Vendor 1 (name hidden)	Vendor 2 (name hidden)	Vendor 3 (name hidden)
Enables crowdsourcing, everyone is incentivised to participate. Participants/SME names are visible. Everyone can access the web UI and see how the responses configured on the bot.		–	–
Lowers discussions on channels by providing easy ways to raise tickets to the team instead of discussing on Slack.	–
Only a specific set of admins (or oncall engineers) feed and maintain the bot thus ensuring information authenticity and reliability.
Easy bot feeding mechanism/web UI to update FAQs.		–
Superior natural language processing capabilities.			–
Please vote	Vendor 1	Vendor 2	Vendor 3

Although none of the options had all the features our users wanted, about 60% chose Vendor 1 (OneBar). From this, we discovered the core features that our users needed while keeping them involved in the decision-making process.

Matching our requirements with available vendors’ feature sets

Although our users made their preferences clear, we still needed to ensure that the feature sets available in the market suited our internal requirements in terms of the setup and the features available in portals that we envisioned replacing. As part of our requirements gathering process, here are some of the critical conditions that became more and more prominent:

An ability to crowdsource Slack discussions/conclusions and save them directly from Slack (preferably with a single command).
An ability to auto-respond to Slack queries without calling the bot manually.
The bot must be able to respond to queries only on the preconfigured Slack channel (not a Slack-wide auto-responder that is already available).
Ability to auto-detect frequently asked questions on the channels would mean less work for platform engineers to feed the bot manually and periodically.
A trusted and secured data storage setup and a responsive customer support team.

Proof of concept

We considered several tools (including some of the tools used by our HR for auto-answering employee questions). We then decided to do a complete proof of concept (POC) with OneBar to check if it fulfils our internal requirements.

These were the phases in which we conducted the POC for the shortlisted vendor (OneBar):

Phase 1: Study the traffic, see what insights OneBar shows and what it could/should potentially show. Then think about how an ideal oncall or support should behave in such an environment. i.e. we could identify specific messages in history and describe what should’ve happened to each one of them.

Phase 2: Create required records in OneBar and configure it to match the desired behaviour as closely as possible.

Phase 3: Let the tool run for a couple of weeks and then evaluate how well it responds to questions, how often people search directly, how much information they add, etc. Onebar adds all these metrics in the app making it easier to monitor activity.

In addition to the Onebar POC, we investigated other solutions and did a thorough vendor comparison and analysis. After running the POC and investigating other vendors, we decided to use OneBar as its features best meet our needs.

Prioritising Slack channels

While we had multiple Slack channels that we’d love to have enabled the shortlisted bot on, our initial contract limited our use of the bot to only 20 channels. We could not use OneBar to auto-scan more than 20 Slack channels.

Users could still chat directly with the bot to get answers to FAQs based on what was fed to the bot’s knowledge base (KB). They could also access the web login, which displays its KB, other valuable features, and additional features for admins/experts.

Slack channels that we enabled the licensed features on were prioritised based on:

Most messages sent on the channel per month, i.e. most active channels.
Most members impacted, i.e. channels with a large member count.

To do this, we used Slack analytics reports and identified the channels that fit our prioritisation criteria.

Change is difficult but often essential

Once we’d onboarded the vendor, we began training and educating employees on using this new Knowledge Management system for all their FAQs. It was a challenge as change is always complex but essential for growth.

A series of tech talks and training conducted across the company and at more minor scales also helped guide users about the bot’s features and capabilities.

At the start, we suffered from a lack of data resulting in incorrect responses from the bot. But as the team became increasingly aware of the features and learned more about its capabilities, the bot’s number of KB items grew, resulting in a much more efficient experience. It took us around one quarter to feed the bot consistently to see accurate and frequent responses from it.

Crowdsourcing our internal glossary

With an increasing number of acronyms and company-specific words emerging each year, the number of acronyms and company-specific abbreviations that new joiners face is immense.

We solved this issue by using the bot’s channel-specific KB feature. We created a specific Slack channel dedicated to storing and retrieving definitions of acronyms and other words. This solution turned out to be a big hit with our users.

And who fed the bot with the terms and glossary items? Who better than our onboarding employees to train the bot to help other onboarders. A targeted campaign dedicated to feeding the bot excited many of our onboarders. They began to play around with the bot’s features and provide it with as many glossary items as possible, thus winning swags!

In a matter of weeks, the user base grew from a couple of hundred to around 3000. This effort was also called out in one of our company-wide All Hands meetings, a big win for our team!

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Automating detection of security vulnerabilities and bugs in CI/CD pipelines using Amazon CodeGuru Reviewer CLI

2022-06-01 Akash Verma

Post Syndicated from Akash Verma original https://aws.amazon.com/blogs/devops/automating-detection-of-security-vulnerabilities-and-bugs-in-ci-cd-pipelines-using-amazon-codeguru-reviewer-cli/

Watts S. Humphrey, the father of Software Quality, had famously quipped, “Every business is a software business”. Software is indeed integral to any industry. The engineers who create software are also responsible for making sure that the underlying code adheres to industry and organizational standards, are performant, and are absolved of any security vulnerabilities that could make them susceptible to attack.

Traditionally, security testing has been the forte of a specialized security testing team, who would conduct their tests toward the end of the Software Development lifecycle (SDLC). The adoption of DevSecOps practices meant that security became a shared responsibility between the development and security teams. Now, development teams can, on their own or as advised by their security team, setup and configure various code scanning tools to detect security vulnerabilities much earlier in the software delivery process (aka “Shift Left”). Meanwhile, the practice of Static code analysis and security application testing (SAST) has become a standard part of the SDLC. Furthermore, it’s imperative that the development teams expect SAST tools that are easy to set-up, seamlessly fit into their DevOps infrastructure, and can be configured without requiring assistance from security or DevOps experts.

In this post, we’ll demonstrate how you can leverage Amazon CodeGuru Reviewer Command Line Interface (CLI) to integrate CodeGuru Reviewer into your Jenkins Continuous Integration & Continuous Delivery (CI/CD) pipeline. Note that the solution isn’t limited to Jenkins, and it would be equally useful with any other build automation tool. Moreover, it can be integrated at any stage of your SDLC as part of the White-box testing. For example, you can integrate the CodeGuru Reviewer CLI as part of your software development process, as well as run it on your dev machine before committing the code.

Launched in 2020, CodeGuru Reviewer utilizes machine learning (ML) and automated reasoning to identify security vulnerabilities, inefficient uses of AWS APIs and SDKs, as well as other common coding errors. CodeGuru Reviewer employs a growing set of detectors for Java and Python to provide recommendations via the AWS Console. Customers that leverage the CodeGuru Reviewer CLI within a CI/CD pipeline also receive recommendations in a machine-readable JSON format, as well as HTML.

CodeGuru Reviewer offers native integration with Source Code Management (SCM) systems, such as GitHub, BitBucket, and AWS CodeCommit. However, it can be used with any SCM via its CLI. The CodeGuru Reviewer CLI is a shim layer on top of the AWS Command Line Interface (AWS CLI) that simplifies the interaction with the tool by handling the uploading of artifacts, triggering of the analysis, and fetching of the results, all in a single command.

Many customers, including Mastercard, are benefiting from this new CodeGuru Reviewer CLI.

“During one of our technical retrospectives, we noticed the need to integrate Amazon CodeGuru recommendations in our build pipelines hosted on Jenkins. Not all our developers can run or check CodeGuru recommendations through the AWS console. Incorporating CodeGuru CLI in our build pipelines acts as an important quality gate and ensures that our developers can immediately fix critical issues.”
Claudio Frattari, Lead DevOps at Mastercard

Solution overview

The application deployment workflow starts by placing the application code on a GitHub SCM. To automate the scenario, we have added GitHub to the Jenkins project under the “Source Code” section. We chose the GitHub option, which would clone the chosen GitHub repository in the Jenkins local workspace directory.

In the build stage of the pipeline (see Figure 1), we configure the appropriate build tool to perform the code build and security analysis. In this example, we will be using Maven as the build tool.

Figure 1: Jenkins pipeline with Amazon CodeGuru Reviewer

In the post-build stage, we configure the CodeGuru Reviewer CLI to generate the recommendations based on the review.

Lastly, in the concluding stage of the pipeline, we’ll be analyzing the JSON results using jq – a lightweight and flexible command-line JSON processor, and then failing the Jenkins job if we encounter observations that are of a “Critical” severity.

Jenkins will trigger the “CodeGuru Reviewer” (see Figure 1) based review process in the post-build stage, i.e., after the build finishes. Furthermore, you can configure other stages, such as automated testing or deployment, after this stage. Additionally, passing the location of the build artifacts to the CLI lets CodeGuru Reviewer perform a more in-depth security analysis. Build artifacts are either directories containing jar files (e.g., build/lib for Gradle or /target for Maven) or directories containing class hierarchies (e.g., build/classes/java/main for Gradle).

Walkthrough

Now that we have an overview of the workflow, let’s dive deep and walk you through the following steps in detail:

Installing the CodeGuru Reviewer CLI
Creating a Jenkins pipeline job
Reviewing the CodeGuru Reviewer recommendations
Configuring CodeGuru Reviewer CLI’s additional options

1. Installing the CodeGuru CLI Wrapper

a. Prerequisites

To run the CLI, we must have Git, Java, Maven, and the AWS CLI installed. Verify that they’re installed on our machine by running the following commands:

java -version 
mvn --version 
aws --version 
git –-version

If they aren’t installed, then download and install Java here (Amazon Corretto is a no-cost, multiplatform, production-ready distribution of the Open Java Development Kit), Maven from here, and Git from here. Instructions for installing AWS CLI are available here.

We would need to create an Amazon Simple Storage Service (Amazon S3) bucket with the prefix codeguru-reviewer-. Note that the bucket name must begin with the mentioned prefix, since we have used the name pattern in the following AWS Identity and Access Management (IAM) permissions, and CodeGuru Reviewer expects buckets to begin with this prefix. Refer to the following section 4(a) “Specifying S3 bucket name” for more details.

Furthermore, we’ll need working credentials on our machine to interact with our AWS account. Learn more about setting up credentials for AWS here. You can find the minimal permissions to run the CodeGuru Reviewer CLI as follows.

b. Required Permissions

To use the CodeGuru Reviewer CLI, we need at least the following AWS IAM permissions, attached to an AWS IAM User or an AWS IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeguru-reviewer:ListRepositoryAssociations",
                "codeguru-reviewer:AssociateRepository",
                "codeguru-reviewer:DescribeRepositoryAssociation",
                "codeguru-reviewer:CreateCodeReview",
                "codeguru-reviewer:DescribeCodeReview",
                "codeguru-reviewer:ListRecommendations",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucket*",
                "s3:List*",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::codeguru-reviewer-*",
                "arn:aws:s3:::codeguru-reviewer-*/*"
            ],
            "Effect": "Allow"
        }
    ]
}

c. CLI installation

Please download the latest version of the CodeGuru Reviewer CLI available at GitHub. Then, run the following commands in sequence:

curl -OL https://github.com/aws/aws-codeguru-cli/releases/download/0.0.1/aws-codeguru-cli.zip
unzip aws-codeguru-cli.zip
export PATH=$PATH:./aws-codeguru-cli/bin

d. Using the CLI

The CodeGuru Reviewer CLI only has one required parameter –root-dir (or just -r) to specify to the local directory that should be analyzed. Furthermore, the –src option can be used to specify one or more files in this directory that contain the source code that should be analyzed. In turn, for Java applications, the –build option can be used to specify one or more build directories.

For a demonstration, we’ll analyze the demo application. This will make sure that we’re all set for when we leverage the CLI in Jenkins. To proceed, first we download and install the sample application, as follows:

git clone https://github.com/aws-samples/amazon-codeguru-reviewer-sample-app
cd amazon-codeguru-reviewer-sample-app
mvn clean compile

Now that we have built our demo application, we can use the aws-codeguru-cli CLI command that we added to the path to trigger the code scan:

aws-codeguru-cli --root-dir ./ --build target/classes --src src --output ./output

For additional assistance on the CLI command, reference the readme here.

2. Creating a Jenkins Pipeline job

CodeGuru Reviewer can be integrated in a Jenkins Pipeline as well as a Freestyle project. In this example, we’re leveraging a Pipeline.

a. Pipeline Job Configuration

Log in to Jenkins, choose “New Item”, then select “Pipeline” option.
Enter a name for the project (for example, “CodeGuruPipeline”), and choose OK.

Figure 2: Creating a new Jenkins pipeline

On the “Project configuration” page, scroll down to the bottom and find your pipeline. In the pipeline script, paste the following script (or use your own Jenkinsfile). The following example is a valid Jenkinsfile to integrate CodeGuru Reviewer with a project built using Maven.

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                // Get code from a GitHub repository
                git clone https://github.com/aws-samples/amazon-codeguru-reviewer-java-detectors.git

                // Run Maven on a Unix agent
                sh "mvn clean compile"

                // To run Maven on a Windows agent, use following
                // bat "mvn -Dmaven.test.failure.ignore=true clean package"
            }
        }
        stage('CodeGuru Reviewer') {
            steps{
                sh 'ls -lsa *'
                sh 'pwd'
                // Here we’re setting an absolute path, but we can 
                // also use JENKINS environment variables
                sh '''
                    export BASE=/var/jenkins_home/workspace/CodeGuruPipeline/amazon-codeguru-reviewer-java-detectors
                    export SRC=${BASE}/src
                    export OUTPUT = ./output
                    /home/codeguru/aws-codeguru-cli/bin/aws-codeguru-cli --root-dir $BASE --build $BASE/target/classes --src $SRC --output $OUTPUT -c $GIT_PREVIOUS_COMMIT:$GIT_COMMIT --no-prompt
                    '''
            }
        }    
        stage('Checking findings'){
            steps{
                // In this example we are stopping our pipline on  
                // detecting Critical findings. We are using jq 
                // to count occurrences of Critical severity 
                sh '''
                CNT = $(cat ./output/recommendations.json |jq '.[] | select(.severity=="Critical")|.severity' | wc -l)'
                if (( $CNT > 0 )); then
                  echo "Critical findings discovered. Failing."
                  exit 1
                fi
                '''
            }
        }
    }
}

Save the configuration and select “Build now” on the side bar to trigger the build process (see Figure 3).

Figure 3: Jenkins pipeline in triggered state

3. Reviewing the CodeGuru Reviewer recommendations

Once the build process is finished, you can view the review results from CodeGuru Reviewer by selecting the Jenkins build history for the most recent build job. Then, browse to Workspace output. The output is available in JSON and HTML formats (Figure 4).

Figure 4: CodeGuru CLI Output

Snippets from the HTML and JSON reports are displayed in Figure 5 and 6 respectively.

In this example, our pipeline analyzes the JSON results with jq based on severity equal to critical and failing the job if there are any critical findings. Note that this output path is set with the –output option. For instance, the pipeline will fail on noticing the “critical” finding at Line 67 of the EventHandler.java class (Figure 5), flagged due to use of an insecure code. Till the time the code is remediated, the pipeline would prevent the code deployment. The vulnerability could have gone to production undetected, in absence of the tool.

Figure 5: CodeGuru HTML Report

Figure 6: CodeGuru JSON recommendations

4. Configuring CodeGuru Reviewer CLI’s additional options

a. Specifying Amazon S3 bucket name and policy

CodeGuru Reviewer needs one Amazon S3 bucket for the CLI to store the artifacts while the analysis is running. The artifacts are deleted after the analysis is completed. The same bucket will be reused for all the repositories that are analyzed in the same account and region (unless specified otherwise by the user). Note that CodeGuru Reviewer expects the S3 bucket name to begin with codeguru-reviewer-. At this time, you can’t use a different naming pattern. However, if you want to use a different bucket name, then you can use the –bucket-name option.

Select the Permissions tab of your S3 bucket. Update the Block public access and add the following S3 bucket policy.

Figure 7: S3 bucket settings

S3 bucket policy:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"PublicRead",
         "Effect":"Allow",
         "Principal":"*",
         "Action":"s3:GetObject",
         "Resource":"[Change to ARN for your S3 bucket]/*"
      }
   ]
}

Note that if you must change the bucket’s name, then you can remove the associated S3 bucket in the AWS console under CodeGuru → CI workflows and select Disassociate Workflow.

b. Analyzing a single commit

The CLI also lets us specify a specific commit range to analyze. This can lead to faster and more cost-effective scans for the incremental code changes, instead of a full repository scan. For example, if we just want to analyze the last commit, we can run:

aws-codeguru-cli -r ./ -s src/main/java -b build/libs -c HEAD^:HEAD --no-prompt

Here, we use the -c option to specify that we only want to analyze the commits between HEAD^ (the previous commit) and HEAD (the current commit). Moreover, we add the –no-prompt option to automatically answer questions by the CLI with yes. This option is useful if we plan to use the CLI in an automated way, such as in our CI/CD workflow.

c. Encrypting artifacts

CodeGuru Reviewer lets us use a customer managed key to encrypt the content of the S3 bucket that is used to store the source and build artifacts. To achieve this, create a customer owned key in AWS Key Management Service (AWS KMS) (see Figure 8).

Figure 8: KMS settings

We must grant CodeGuru Reviewer the permission to decrypt artifacts with this key by adding the following Statement to your Key policy:

{
   "Sid":"Allow CodeGuru to use the key to decrypt artifact",
   "Effect":"Allow",
   "Principal":{
      "AWS":"*"
   },
   "Action":[
      "kms:Decrypt",
      "kms:DescribeKey"
   ],
   "Resource":"*",
   "Condition":{
      "StringEquals":{
         "kms:ViaService":"codeguru-reviewer.amazonaws.com",
         "kms:CallerAccount":[
            "YOUR AWS ACCOUNT ID"
         ]
      }
   }
}

Then, enable server-side encryption for the S3 bucket that we’re using with CodeGuru Reviewer (Figure 9).

S3 bucket settings:

Figure 9: S3 bucket encryption settings

After we enable encryption on the bucket, we must delete all the CodeGuru repository associations that use this bucket, and then recreate them by analyzing the repositories while providing the key (as in the following example, Figure 10):

Figure 10: CodeGuru CI Workflow

Note that the first time you check out your repository, it will always trigger a full repository scan. Consider setting the -c option, as this will allow a commit range.

Cleaning Up

At this stage, you may choose to delete the resources created while following this blog, to avoid incurring any unwanted costs.

Delete Amazon S3 bucket.
Delete AWS KMS key.
Delete the Jenkins installation, if not required further.

Conclusion

In this post, we outlined how you can integrate Amazon CodeGuru Reviewer CLI with the Jenkins open-source build automation tool to perform code analysis as part of your code build pipeline and act as a quality gate. We showed you how to create a Jenkins pipeline job and integrate the CodeGuru Reviewer CLI to detect issues in your Java and Python code, as well as access the recommendations for remediating these issues. We presented an example where you can stop the build upon finding critical violations. Furthermore, we discussed how you can specify a commit range to avoid a full repo scan, and how the S3 bucket used by CodeGuru Reviewer to store artifacts can be encrypted using customer managed keys.

The CodeGuru Reviewer CLI offers you a one-line command to scan any code on your machine and retrieve recommendations. You can run the CLI anywhere where you can run AWS commands. In other words, you can use the CLI to integrate CodeGuru Reviewer into your favourite CI tool, as a pre-commit hook, or anywhere else in your workflow. In turn, you can combine CodeGuru Reviewer with Dynamic Application Security Testing (DAST) and Software Composition Analysis (SCA) tools to achieve a hybrid application security testing method that helps you combine the inside-out and outside-in testing approaches, cross-reference results, and detect vulnerabilities that both exist and are exploitable.

Hopefully, you have found this post informative, and the proposed solution useful. If you need helping hands, then AWS Professional Services can help implement this solution in your enterprise, as well as introduce you to our AWS DevOps services and offerings.

About the Authors

Manage application security and compliance with the AWS Cloud Development Kit and cdk-nag

2022-05-25 Rodney Bozo

Post Syndicated from Rodney Bozo original https://aws.amazon.com/blogs/devops/manage-application-security-and-compliance-with-the-aws-cloud-development-kit-and-cdk-nag/

Infrastructure as Code (IaC) is an important part of Cloud Applications. Developers rely on various Static Application Security Testing (SAST) tools to identify security/compliance issues and mitigate these issues early on, before releasing their applications to production. Additionally, SAST tools often provide reporting mechanisms that can help developers verify compliance during security reviews.

cdk-nag integrates directly into AWS Cloud Development Kit (AWS CDK) applications to provide identification and reporting mechanisms similar to SAST tooling.

This post demonstrates how to integrate cdk-nag into an AWS CDK application to provide continual feedback and help align your applications with best practices.

Overview of cdk-nag

cdk-nag (inspired by cfn_nag) validates that the state of constructs within a given scope comply with a given set of rules. Additionally, cdk-nag provides a rule suppression and compliance reporting system. cdk-nag validates constructs by extending AWS CDK Aspects. If you’re interested in learning more about the AWS CDK Aspect system, then you should check out this post.

cdk-nag includes several rule sets (NagPacks) to validate your application against. As of this post, cdk-nag includes the AWS Solutions, HIPAA Security, NIST 800-53 rev 4, NIST 800-53 rev 5, and PCI DSS 3.2.1 NagPacks. You can pick and choose different NagPacks and apply as many as you wish to a given scope.

cdk-nag rules can either be warnings or errors. Both warnings and errors will be displayed in the console and compliance reports. Only unsuppressed errors will prevent applications from deploying with the cdk deploy command.

You can see which rules are implemented in each of the NagPacks in the Rules Documentation in the GitHub repository.

Walkthrough

This walkthrough will setup a minimal AWS CDK v2 application, as well as demonstrate how to apply a NagPack to the application, how to suppress rules, and how to view a report of the findings. Although cdk-nag has support for Python, TypeScript, Java, and .NET AWS CDK applications, we’ll use TypeScript for this walkthrough.

Prerequisites

For this walkthrough, you should have the following prerequisites:

A local installation of and experience using the AWS CDK.

Create a baseline AWS CDK application

In this section you will create and synthesize a small AWS CDK v2 application with an Amazon Simple Storage Service (Amazon S3) bucket. If you are unfamiliar with using the AWS CDK, then learn how to install and setup the AWS CDK by looking at their open source GitHub repository.

Run the following commands to create the AWS CDK application:

mkdir CdkTest
cd CdkTest
cdk init app --language typescript

Replace the contents of the lib/cdk_test-stack.ts with the following:

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    const bucket = new Bucket(this, 'Bucket')
  }
}

Run the following commands to install dependencies and synthesize our sample app:

npm install
npx cdk synth

You should see an AWS CloudFormation template with an S3 bucket both in your terminal and in cdk.out/CdkTestStack.template.json.

Apply a NagPack in your application

In this section, you’ll install cdk-nag, include the AwsSolutions NagPack in your application, and view the results.

Run the following command to install cdk-nag:

npm install cdk-nag

Replace the contents of the bin/cdk_test.ts with the following:

#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { CdkTestStack } from '../lib/cdk_test-stack';
import { AwsSolutionsChecks } from 'cdk-nag'
import { Aspects } from 'aws-cdk-lib';

const app = new cdk.App();
// Add the cdk-nag AwsSolutions Pack with extra verbose logging enabled.
Aspects.of(app).add(new AwsSolutionsChecks({ verbose: true }))
new CdkTestStack(app, 'CdkTestStack', {});

Run the following command to view the output and generate the compliance report:

npx cdk synth

The output should look similar to the following (Note: SSE stands for Server-side encryption):

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S1: The S3 Bucket has server access logs disabled. The bucket should have server access logging enabled to provide detailed records for the requests that are made to the bucket.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S2: The S3 Bucket does not have public access restricted and blocked. The bucket should have public access restricted and blocked to prevent unauthorized access.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S10: The S3 Bucket does not require requests to use SSL. You can use HTTPS (TLS) to help prevent potential attackers from eavesdropping on or manipulating network traffic using person-in-the-middle or similar attacks. You should allow only encrypted connections over HTTPS (TLS) using the aws:SecureTransport condition on Amazon S3 bucket policies.

Found errors

Note that applying the AwsSolutions NagPack to the application rendered several errors in the console (AwsSolutions-S1, AwsSolutions-S2, AwsSolutions-S3, and AwsSolutions-S10). Furthermore, the cdk.out/AwsSolutions-CdkTestStack-NagReport.csv contains the errors as well:

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Remediating and suppressing errors

In this section, you’ll remediate the AwsSolutions-S10 error, suppress the AwsSolutions-S1 error on a Stack level, suppress the AwsSolutions-S2 error on a Resource level errors, and not remediate the AwsSolutions-S3 error and view the results.

Replace the contents of the lib/cdk_test-stack.ts with the following:

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { NagSuppressions } from 'cdk-nag'

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // The local scope 'this' is the Stack. 
    NagSuppressions.addStackSuppressions(this, [
      {
        id: 'AwsSolutions-S1',
        reason: 'Demonstrate a stack level suppression.'
      },
    ])
    // Remediating AwsSolutions-S10 by enforcing SSL on the bucket.
    const bucket = new Bucket(this, 'Bucket', { enforceSSL: true })
    NagSuppressions.addResourceSuppressions(bucket, [
      {
        id: 'AwsSolutions-S2',
        reason: 'Demonstrate a resource level suppression.'
      },
    ])
  }
}

Run the cdk synth command again:

npx cdk synth

The output should look similar to the following:

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

Found errors

The cdk.out/AwsSolutions-CdkTestStack-NagReport.csv contains more details about rule compliance, non-compliance, and suppressions.

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a stack level suppression.","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a resource level suppression.","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Moreover, note that the resultant cdk.out/CdkTestStack.template.json template contains the cdk-nag suppression data. This provides transparency with what rules weren’t applied to an application, as the suppression data is included in the resources.

{
  "Metadata": {
    "cdk_nag": {
      "rules_to_suppress": [
        {
          "id": "AwsSolutions-S1",
          "reason": "Demonstrate a stack level suppression."
        }
      ]
    }
  },
  "Resources": {
    "BucketDEB6E181": {
      "Type": "AWS::S3::Bucket",
      "UpdateReplacePolicy": "Retain",
      "DeletionPolicy": "Retain",
      "Metadata": {
        "aws:cdk:path": "CdkTestStack/Bucket/Resource",
        "cdk_nag": {
          "rules_to_suppress": [
            {
              "id": "AwsSolutions-S2",
              "reason": "Demonstrate a resource level suppression."
            }
          ]
        }
      }
    },
  ...
  },
  ...
}

Reflecting on the Walkthrough

In this section, you learned how to apply a NagPack to your application, remediate/suppress warnings and errors, and review the compliance reports. The reporting and suppression systems provide mechanisms for the development and security teams within organizations to work together to identify and mitigate potential security/compliance issues. Security can choose which NagPacks developers should apply to their applications. Then, developers can use the feedback to quickly remediate issues. Security can use the reports to validate compliances. Furthermore, developers and security can work together to use suppressions to transparently document exceptions to rules that they’ve decided not to follow.

Advanced usage and further reading

This section briefly covers some advanced options for using cdk-nag.

Unit Testing with the AWS CDK Assertions Library

The Annotations submodule of the AWS CDK assertions library lets you check for cdk-nag warnings and errors without AWS credentials by integrating a NagPack into your application unit tests. Read this post for further information about the AWS CDK assertions module. The following is an example of using assertions with a TypeScript AWS CDK application and Jest for unit testing.

import { Annotations, Match } from 'aws-cdk-lib/assertions';
import { App, Aspects, Stack } from 'aws-cdk-lib';
import { AwsSolutionsChecks } from 'cdk-nag';
import { CdkTestStack } from '../lib/cdk_test-stack';

describe('cdk-nag AwsSolutions Pack', () => {
  let stack: Stack;
  let app: App;
  // In this case we can use beforeAll() over beforeEach() since our tests 
  // do not modify the state of the application 
  beforeAll(() => {
    // GIVEN
    app = new App();
    stack = new CdkTestStack(app, 'test');

    // WHEN
    Aspects.of(stack).add(new AwsSolutionsChecks());
  });

  // THEN
  test('No unsuppressed Warnings', () => {
    const warnings = Annotations.fromStack(stack).findWarning(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(warnings).toHaveLength(0);
  });

  test('No unsuppressed Errors', () => {
    const errors = Annotations.fromStack(stack).findError(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(errors).toHaveLength(0);
  });
});

Additionally, many testing frameworks include watch functionality. This is a background process that reruns all of the tests when files in your project have changed for fast feedback. For example, when using the AWS CDK in JavaScript/Typescript, you can use the Jest CLI watch commands. When Jest watch detects a file change, it attempts to run unit tests related to the changed file. This can be used to automatically run cdk-nag-related tests when making changes to your AWS CDK application.

CDK Watch

When developing in non-production environments, consider using AWS CDK Watch with a NagPack for fast feedback. AWS CDK Watch attempts to synthesize and then deploy changes whenever you save changes to your files. Aspects are run during synthesis. Therefore, any NagPacks applied to your application will also run on save. As in the walkthrough, all of the unsuppressed errors will prevent deployments, all of the messages will be output to the console, and all of the compliance reports will be generated. Read this post for further information about AWS CDK Watch.

Conclusion

In this post, you learned how to use cdk-nag in your AWS CDK applications. To learn more about using cdk-nag in your applications, check out the README in the GitHub Repository. If you would like to learn how to create your own rules and NagPacks, then check out the developer documentation. The repository is open source and welcomes community contributions and feedback.

Author:

Streamlining evidence collection with AWS Audit Manager

2022-03-03 Nicholas Parks

Post Syndicated from Nicholas Parks original https://aws.amazon.com/blogs/security/streamlining-evidence-collection-with-aws-audit-manager/

In this post, we will show you how to deploy a solution into your Amazon Web Services (AWS) account that enables you to simply attach manual evidence to controls using AWS Audit Manager. Making evidence-collection as seamless as possible minimizes audit fatigue and helps you maintain a strong compliance posture.

As an AWS customer, you can use APIs to deliver high quality software at a rapid pace. If you have compliance-focused teams that rely on manual, ticket-based processes, you might find it difficult to document audit changes as those changes increase in velocity and volume.

As your organization works to meet audit and regulatory obligations, you can save time by incorporating audit compliance processes into a DevOps model. You can use modern services like Audit Manager to make this easier. Audit Manager automates evidence collection and generates reports, which helps reduce manual auditing efforts and enables you to scale your cloud auditing capabilities along with your business.

AWS Audit Manager uses services such as AWS Security Hub, AWS Config, and AWS CloudTrail to automatically collect and organize evidence, such as resource configuration snapshots, user activity, and compliance check results. However, for controls represented in your software or processes without an AWS service-specific metric to gather, you need to manually create and provide documentation as evidence to demonstrate that you have established organizational processes to maintain compliance. The solution in this blog post streamlines these types of activities.

Solution architecture

This solution creates an HTTPS API endpoint, which allows integration with other software development lifecycle (SDLC) solutions, IT service management (ITSM) products, and clinical trial management systems (CTMS) solutions that capture trial process change amendment documentation (in the case of pharmaceutical companies who use AWS to build robust pharmacovigilance solutions). The endpoint can also be a backend microservice to an application that allows contract research organizations (CRO) investigators to add their compliance supporting documentation.

In this solution’s current form, you can submit an evidence file payload along with the assessment and control details to the API and this solution will tie all the information together for the audit report. This post and solution is directed towards engineering teams who are looking for a way to accelerate evidence collection. To maximize the effectiveness of this solution, your engineering team will also need to collaborate with cross-functional groups, such as audit and business stakeholders, to design a process and service that constructs and sends the message(s) to the API and to scale out usage across the organization.

To download the code for this solution, and the configuration that enables you to set up auto-ingestion of manual evidence, see the aws-audit-manager-manual-evidence-automation GitHub repository.

Architecture overview

In this solution, you use AWS Serverless Application Model (AWS SAM) templates to build the solution and deploy to your AWS account. See Figure 1 for an illustration of the high-level architecture.

Figure 1. The architecture of the AWS Audit Manager automation solution

The SAM template creates resources that support the following workflow:

A client can call an Amazon API Gateway endpoint by sending a payload that includes assessment details and the evidence payload.
An AWS Lambda function implements the API to handle the request.
The Lambda function uploads the evidence to an Amazon Simple Storage Service (Amazon S3) bucket (3a) and uses AWS Key Management Service (AWS KMS) to encrypt the data (3b).
The Lambda function also initializes the AWS Step Functions workflow.
Within the Step Functions workflow, a Standard Workflow calls two Lambda functions. The first looks for a matching control within an assessment, and the second updates the control within the assessment with the evidence.
When the Step Functions workflow concludes, it sends a notification for success or failure to subscribers of an Amazon Simple Notification Service (Amazon SNS) topic.

Deploy the solution

The project available in the aws-audit-manager-manual-evidence-automation GitHub repository contains source code and supporting files for a serverless application you can deploy with the AWS SAM command line interface (CLI). It includes the following files and folders:

src	Code for the application’s Lambda implementation of the Step Functions workflow. It also includes a Step Functions definition file.
template.yml	A template that defines the application’s AWS resources.

Resources for this project are defined in the template.yml file. You can update the template to add AWS resources through the same deployment process that updates your application code.

Prerequisites

This solution assumes the following:

AWS Audit Manager is enabled.
You have already created an assessment in AWS Audit Manager.
You have the necessary tools to use the AWS SAM CLI (see details in the table that follows).

For more information about setting up Audit Manager and selecting a framework, see Getting started with Audit Manager in the blog post AWS Audit Manager Simplifies Audit Preparation.

The AWS SAM CLI is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. The AWS SAM CLI uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application’s build environment and API.

To use the AWS SAM CLI, you need the following tools:

AWS SAM CLI	Install the AWS SAM CLI
Node.js	Install Node.js 14, including the npm package management tool
Docker	Install Docker community edition

To deploy the solution

Open your terminal and use the following command to create a folder to clone the project into, then navigate to that folder. Be sure to replace <FolderName> with your own value.
mkdir Desktop/<FolderName> && cd $_
Clone the project into the folder you just created by using the following command.
git clone https://github.com/aws-samples/aws-audit-manager-manual-evidence-automation.git
Navigate into the newly created project folder by using the following command.
cd aws-audit-manager-manual-evidence-automation
In the AWS SAM shell, use the following command to build the source of your application.
sam build
In the AWS SAM shell, use the following command to package and deploy your application to AWS. Be sure to replace <DOC-EXAMPLE-BUCKET> with your own unique S3 bucket name.
sam deploy –guided –parameter-overrides paramBucketName=<DOC-EXAMPLE-BUCKET>
When prompted, enter the AWS Region where AWS Audit Manager was configured. For the rest of the prompts, leave the default values.
To activate the IAM authentication feature for API gateway, override the default value by using the following command.
paramUseIAMwithGateway=AWS_IAM

To test the deployed solution

After you deploy the solution, run an invocation like the one below for an assessment (using curl). Be sure to replace <YOURAPIENDPOINT> and <AWS REGION> with your own values.

curl –location –request POST
‘https://<YOURAPIENDPOINT>.execute-api.<AWS REGION>.amazonaws.com/Prod’ \
–header ‘x-api-key: ‘ \
–form ‘payload=@”<PATH TO FILE>”‘ \
–form ‘AssessmentName=”GxP21cfr11″‘ \
–form ‘ControlSetName=”General requirements”‘ \
–form ‘ControlIdName=”11.100(a)”‘

Check to see that your file is correctly attached to the control for your assessment.

Form-data interface parameters

The API implements a form-data interface that expects four parameters:

AssessmentName: The name for the assessment in Audit Manager. In this example, the AssessmentName is GxP21cfr11.
ControlSetName: The display name for a control set within an assessment. In this example, the ControlSetName is General requirements.
ControlIdName: this is a particular control within a control set. In this example, the ControlIdName is 11.100(a).
Payload: this is the file representing evidence to be uploaded.

As a refresher of Audit Manager concepts, evidence is collected for a particular control. Controls are grouped into control sets. Control sets can be grouped into a particular framework. The assessment is considered an implementation, or an instance, of the framework. For more information, see AWS Audit Manager concepts and terminology.

To clean up the deployed solution

To clean up the solution, use the following commands to delete the AWS CloudFormation stack and your S3 bucket. Be sure to replace <YourStackId> and <DOC-EXAMPLE-BUCKET> with your own values.

aws cloudformation delete-stack –stack-name <YourStackId>
aws s3 rb s3://<DOC-EXAMPLE-BUCKET> –force

Conclusion

This solution provides a way to allow for better coordination between your software delivery organization and compliance professionals. This allows your organization to continuously deliver new updates without overwhelming your security professionals with manual audit review tasks.

Next steps

There are various ways to extend this solution.

Update the API Lambda implementation to be a webhook for your favorite software development lifecycle (SDLC) or IT service management (ITSM) solution.
Modify the steps within the Step Functions state machine to more closely match your unique compliance processes.
Use AWS CodePipeline to start Step Functions state machines natively, or integrate a variation of this solution with any continuous compliance workflow that you have.

Learn more AWS Audit Manager, DevOps, and AWS for Health and start building!

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Data pipeline asset management with Dataflow

2022-02-09 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/data-pipeline-asset-management-with-dataflow-86525b3e21ca

by Sam Setegne, Jai Balani, Olek Gorajek

Glossary

asset — any business logic code in a raw (e.g. SQL) or compiled (e.g. JAR) form to be executed as part of the user defined data pipeline.
data pipeline — a set of tasks (or jobs) to be executed in a predefined order (a.k.a. DAG) for the purpose of transforming data using some business logic.
Dataflow — Netflix homegrown CLI tool for data pipeline management.
job — a.k.a task, an atomic unit of data transformation logic, a non-separable execution block in the workflow chain.
namespace — unique label, usually representing a business subject area, assigned to a workflow asset to identify it across all other assets managed by Dataflow (e.g. security).
workflow — see “data pipeline”

Intro

The problem of managing scheduled workflows and their assets is as old as the use of cron daemon in early Unix operating systems. The design of a cron job is simple, you take some system command, you pick the schedule to run it on and you are done. Example:

0 0 * * MON /home/alice/backup.sh

In the above example the system would wake up every Monday morning and execute the backup.sh script. Simple right? But what if the script does not exist in the given path, or what if it existed initially but then Alice let Bob access her home directory and he accidentally deleted it? Or what if Alice wanted to add new backup functionality and she accidentally broke existing code while updating it?

The answers to these questions is something we would like to address in this article and propose a clean solution to this problem.

Let’s define some requirements that we are interested in delivering to the Netflix data engineers or anyone who would like to schedule a workflow with some external assets in it. By external assets we simply mean some executable carrying the actual business logic of the job. It could be a JAR compiled from Scala, a Python script or module, or a simple SQL file. The important thing is that this business logic can be built in a separate repository and maintained independently from the workflow definition. Keeping all that in mind we would like to achieve the following properties for the whole workflow deployment:

Versioning: we want both the workflow definition and its assets to be versioned and we want the versions to be tied together in a clear way.
Transparency: we want to know which version of an asset is running along with every workflow instance, so if there are any issues we can easily identify which version caused the problem and to which one we could revert, if necessary.
ACID deployment: for every scheduler workflow definition change, we would like to have all the workflow assets bundled in an atomic, durable, isolated and consistent manner. This way, if necessary, all we need to know is which version of the workflow to roll back to, and the rest would be taken care of for us.

While all the above goals are our North Star, we also don’t want to negatively affect fast deployment, high availability and arbitrary life span of any deployed asset.

Previous solutions

The basic approach to pulling down arbitrary workflow resources during workflow execution has been known to mankind since the invention of cron, and with the advent of “infinite” cloud storage systems like S3, this approach has served us for many years. Its apparent flexibility and convenience can often fool us into thinking that by simply replacing the asset in the S3 location we can, without any hassle, introduce changes to our business logic. This method often proves very troublesome especially if there is more than one engineer working on the same pipeline and they are not all aware of the other folks’ “deployment process”.

The slightly improved approach is shown on the diagram below.

**Figure 1.** Manually constructed continuous delivery system.

In Figure 1, you can see an illustration of a typical deployment pipeline manually constructed by a user for an individual project. The continuous deployment tool submits a workflow definition with pointers to assets in fixed S3 locations. These assets are then separately deployed to these fixed locations. At runtime, the assets are retrieved from the defined locations in S3 and executed in the runtime container. Despite requiring users to construct the deployment pipeline manually, often by writing their own scripts from scratch, this design works and has been successfully used by many teams for years. That being said, it does have some drawbacks that are revealed as you try to add any amount of complexity to your deployment logic. Let’s discuss a few of them.

Does not consider branch/PR deployments

In any production pipeline, you want the flexibility of having a “safe” alternative deployment logic. For example, you may want to build your Scala code and deploy it to an alternative location in S3 while pushing a sandbox version of your workflow that points to this alternative location. Something this simple gets very complicated very quickly and requires the user to consider a number of things. Where should this alternative location be in S3? Is a single location enough? How do you set up your deployment logic to know when to deploy the workflow to a test or dev environment? Answers to these questions often end up being more custom logic inside of the user’s deployment scripts.

Cannot rollback to previous workflow versions

When you deploy a workflow, you really want it to encapsulate an atomic and idempotent unit of work. Part of the reason for that is the desire for the ability to rollback to a previous workflow version and knowing that it will always behave as it did in previous runs. There can be many reasons to rollback but the typical one is when you’ve recognized a regression in a recent deployment that was not caught during testing. In the current design, reverting to a previous workflow definition in your scheduling system is not enough! You have to rebuild your assets from source and move them to your fixed S3 location that your workflow points to. To enable atomic rollbacks, you can add more custom logic to your deployment scripts to always deploy your assets to a new location and generate new pointers for your workflows to use, but that comes with higher complexity that often just doesn’t feel worth it. More commonly, teams will opt to do more testing to try and catch regressions before deploying to production and will accept the extra burden of rebuilding all of their workflow dependencies in the event of a regression.

Runtime dependency on user-managed cloud storage locations

At runtime, the container must reach out to a user-defined storage location to retrieve the assets required. This causes the user-managed storage system to be a critical runtime dependency. If we zoom out to look at an entire workflow management system, the runtime dependencies can become unwieldy if it relies on various storage systems that are arbitrarily defined by the workflow developers!

Dataflow deployment with asset management

In the attempt to deliver a simple and robust solution to the managed workflow deployments we created a command line utility called Dataflow. It is a Python based CLI + library that can be installed anywhere inside the Netflix environment. This utility can build and configure workflow definitions and their assets during testing and deployment. See below diagram:

**Figure 2**. Dataflow asset management system.

In Figure 2, we show a variation of the typical manually constructed deployment pipeline. Every asset deployment is released to some newly calculated UUID. The workflow definition can then identify a specific asset by its UUID. Deploying the workflow to the scheduling system produces a “Deployment Bundle”. The bundle includes all of the assets that have been referenced by the workflow definition and the entire bundle is deployed to the scheduling system. At every scheduled runtime, the scheduling system can create an instance of your workflow without having to gather runtime dependencies from external systems.

The asset management system that we’ve created for Dataflow provides a strong abstraction over this deployment design. Deploying the asset, generating the UUID, and building the deployment bundle is all handled automatically by the Dataflow build logic. The user does not need to be aware of anything that’s happening on S3, nor that S3 is being used at all! Instead, the user is given a flexible UUID referencing system that’s layered on top of our scheduling system’s workflow DSL. Later in the article we’ll cover this referencing system in some detail. But first, let’s look at an example of deploying an asset and a workflow.

Deployment of an asset

Let’s walk through an example of a workflow asset build and deployment. Let’s assume we have a repository called stranger-data with the following structure:

.
├── dataflow.yaml
├── pyspark-workflow
│ ├── main.sch.yaml
│ └── hello_world
│     ├── ...
│     └── setup.py
└── scala-workflow
    ├── build.gradle
    ├── main.sch.yaml
    └── src
    ├── main
    │   └── ...
    └── test
        └── ...

Let’s now use Dataflow command to see what project components are visible:

stranger-data$ dataflow project list

Python Assets:
 -> ./pyspark-workflow/hello_world/setup.py
Summary: 1 found.

Gradle Assets:
 -> ./scala-workflow/build.gradle
Summary: 1 found.

Scheduler Workflows:
 -> ./scala-workflow/main.sch.yaml
 -> ./pyspark-workflow/main.sch.yaml
Summary: 2found.

Before deploying the assets, and especially if we made any changes to them, we can run unit tests to make sure that we didn’t break anything. In a typical Dataflow configuration this manual testing is optional because Dataflow continuous integration tests will do that for us on any pull-request.

stranger-data$ dataflow project test

Testing Python Assets:
 -> ./pyspark-workflow/hello_world/setup.py... PASSED
Summary: 1 successful, 0 failed.

Testing Gradle Assets:
 -> ./scala-workflow/build.gradle... PASSED
Summary: 1 successful, 0 failed.

Building Scheduler Workflows:
 -> ./scala-workflow/main.sch.yaml... CREATED ./.workflows/scala-workflow.main.sch.rendered.yaml
 -> ./pyspark-workflow/main.sch.yaml... CREATED ./.workflows/pyspark-workflow.main.sch.rendered.yaml
Summary: 2 successful, 0 failed.

Testing Scheduler Workflows:
 -> ./scala-workflow/main.sch.yaml... PASSED
 -> ./pyspark-workflow/main.sch.yaml... PASSED
Summary: 2 successful, 0 failed.

Notice that the test command we use above not only executes unit test suites defined in our Scala and Python sub-projects, but it also renders and statically validates all the workflow definitions in our repo, but more on that later…

Assuming all tests passed, let’s now use the Dataflow command to build and deploy a new version of the Scala and Python assets into the Dataflow asset registry.

stranger-data$ dataflow project deploy

Building Python Assets:
 -> ./pyspark-workflow/hello_world/setup.py... CREATED ./pyspark-workflow/hello_world/dist/hello_world-0.0.1-py3.7.egg
Summary: 1 successful, 0 failed.

Deploying Python Assets:
 -> ./pyspark-workflow/hello_world/setup.py... DEPLOYED AS dataflow.egg.hello_world.user.stranger-data.master.39206ee8.3
Summary: 1 successful, 0 failed.

Building Gradle Assets:
 -> ./scala-workflow/build.gradle... CREATED ./scala-workflow/build/libs/scala-workflow-all.jar
Summary: 1 successful, 0 failed.

Deploying Gradle Assets:
 -> ./scala-workflow/build.gradle... DEPLOYED AS dataflow.jar.scala-workflow.user.stranger-data.master.39206ee8.11
Summary: 1 successful, 0 failed.

...

Notice that the above command:

created a new version of the workflow assets
assigned the asset a “UUID” (consisting of the “dataflow” string, asset type, asset namespace, git repo owner, git repo name, git branch name, commit hash and consecutive build number)
and deployed them to a Dataflow managed S3 location.

We can check the existing assets of any given type deployed to any given namespace using the following Dataflow command:

stranger-data$ dataflow project list eggs --namespace hello_world --deployed

Project namespaces with deployed EGGS:

hello_world
 -> dataflow.egg.hello_world.user.stranger-data.master.39206ee8.3
 -> dataflow.egg.hello_world.user.stranger-data.master.39206ee8.2
 -> dataflow.egg.hello_world.user.stranger-data.master.39206ee8.1

The above list could come in handy, for example if we needed to find and access an older version of an asset deployed from a given branch and commit hash.

Deployment of a workflow

Now let’s have a look at the build and deployment of the workflow definition which references the above assets as part of its pipeline DAG.

Let’s list the workflow definitions in our repo again:

stranger-data$ dataflow project list workflows

Scheduler Workflows:
 -> ./scala-workflow/main.sch.yaml
 -> ./pyspark-workflow/main.sch.yaml
Summary: 2 found.

And let’s look at part of the content of one of these workflows:

stranger-data$ cat ./scala-workflow/main.sch.yaml
...
dag:
 - ddl -> write
 - write -> audit
 - audit -> publish
jobs:
 - ddl: ...
 - write:
     spark:
       script: ${dataflow.jar.scala-workflow}
       class: com.netflix.spark.ExampleApp
       conf: ...
       params: ...
 - audit: ...
 - publish: ...
...

You can see from the above snippet that the write job wants to access some version of the JAR from the scala-workflow namespace. A typical workflow definition, written in YAML, does not need any compilation before it is shipped to the Scheduler API, but Dataflow designates a special step called “rendering” to substitute all of the Dataflow variables and build the final version.

The above expression ${dataflow.jar.scala-workflow} means that the workflow will be rendered and deployed with the latest version of the scala-workflow JAR available at the time of the workflow deployment. It is possible that the JAR is built as part of the same repository in which case the new build of the JAR and a new version of the workflow may be coming from the same deployment. But the JAR may be built as part of a completely different project and in that case the testing and deployment of the new workflow version can be completely decoupled.

We showed above how one would request the latest asset version available during deployment, but with Dataflow asset management we can distinguish two more asset access patterns. An obvious next one is to specify it by all its attributes: asset type, asset namespace, git repo owner, git repo name, git branch name, commit hash and consecutive build number. There is one more extra method for a middle ground solution to pick a specific build for a given namespace and git branch, which can help during testing and development. All of this is part of the user-interface for determining how the deployment bundle will be created. See below diagram for a visual illustration.

**Figure 3**. A closer at the Deployment Bundle

In short, using the above variables gives the user full flexibility and allows them to pick any version of any asset in any workflow.

An example of the workflow deployment with the rendering step is shown below:

stranger-data$ dataflow project deploy

...

Building Scheduler Workflows:
 -> ./scala-workflow/main.sch.yaml... CREATED ./.workflows/scala-workflow.main.sch.rendered.yaml
 -> ./pyspark-workflow/main.sch.yaml... CREATED ./.workflows/pyspark-workflow.main.sch.rendered.yaml
Summary: 2 successful, 0 failed.

Deploying Scheduler Workflows:
 -> ./scala-workflow/main.sch.yaml… DEPLOYED AS https://hawkins.com/scheduler/sandbox:user.stranger-data.scala-workflow
 -> ./pyspark-workflow/main.sch.yaml… DEPLOYED AS https://hawkins.com/scheduler/sandbox:user.stranger-data.pyspark-workflow
Summary: 2 successful, 0 failed.

And here you can see what the workflow definition looks like before it is sent to the Scheduler API and registered as the latest version. Notice the value of the script variable of the write job. In the original code says ${dataflow.jar.scala-workflow} and in the rendered version it is translated to a specific file pointer:

stranger-data$ cat ./scala-workflow/main.sch.yaml
...
dag:
 - ddl -> write
 - write -> audit
 - audit -> publish
jobs:
 - ddl: ...
 - write:
     spark:
       script: s3://dataflow/jars/scala-workflow/user/stranger-data/master/39206ee8/1.jar
       class: com.netflix.spark.ExampleApp
       conf: ...
       params: ...
 - audit: ...
 - publish: ...
...

User perspective

The Infrastructure DSE team at Netflix is responsible for providing insights into data that can help the Netflix platform and service scale in a secure and effective way. Our team members partner with business units like Platform, OpenConnect, InfoSec and engage in enterprise level initiatives on a regular basis.

One side effect of such wide engagement is that over the years our repository evolved into a mono-repo with each module requiring a customized build, testing and deployment strategy packaged into a single Jenkins job. This setup required constant upkeep and also meant every time we had a build failure multiple people needed to spend a lot of time in communication to ensure they did not step on each other.

Last quarter we decided to split the mono-repo into separate modules and adopt Dataflow as our asset orchestration tool. Post deployment, the team relies on Dataflow for automated execution of unit tests, management and deployment of workflow related assets.

By the end of the migration process our Jenkins configuration went from:

Figure 4. Real example of a deployment script.

to:

cd /dataflow_workspace
dataflow project deploy

The simplicity of deployment enabled the team to focus on the problems they set out to solve while the branch based customization gave us the flexibility to be our most effective at solving them.

Conclusions

This new method available for Netflix data engineers makes workflow management easier, more transparent and more reliable. And while it remains fairly easy and safe to build your business logic code (in Scala, Python, etc) in the same repository as the workflow definition that invokes it, the new Dataflow versioned asset registry makes it easier yet to build that code completely independently and then reference it safely inside data pipelines in any other Netflix repository, thus enabling easy code sharing and reuse.

One more aspect of data workflow development that gets enabled by this functionality is what we call branch-driven deployment. This approach enables multiple versions of your business logic and workflows to be running at the same time in the scheduler ecosystem, and makes it easy, not only for individual users to run isolated versions of the code during development, but also to define isolated staging environments through which the code can pass before it reaches the production stage. Obviously, in order for the workflows to be safely used in that configuration they must comply with a few simple rules with regards to the parametrization of their inputs and outputs, but let’s leave this subject for another blog post.

Credits

Special thanks to Peter Volpe, Harrington Joseph and Daniel Watson for the initial design review.

Data pipeline asset management with Dataflow was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Auto-Diagnosis and Remediation in Netflix Data Platform

2022-01-14 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/auto-diagnosis-and-remediation-in-netflix-data-platform-5bcc52d853d1

By Vikram Srivastava and Marcelo Mayworm

Netflix has one of the most complex data platforms in the cloud on which our data scientists and engineers run batch and streaming workloads. As our subscribers grow worldwide and Netflix enters the world of gaming, the number of batch workflows and real-time data pipelines increases rapidly. The data platform is built on top of several distributed systems, and due to the inherent nature of these systems, it is inevitable that these workloads run into failures periodically. Troubleshooting these problems is not a trivial task and requires collecting logs and metrics from several different systems and analyzing them to identify the root cause. At our scale, even a tiny percentage of disrupted workloads can generate a substantial operational support burden for the data platform team when troubleshooting involves manual steps. And we can’t discount the productivity impact it causes on data platform users.

It motivates us to be proactive in detecting and handling failed workloads in our production environment, avoiding interruptions that could slow down our teams. We have been working on an auto-diagnosis and remediation system called Pensive in the data platform to address these concerns. With the goal of troubleshooting failing and slow workloads and remediating them without human intervention wherever possible. As our platform continues to grow and different scenarios and issues can disrupt the workloads, Pensive has to be proactive in detecting broad problems at the platform level in real-time and diagnosing the impact across the workloads.

Pensive infrastructure comprises two separate systems to support batch and streaming workloads. This blog will explore these two systems and how they perform auto-diagnosis and remediation across our Big Data Platform and Real-time infrastructure.

Batch Pensive

Batch workflows in the data platform run using a Scheduler service that launches containers on the Netflix container management platform called Titus to run workflow steps. These steps launch jobs on clusters running Apache Spark and Presto via Genie. If a workflow step fails, Scheduler asks Pensive to diagnose the step’s error. Pensive collects logs for the failed jobs launched by the step from the relevant data platform components and then extracts the stack traces. Pensive relies on a regular expression based rules engine that has been curated over time. The rules encode information about whether an error is due to a platform issue or a user bug and whether the error is transient or not. If a regular expression from one of the rules matches, then Pensive returns information about that error to the Scheduler. If the error is transient, Scheduler will retry that step with exponential backoff a few more times.

The most critical part of Pensive is the set of rules used to classify an error. We need to evolve them as the platform evolves to ensure that the percentage of errors that Pensive cannot classify remains low. Initially, the rules were added on an ad-hoc basis as requests came in from platform component owners and users. We have now moved to a more systematic approach where unknown errors are fed into a Machine Learning process that performs clustering to propose new regular expressions for commonly occurring errors. We take the proposals to platform component owners to then come up with the classification of the error source and whether it is of transitory nature. In the future, we are looking to automate this process.

Detection of Platform-wide Issues

Pensive does error classification on individual workflow step failures, but by doing real-time analytics on the errors detected by Pensive using Apache Kafka and Apache Druid, we can quickly identify platform issues affecting many workflows. Once the individual diagnoses get stored in a Druid table, our monitoring and alerting system called Atlas does aggregations every minute and sends out alerts if there is a sudden increase in the number of failures due to platform errors. This has led to a dramatic reduction in the time it takes to detect issues in hardware or bugs in recently rolled out data platform software.

Streaming Pensive

Apache Flink powers real-time stream processing jobs in the Netflix data platform. And most of the Flink jobs run under a managed platform called Keystone, which abstracts out the underlying Flink job details and allows users to consume data from Apache Kafka streams and publish them to different data stores like Elasticsearch and Apache Iceberg on AWS S3.

Since the data platform manages keystone pipelines, users expect platform issues to be detected and remediated by the Keystone team without any involvement from their end. Furthermore, data in Kafka streams have a finite retention period, which adds time pressure for resolving the issues to avoid data loss.

For every Flink job running as part of a Keystone pipeline, we monitor the metric indicating how far the Flink consumer lags behind the Kafka producer. If it crosses a threshold, Atlas sends a notification to Streaming Pensive.

Like its batch counterpart, Streaming Pensive also has a rules engine to diagnose errors. However, in addition to logs, Streaming Pensive also has rules for checking various metric values for multiple components in the Keystone pipeline. The issue may occur in the source Kafka stream, the main Flink job, or the sinks to which the Flink job is writing data. Streaming Pensive diagnoses it and tries to remediate the issue automatically when it happens. Some examples where we are able to auto-remediate are:

If Streaming Pensive finds that one or more Flink Task Managers are going out of memory, it can redeploy the Flink cluster with more Task Managers.
If Streaming Pensive finds that there is an unexpected increase in the rate of incoming messages on the source Kafka cluster, it can increase the topic retention size and period so that we don’t lose any data while the consumer is lagging. If the spike goes away after some time, Streaming Pensive can revert the retention changes. Otherwise, it will page the job owner to investigate if there is a bug causing the increased rate or if the consumers need to be reconfigured to handle the higher rate.

Even though we have a high success rate, there are still occasions where automation is not possible. If manual intervention is required, Streaming Pensive will page the relevant component team to take timely action to resolve the issue.

What’s Next?

Pensive has had a significant impact on the operability of the Netflix data platform. And helped engineering teams lower the burden of operations work, freeing them to tackle more critical and challenging problems. But our job is nowhere near done. We have a long roadmap ahead of us. Some of the features and expansions we have planned are:

Batch Pensive is currently diagnosing failed jobs only, and we want to increase the scope into optimization to determine why jobs have become slow.
Auto-configure batch workflows so that they finish successfully or become faster and use fewer resources when possible. One example where it can dramatically help is Spark jobs, where memory tuning is a significant challenge.
Expand Pensive with Machine Learning classifiers.
The streaming platform recently added Data Mesh, and we need to expand Streaming Pensive to cover that.

Acknowledgments

This work could not have been completed without the help of the Big Data Compute and the Real-time Data Infrastructure teams within the Netflix data platform. They have been great partners for us as we work on improving the Pensive infrastructure.

Auto-Diagnosis and Remediation in Netflix Data Platform was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

5 DevOps tips to speed up your developer workflow

2021-11-30 Damian Brady

Post Syndicated from Damian Brady original https://github.blog/2021-11-30-5-devops-tips-to-speed-up-your-developer-workflow/

TL;DR: From learning YAML to scripting with Bash, here are a few simple tips for developers who want to speed up their workflows.

From CI/CD to containerization management and server provisioning, DevOps gets a lot of buzz in tech today. You could even say that it’s a buzz … word.

As a developer, you might be part of a DevOps team, but you’re focused on building great software, not necessarily provisioning servers and managing containers.

Even still, a lot of what developers, DevOps engineers, and IT teams handle in today’s software development life cycle is focused on tools, testing, automations, and server orchestration. And, that’s even more true if you’re a team of one or engaging in a big open source project.

Here are five DevOps tips for any developer looking to work smarter and faster.

Tip #1: A little YAML can make frontend work easier

Initially released in 2001, YAML has become one of the defacto languages for a lot of declarative automation—and it’s commonly used in DevOps and development work for an array of frontend configurations, automations, and more.

YAML, which stands for Yet Another Markup Language, is a superset of JSON and is notable for being a human readable language. That means it focuses less on characters, like brackets, braces, and quotes ({}, [], “).

Here’s why this matters: Learning YAML (or even stepping up your YAML skills) makes it easier to store configurations for your own applications, like your settings in an easy-to-write and easy-to-read language.

For this reason, you’re likely to come across YAML files anywhere from enterprise development workflows to open source projects—and yes, you’ll see plenty of YAML files on GitHub (it powers a product we’re pretty fond of: GitHub Actions, but more on this later).

Whether you can apply YAML directly to your day-to-day dev workflows or leverage different tools that use YAML, there are some pretty big benefits to getting started with this language—or stepping up your YAML skills.

Looking to learn more about YAML? Try the Learn YAML in Y Minutes guide.

Tip #2: A few DevOps tools to keep you moving fast

Let’s clear up one thing first: “DevOps tools” is an umbrella term that covers everything from cloud platforms, server orchestration tools, code management, version control, and dozens of other things.

So when we talk about “DevOps tools,” we’re really talking about technologies that make it easier to write, test, host, and release software, as well as reduce any worries around unexpected failures.

Here are three “DevOps tools” that can speed up your workflows and let you focus on building great software.

Git

You’re on the GitHub Blog, so we’re pretty sure you’re familiar with Git as a version control system and distributed source code management tool. It’s a mainstay of developers and a popular DevOps tool.

Here’s why: Git makes version control easy and gives teams a straightforward way to collaborate, experiment with different branches, and merge new features into the main software branch.

Learn how Git works >

Cloud-hosted integrated development environments (IDE)

I know, I know, saying cloud-hosted integrated development environments, or cloud IDEs, out loud is a bit of a mouthful (thank you, marketing). But these platforms are something you should start exploring immediately, if you haven’t already.

Here’s why: Cloud IDEs are fully hosted developer environments that let you write, run, and debug code—and they make spinning up new, preconfigured environments fast. Do you need proof? We launched our own cloud IDE called Codespaces earlier this year and started using it internally to build GitHub. It used to take us up to 45 minutes to spin up new developer environments—now it takes 10 seconds :mindblown:.

Cloud IDEs give you a super simple way to quickly spin up new, pre-configured development environments (and disposable development environments). Also, since they’re hosted in the cloud, you don’t need to worry about how powerful the computer you’re coding on is (friendly shout out here goes to the intrepid folks who have started coding on tablets).

Picture this: Your laptop fries itself (which has happened to me once or twice). You might have versions of npm, tools for connecting to your cloud provider, and any number of other configurations that you just lost. If you use a cloud IDE, you can spin up an environment in the cloud with all of your configurations, and that’s a magical thing to see.

Learn how cloud IDEs work >

Containers

If you don’t want to use a cloud IDE, dev containers are something you can use locally or in the cloud. Containers have exploded in popularity over the past decade for their utility in microservices architectures, CI/CD, and cloud-native application development, among other things. By nature, containers are lightweight and efficient making it easy to build, test, stage, and deploy software.

Learning the basics of containerization can be really handy—especially when it comes to testing your code in a lightweight environment that imitates your production environment. If you need to upgrade a library or try using an application on the next version of Node, you can do that really easily with containers before you hit production.

This can be especially useful for ”shifting left,” which is an important DevOps strategy. Catching issues or problems before you ever hit production can save a lot of headaches. If you can find those issues while you’re writing the code, that’s even better. Any problems will eventually mean more work, so the earlier you can catch them the better. After all, catching a problem before you get to the compiling stage can save you a headache or two.

Learn how containers work >

Tip #3: Automated testing and continuous integration (CI) to stay one step ahead

In any conversation around DevOps, you’ll probably hear about automated testing and continuous integration (CI). Yet while automated testing is typically part of a good CI development practice, it’s not strictly a requirement (but it should be … or at least part of your continuous delivery phase).

Most teams have some basic unit testing as part of their CI process, but stop short of testing for security vulnerabilities, automated UI testing, integration testing, etc.

Even still, these are two things that can help you step up your workflows by: (A) making sure your code works with the main branch; and (B) catching things like security vulnerabilities and other problems, so you can lessen your DevOps team’s workload.

Here’s how:

Using GitHub Actions to run automated tests

From ordering pizza to triggering an alarm, there’s a lot you can do with GitHub Actions. It all comes down to workflow automations.When it comes to setting up automated tests with GitHub Actions, you can either build your own action or leverage pre-built actions in the GitHub Marketplace.

[Learn how to build your own GitHub Actions workflow automations.]> Pro tip: Using Actions workflows that run on pull requests is a great way to check for security vulnerabilities, problems in your code, or anything else before you merge to the main branch. Doing this means you’re one step ahead and helps keep your main branch clean.

[Want to learn more about GitHub Actions? Check out our guide.]You can also configure your workflows to deploy to ephemeral testing environments. This means you can run your tests and deploy your changes to an environment where you can test your application. You can even configure your workflow to automatically tear these testing environments down after you’re finished.

All this means you’re testing things as much as possible before it’s time to go to production.

Using GitHub Actions to create CI pipelines

CI, or continuous integration, is the process of automatically integrating code from multiple people for a given project. A good CI practice means you can work faster, make sure your code compiles correctly, merge code changes more efficiently, and be sure your code plays nice with everyone else’s work.

The most powerful CI workflows are the ones that test all of the things you care about every single time you push your code to the server.

If you’re working on GitHub, GitHub Actions can do this for you, too. There are plenty of pre-built CI workflows in the GitHub Marketplace (and you can always build your own), but there are a few things to keep in mind when you start incorporating CI into your development flow. These include:

Run the necessary tests: Think about what build, integration, and testing automations you ideally need. You’ll want to consider things that may have gone wrong with releases in the past, and see if you can add a test for that in your CI.
Balance the time it takes to test your code with how fast you’re pushing new code: Let’s say you have teams pushing new code every five minutes (hypothetically), but the tests you’re running take 10 minutes to execute … that’s not great. It’s always best to balance what you’re checking and when with how long it takes, which might mean trimming your ideal list of tests down to a more realistic number, at least for your CI builds.

Get a tutorial on creating a CI pipeline with GitHub Actions >

Tip #4: Server orchestration tips for flexibility and speed

If you’re building a cloud-native application (or really even just using a few different servers, VMs, containers, or hosting services), you’re probably dealing with a few environments. Being able to make sure your application and infrastructure play well together means you can rely a little less on an operations team trying to get your software to run on existing infrastructure at the last minute.

That’s where server orchestration comes in. Server orchestration—or infrastructure orchestration—is often the job of IT and DevOps teams and includes configuring, managing, provisioning, and coordinating systems, applications, and core infrastructure needed to run software.

Pro tip: There’s a suite of tools that allow you to define and update the infrastructure you need to use.

A big advantage of infrastructure automation is improved scalability—and defined environments means it’s easier to tear down and rebuild an environment when something goes wrong (instead of starting from scratch, but we’ve all been there).

There’s another big advantage: If you want to test something, you don’t have to worry about asking the operations team to go and set up a server for you. You can instead do that as part of a workflow. You don’t have to worry about manually provisioning hardware or system requirements.

How to get started: Don’t try to replace everything in your environment with automated infrastructure automation. Instead, look for a part that might be easy to automate and start there—then the next piece and the next piece after that.

And definitely never start in production. Instead, begin with your testing environment. Once that works, move to your staging environment (and if that works, you can trust it’s good for production).

Tip #5: Repeatable tasks? Try scripting them with Bash or PowerShell

Picture this: You have a bunch of repeatable tasks that you’re executing on a local basis, and you’re spending way too much time working through them every week. There’s a better—and more efficient—way to handle this. How? Scripting with either Bash or PowerShell.

Bash has deep roots in the Unix world, and it’s a mainstay of IT and DevOps teams, and more than a few developers too. PowerShell is comparatively newer. Designed by Microsoft and launched in 2006, PowerShell replaced the command shell and earlier scripting languages for task automation and configuration management in Windows environments.

Today, both Bash and PowerShell are cross-platform (though most people with a Windows background will use PowerShell, and most people familiar with Linux or macOS will use Bash out of habit).

Pro tip: Bash and PowerShell have different ways of working. Where PowerShell works with objects, Bash passes information around as strings. Even still, whatever you choose is largely up to personal preference.

One of the more useful things I’ve done with Bash and PowerShell, for example, is building a script that pulls down the latest version of the code, creates a new branch, switches to that branch, pushes a draft pull request up to GitHub, and then opens VSCode (sub in your editor of choice here) in that branch.

It’s a series of small steps to make your life much easier. It’s something you might do once or twice a week, and if you can script that—it gives you more time to focus on what matters: writing great code.

The bottom line

There’s a big difference between an IT pro, a DevOps engineer, and a developer. But in today’s world of software development, a lot of core DevOps practices are becoming everyone’s job. Plus, any developer that can learn a few DevOps tricks can have an easier time working independently (and more efficiently at that), and continue to focus on what matters most: building great software. That’s something we can all get behind.

Additional resources

Advanced Zabbix API – 5 API use cases to improve your API workfows

2021-11-18 Arturs Lontons

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/advanced-zabbix-api-5-api-use-cases-to-improve-your-api-workfows/16801/

As your monitoring infrastructures evolve, you might hit a point when there’s no avoiding using the Zabbix API. The Zabbix API can be used to automate a particular part of your day-to-day workflow, troubleshoot your monitoring or to simply analyze or get statistics about a specific set of entities.

In this blog post, we will take a look at some of the more advanced API methods and specific method parameters and learn how they can be used to improve your API workflows.

1. Count entities with CountOutput

Let’s start with gathering some statistics. Let’s say you have to count the number of some matching entities – here we can use the CountOutput parameter. For a more advanced use case – what if we have to count the number of events for some time period? Let’s combine countOutput with time_from and time_till (in unixtime) and get the number of events created for the month of November. Let’s get all of the events for the month of November that have the Disaster severity:

{
"jsonrpc": "2.0",
"method": "event.get",
"params": {
"output": "extend",
"time_from": "1635717600",
"time_till": "1638223200",
"severities": "5",
"countOutput": "true"
},
"auth": "xxxxxx",
"id": 1
}

2. Use API to perform Configuration export/import

Next, let’s take a look at how we can use the configuration.export method to export one of our templates in yaml:

{
"jsonrpc": "2.0",
"method": "configuration.export",
"params": {
"options": {
"templates": [
"10001"
]
},
"format": "yaml"
},
"auth": "xxxxxx",
"id": 1
}

Now let’s copy and paste the result of the export and import the template into another environment. It’s extremely important to remember that for this method to work exactly as we intend to, we need to include the parameters that specify the behavior of particular entities contained in the configuration string, such as items/value maps/templates, etc. For example, if I exclude the templates parameter here, no templates will be imported.

{
"jsonrpc": "2.0",
"method": "configuration.import",
"params": {
"format": "yaml",
"rules": {
"valueMaps": {
"createMissing": true,
"updateExisting": true
},
"items": {
"createMissing": true,
"updateExisting": true,
"deleteMissing": true
},
"templates": {
"createMissing": true,
"updateExisting": true
},

"templateLinkage": {
"createMissing": true
}
},
"source": "zabbix_export:\n version: '5.4'\n date: '2021-11-13T09:31:29Z'\n groups:\n -\n uuid: 846977d1dfed4968bc5f8bdb363285bc\n name: 'Templates/Operating systems'\n templates:\n -\n uuid: e2307c94f1744af7a8f1f458a67af424\n template: 'Linux by Zabbix agent active'\n name: 'Linux by Zabbix agent active'\n 
...
},
"auth": "xxxxxx",
"id": 1
}

3. Expand trigger functions and macros with expand parameters

Using trigger.get to obtain information about a particular set of triggers is a relatively common practice. One particular caveat that we have to consider is that by default macros in trigger name, expression or descriptions are not expanded. To expand the available macros we need to use the expand parameters:

{
"jsonrpc": "2.0",
"method": "trigger.get",
"params": {
"triggerids": "18135",
"output": "extend",
"expandExpression":"1",
"selectFunctions": "extend"
},
"auth": "xxxxxx",
"id": 1
}

4. Obtaining additional LLD information for a discovered item

If we wish to display additional LLD information for a discovered entity, in this case – an item, we can use the selectDiscoveryRule and selectItemDiscovery parameters.
While selectDiscoveryRule will provide the ID of the LLD rule that created the item, selectItemDiscovery can point us at the parent item prototype id from which the item was created, last discovery time, item prototype key, and more.

The example below will return the item details and will also provide the LLD rule and Item prototype IDs, the time when the lost item will be deleted and the last time the item was discovered:

{
"jsonrpc": "2.0",
"method": "item.get",
"params": {
"itemids":"36717",
"selectDiscoveryRule":"1",
"selectItemDiscovery":["lastcheck","ts_delete","parent_itemid"]
}, "auth":"xxxxxx",
"id": 1
}

5. Searching through the matched entities with search parameters

Zabbix API provides a couple of standard parameters for performing a search. With search parameter, we can search string or text fields and try to find objects based on a single or multiple entries. searchByAny parameter is capable of extending the search – if you set this as true, we will search by ANY of the criteria in the search array, instead of trying to find an entity that matches ALL of them (default behavior).

The following API call will find items that match agent and Zabbix keys on a particular template:

{
"jsonrpc": "2.0",
"method": "item.get",
"params": {
"output": "extend",
"templateids": "10001",
"search": {
"key_": ["agent.","zabbix"]
},
"searchByAny":"true",
"sortfield": "name"
},
"auth": "xxxxxx",
"id": 1
}

Feel free to take the above examples, change them around so they fit your use case and you should be able to quite easily implement them in your environment. There are many other use cases that we might potentially cover down the line – if you have a specific API use case that you wish for us to cover, feel free to leave a comment under this post and we just might cover it in one of the upcoming blog posts!

Align with best practices while creating infrastructure using CDK Aspects

2021-10-08 Om Prakash Jha

Post Syndicated from Om Prakash Jha original https://aws.amazon.com/blogs/devops/align-with-best-practices-while-creating-infrastructure-using-cdk-aspects/

Organizations implement compliance rules for cloud infrastructure to ensure that they run the applications according to their best practices. They utilize AWS Config to determine overall compliance against the configurations specified in their internal guidelines. This is determined after the creation of cloud resources in their AWS account. This post will demonstrate how to use AWS CDK Aspects to check and align with best practices before the creation of cloud resources in your AWS account.

The AWS Cloud Development Kit (CDK) is an open-source software development framework that lets you define your cloud application resources using familiar programming languages, such as TypeScript, Python, Java, and .NET. The expressive power of programming languages to define infrastructure accelerates the development process and improves the developer experience.

AWS Config is a service that enables you to assess, audit, and evaluate your AWS resource configurations. Config continuously monitors and records your AWS resource configurations, as well as lets you automate the evaluation of recorded configurations against desired configurations. React to non-compliant resources and change their state either automatically or manually.

AWS Config helps customers run their workloads on AWS in a compliant manner. Some customers want to detect it up front, and then only provision compliant resources. Some configurations are important for the customers, so they might not provision resources without having them compliant from the beginning. The following are examples of such configurations:

Amazon S3 bucket must not be created with public access
Amazon S3 bucket encryption must be enabled
Database deletion protection must be enabled

CDK Aspects

CDK Aspects are a way to apply an operation to every construct in a given scope. The aspect could verify something about the state of the constructs, such as ensuring that all buckets are encrypted, or it could modify the constructs, such as by adding tags.

An aspect is a class that implements the IAspect interface shown below. Aspects employ visitor patter, which allows them to add a new operation to existing object structures without modifying the structures. In object-oriented programming and software engineering, the visitor design pattern is a method for separating an algorithm from an object structure on which it operates.

interface IAspect {
   visit(node: IConstruct): void;
}

An AWS CDK app goes through the following lifecycle phases when you call cdk deploy. These phases are also shown in the diagram below. Learn more about the CDK application lifecycle at this page.

Construction
Preparation
Validation
Synthesis
Deployment

Understanding the CDK Deploy

CDK Aspects become relevant during the Prepare phase, where it makes the final modifications round in the constructs to setup their final state. This Prepare phase happens automatically. All constructs have their internal list of Aspects which are called and applied during the Prepare phase. Add your custom aspects in a scope by calling the following method:

Aspects.of(myConstruct).add(new SomeAspect(...));

When you call the method above, constructs add the custom aspects to the list of internal aspects. When CDK application goes through the Prepare phase, then AWS CDK calls the visit method of the object for the constructs and all of its children in top-down order. The visit method is free to change anything in the construct.

How to align with or check configuration compliance using CDK Aspects

In the following sections, you will see how to implement CDK Aspects for some common use cases when provisioning the cloud resources. CDK Aspects are extensible, and you can extend it for any suitable use cases in order to implement additional rules. Apply CDK Aspects not only to verify against the best practices, but also to update the state before resource creation.

The code below creates the cloud resources to be verified against the best practices or to be updated using Aspects in the following sections.

export class AwsCdkAspectsStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    //Create a VPC with 3 availability zones
    const vpc = new ec2.Vpc(this, 'MyVpc', {
      maxAzs: 3,
    });

    //Create a security group
    const sg = new ec2.SecurityGroup(this, 'mySG', {
      vpc: vpc,
      allowAllOutbound: true
    })

    //Add ingress rule for SSH from the public internet
    sg.addIngressRule(ec2.Peer.anyIpv4(), ec2.Port.tcp(22), 'SSH access from anywhere')

    //Launch an EC2 instance in private subnet
    const instance = new ec2.Instance(this, 'MyInstance', {
      vpc: vpc,
      machineImage: ec2.MachineImage.latestAmazonLinux(),
      instanceType: new ec2.InstanceType('t3.small'),
      vpcSubnets: {subnetType: ec2.SubnetType.PRIVATE},
      securityGroup: sg
    })

    //Launch MySQL rds database instance in private subnet
    const database = new rds.DatabaseInstance(this, 'MyDatabase', {
      engine: rds.DatabaseInstanceEngine.mysql({
        version: rds.MysqlEngineVersion.VER_5_7
      }),
      vpc: vpc,
      vpcSubnets: {subnetType: ec2.SubnetType.PRIVATE},
      deletionProtection: false
    })

    //Create an s3 bucket
    const bucket = new s3.Bucket(this, 'MyBucket')
  }
}

In the first section, you will see the use cases and code where Aspects are used to verify the resources against the best practices.

VPC CIDR range must start with specific CIDR IP
Security Group must not have public ingress rule
EC2 instance must use approved AMI

//Verify VPC CIDR range
export class VPCCIDRAspect implements IAspect {
    public visit(node: IConstruct): void {
        if (node instanceof ec2.CfnVPC) {
            if (!node.cidrBlock.startsWith('192.168.')) {
                Annotations.of(node).addError('VPC does not use standard CIDR range starting with "192.168."');
            }
        }
    }
}

//Verify public ingress rule of security group
export class SecurityGroupNoPublicIngressAspect implements IAspect {
    public visit(node: IConstruct) {
        if (node instanceof ec2.CfnSecurityGroup) {
            checkRules(Stack.of(node).resolve(node.securityGroupIngress));
        }

        function checkRules (rules :Array<IngressProperty>) {
            if(rules) {
                for (const rule of rules.values()) {
                    if (!Tokenization.isResolvable(rule) && (rule.cidrIp == '0.0.0.0/0' || rule.cidrIp == '::/0')) {
                        Annotations.of(node).addError('Security Group allows ingress from public internet.');
                    }
                }
            }
        }
    }
}

//Verify AMI of EC2 instance
export class EC2ApprovedAMIAspect implements IAspect {
    public visit(node: IConstruct) {
        if (node instanceof ec2.CfnInstance) {
            if (node.imageId != 'approved-image-id') {
                Annotations.of(node).addError('EC2 Instance is not using approved AMI.');
            }
        }
    }
}

In the second section, you will see the use cases and code where Aspects are used to update the resources in order to make them compliant before creation.

S3 bucket encryption must be enabled. If not, then enable
S3 bucket versioning must be enabled. If not, then enable
RDS instance must have deletion protection enabled. If not, then enable

//Enable versioning of bucket if not enabled
export class BucketVersioningAspect implements IAspect {
    public visit(node: IConstruct): void {
        if (node instanceof s3.CfnBucket) {
            if (!node.versioningConfiguration
                || (!Tokenization.isResolvable(node.versioningConfiguration)
                    && node.versioningConfiguration.status !== 'Enabled')) {
                Annotations.of(node).addInfo('Enabling bucket versioning configuration.');
                node.addPropertyOverride('VersioningConfiguration', {'Status': 'Enabled'})
            }
        }
    }
}

//Enable server side encryption for the bucket if no encryption is enabled
export class BucketEncryptionAspect implements IAspect {
    public visit(node: IConstruct): void {
        if (node instanceof s3.CfnBucket) {
            if (!node.bucketEncryption) {
                Annotations.of(node).addInfo('Enabling default S3 server side encryption.');
                node.addPropertyOverride('BucketEncryption', {
                        "ServerSideEncryptionConfiguration": [
                            {
                                "ServerSideEncryptionByDefault": {
                                    "SSEAlgorithm": "AES256"
                                }
                            }
                        ]
                    }
                )
            }
        }
    }
}

//Enable deletion protection of DB instance if not already enabled
export class RDSDeletionProtectionAspect implements IAspect {
    public visit(node: IConstruct) {
        if (node instanceof rds.CfnDBInstance) {
            if (! node.deletionProtection) {
                Annotations.of(node).addInfo('Enabling deletion protection of DB instance.');
                node.addPropertyOverride('DeletionProtection', true);
            }
        }
    }
}

Once you create the aspects, add them in a particular scope. That scope can be App, Stack, or Construct. In the example below, all aspects are added in the scope of Stack.

const app = new cdk.App();

const stack = new AwsCdkAspectsStack(app, 'MyApplicationStack');

Aspects.of(stack).add(new VPCCIDRAspect());
Aspects.of(stack).add(new SecurityGroupNoPublicIngressAspect());
Aspects.of(stack).add(new EC2ApprovedAMIAspect());
Aspects.of(stack).add(new RDSDeletionProtectionAspect());
Aspects.of(stack).add(new BucketEncryptionAspect());
Aspects.of(stack).add(new BucketVersioningAspect());

app.synth();

Once you call cdk deploy for the above code with aspects added, you will see the output below. The deployment will not continue until you resolve the errors and modifications conducted to make some of the resources compliant.

Screenshot displaying CDK errors.

Also utilize Aspects to make general modifications to the resources regardless of any compliance checks. For example, use it apply mandatory tags to every taggable resource. Tags is an example of implementing CDK Aspects in order to achieve this functionality. Utilizing the code below, add or remove a tag from all taggable resources and their children in the scope of a Construct.

Tags.of(myConstruct).add('key', 'value');
Tags.of(myConstruct).remove('key');

Below is an example of adding the Department tag to every resource created in the scope of Stack.

Tags.of(stack).add('Department', 'Finance');

CDK Aspects are ways for developers to align with and check best practices in their infrastructure configurations using the programming language of choice. AWS CloudFormation Guard (cfn-guard) provides compliance administrators with a simple, policy-as-code language to author policies and apply them to enforce best practices. Aspects are applied before generation of the CloudFormation template in Prepare phase, but cfn-guard is applied after generation of the CloudFormation template and before the Deploy phase. Developers can use Aspects or cfn-guard or both as part of a CI/CD pipeline to stop deployment of non-compliant resources, but CloudFormation Guard is the way to go when you want to enforce compliances and prevent deployment of non-compliant resources.

Conclusion

If you are utilizing AWS CDK to provision your infrastructure, then you can start using Aspects to align with best practices before resources are created. If you are utilizing CloudFormation template to manage your infrastructure, then you can read this blog to learn how to migrate the CloudFormation template to AWS CDK. After the migration, utilize CDK Aspects not only to evaluate compliance of your resources against the best practices, but also modify their state to make them compliant before they are created.

About the Authors

Deploying Alexa Skills with the AWS CDK

2021-08-20 Jeff Gardner

Post Syndicated from Jeff Gardner original https://aws.amazon.com/blogs/devops/deploying-alexa-skills-with-aws-cdk/

So you’re expanding your reach by leveraging voice interfaces for your applications through the Alexa ecosystem. You’ve experimented with a new Alexa Skill via the Alexa Developer Console, and now you’re ready to productionalize it for your customers. How exciting!

You are also a proponent of Infrastructure as Code (IaC). You appreciate the speed, consistency, and change management capabilities enabled by IaC. Perhaps you have other applications that you provision and maintain via DevOps practices, and you want to deploy and maintain your Alexa Skill in the same way. Great idea!

That’s where AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK) come in. AWS CloudFormation lets you treat infrastructure as code, so that you can easily model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycles. The AWS CDK is an open-source software development framework for modeling and provisioning your cloud application resources via familiar programming languages, like TypeScript, Python, Java, and .NET. AWS CDK utilizes AWS CloudFormation in the background in order to provision resources in a safe and repeatable manner.

In this post, we show you how to achieve Infrastructure as Code for your Alexa Skills by leveraging powerful AWS CDK features.

Concepts

Alexa Skills Kit (ASK)

In addition to the Alexa Developer Console, skill developers can utilize the Alexa Skills Kit (ASK) to build interactive voice interfaces for Alexa. ASK provides a suite of self-service APIs and tools for building and interacting with Alexa Skills, including the ASK CLI, the Skill Management API (SMAPI), and SDKs for Node.js, Java, and Python. These tools provide a programmatic interface for your Alexa Skills in order to update them with code rather than through a user interface.

AWS CloudFormation

AWS CloudFormation lets you create templates written in either YAML or JSON format to model your infrastructure in code form. CloudFormation templates are declarative and idempotent, allowing you to check them into a versioned code repository, deploy them automatically, and track changes over time.

The ASK CloudFormation resource allows you to incorporate Alexa Skills in your CloudFormation templates alongside your other infrastructure. However, this has limitations that we’ll discuss in further detail in the Problem section below.

AWS Cloud Development Kit (AWS CDK)

Think of the AWS CDK as a developer-centric toolkit that leverages the power of modern programming languages to define your AWS infrastructure as code. When AWS CDK applications are run, they compile down to fully formed CloudFormation JSON/YAML templates that are then submitted to the CloudFormation service for provisioning. Because the AWS CDK leverages CloudFormation, you still enjoy every benefit provided by CloudFormation, such as safe deployment, automatic rollback, and drift detection. AWS CDK currently supports TypeScript, JavaScript, Python, Java, C#, and Go (currently in Developer Preview).

Perhaps the most compelling part of AWS CDK is the concept of constructs—the basic building blocks of AWS CDK apps. The three levels of constructs reflect the level of abstraction from CloudFormation. A construct can represent a single resource, like an AWS Lambda Function, or it can represent a higher-level component consisting of multiple AWS resources.

The three different levels of constructs begin with low-level constructs, called L1 (short for “level 1”) or Cfn (short for CloudFormation) resources. These constructs directly represent all of the resources available in AWS CloudFormation. The next level of constructs, called L2, also represents AWS resources, but it has a higher-level and intent-based API. They provide not only similar functionality, but also the defaults, boilerplate, and glue logic you’d be writing yourself with a CFN Resource construct. Finally, the AWS Construct Library includes even higher-level constructs, called L3 constructs, or patterns. These are designed to help you complete common tasks in AWS, often involving multiple resource types. Learn more about constructs in the AWS CDK developer guide.

One L2 construct example is the Custom Resources module. This lets you execute custom logic via a Lambda Function as part of your deployment in order to cover scenarios that the AWS CDK doesn’t support yet. While the Custom Resources module leverages CloudFormation’s native Custom Resource functionality, it also greatly reduces the boilerplate code in your CDK project and simplifies the necessary code in the Lambda Function. The open-source construct library referenced in the Solution section of this post utilizes Custom Resources to avoid some limitations of what CloudFormation and CDK natively support for Alexa Skills.

Problem

The primary issue with utilizing the Alexa::ASK::Skill CloudFormation resource, and its corresponding CDK CfnSkill construct, arises when you define the Skill’s backend Lambda Function in the same CloudFormation template or CDK project. When the Skill’s endpoint is set to a Lambda Function, the ASK service validates that the Skill has the appropriate permissions to invoke that Lambda Function. The best practice is to enable Skill ID verification in your Lambda Function. This effectively restricts the Lambda Function to be invokable only by the configured Skill ID. The problem is that in order to configure Skill ID verification, the Lambda Permission must reference the Skill ID, so it cannot be added to the Lambda Function until the Alexa Skill has been created. If we try creating the Alexa Skill without the Lambda Permission in place, insufficient permissions will cause the validation to fail. The endpoint validation causes a circular dependency preventing us from defining our desired end state with just the native CloudFormation resource.

Unfortunately, the AWS CDK also does not yet support any L2 constructs for Alexa skills. While the ASK Skill Management API is another option, managing imperative API calls within a CI/CD pipeline would not be ideal.

Solution

Overview

AWS CDK is extensible in that if there isn’t a native construct that does what you want, you can simply create your own! You can also publish your custom constructs publicly or privately for others to leverage via package registries like npm, PyPI, NuGet, Maven, etc.

We could write our own code to solve the problem, but luckily this use case allows us to leverage an open-source construct library that addresses our needs. This library is currently available for TypeScript (npm) and Python (PyPI).

The complete solution can be found at the GitHub repository, here. The code is in TypeScript, but you can easily port it to another language if necessary. See the AWS CDK Developer Guide for more guidance on translating between languages.

Prerequisites

You will need the following in order to build and deploy the solution presented below. Please be mindful of any prerequisites for these tools.

Alexa Developer Account
AWS Account
Docker
- Used by CDK for bundling assets locally during synthesis and deployment.
- See Docker website for installation instructions based on your operating system.
AWS CLI
- Used by CDK to deploy resources to your AWS account.
- See AWS CLI user guide for installation instructions based on your operating system.
Node.js
- The CDK Toolset and backend runs on Node.js regardless of the project language. See the detailed requirements in the AWS CDK Getting Started Guide.
- See the Node.js website to download the specific installer for your operating system.

Clone Code Repository and Install Dependencies

The code for the solution in this post is located in this repository on GitHub. First, clone this repository and install its local dependencies by executing the following commands in your local Terminal:

# clone repository
git clone https://github.com/aws-samples/aws-devops-blog-alexa-cdk-walkthrough
# navigate to project directory
cd aws-devops-blog-alexa-cdk-walkthrough
# install dependencies
npm install

Note that CLI commands in the sections below (ask, cdk) use npx. This executes the command from local project binaries if they exist, or, if not, it installs the binaries required to run the command. In our case, the local binaries are installed as part of the npm install command above. Therefore, npx will utilize the local version of the binaries even if you already have those tools installed globally. We use this method to simplify setup and alleviate any issues arising from version discrepancies.

Get Alexa Developer Credentials

To create and manage Alexa Skills via CDK, we will need to provide Alexa Developer account credentials, which are separate from our AWS credentials. The following values must be supplied in order to authenticate:

Vendor ID: Represents the Alexa Developer account.
Client ID: Represents the developer, tool, or organization requiring permission to perform a list of operations on the skill. In this case, our AWS CDK project.
Client Secret: The secret value associated with the Client ID.
Refresh Token: A token for reauthentication. The ASK service uses access tokens for authentication that expire one hour after creation. Refresh tokens do not expire and can retrieve a new access token when needed.

Follow the steps below to retrieve each of these values.

Get Alexa Developer Vendor ID

Easily retrieve your Alexa Developer Vendor ID from the Alexa Developer Console.

Navigate to the Alexa Developer console and login with your Amazon account.
After logging in, on the main screen click on the “Settings” tab.

Screenshot of Alexa Developer console showing location of Settings tab

Your Vendor ID is listed in the “My IDs” section. Note this value.

Screenshot of Alexa Developer console showing location of Vendor ID

Create Login with Amazon (LWA) Security Profile

The Skill Management API utilizes Login with Amazon (LWA) for authentication, so first we must create a security profile for LWA under the same Amazon account that we will use to create the Alexa Skill.

Navigate to the LWA console and login with your Amazon account.
Click the “Create a New Security Profile” button.

Screenshot of Login with Amazon console showing location of Create a New Security Profile button

Fill out the form with a Name, Description, and Consent Privacy Notice URL, and then click “Save”.

Screenshot of Login with Amazon console showing Create a New Security Profile form

The new Security Profile should now be listed. Hover over the gear icon, located to the right of the new profile name, and click “Web Settings”.

Screenshot of Login with Amazon console showing location of Web Settings link

Click the “Edit” button and add the following under “Allowed Return URLs”:
- http://127.0.0.1:9090/cb
- https://s3.amazonaws.com/ask-cli/response_parser.html
Click the “Save” button to save your changes.
Click the “Show Secret” button to reveal your Client Secret. Note your Client ID and Client Secret.

Screenshot of Login with Amazon console showing location of Client ID and Client Secret values

Get Refresh Token from ASK CLI

Your Client ID and Client Secret let you generate a refresh token for authenticating with the ASK service.

Navigate to your local Terminal and enter the following command, replacing <your Client ID> and <your Client Secret> with your Client ID and Client Secret, respectively:

# ensure you are in the root directory of the repository
npx ask util generate-lwa-tokens --client-id "<your Client ID>" --client-confirmation "<your Client Secret>" --scopes "alexa::ask:skills:readwrite alexa::ask:models:readwrite"

A browser window should open with a login screen. Supply credentials for the same Amazon account with which you created the LWA Security Profile previously.
Click the “Allow” button to grant the refresh token appropriate access to your Amazon Developer account.
Return to your Terminal. The credentials, including your new refresh token, should be printed. Note the value in the refresh_token field.

NOTE: If your Terminal shows an error like CliFileNotFoundError: File ~/.ask/cli_config not exists., you need to first initialize the ASK CLI with the command npx ask configure. This command will open a browser with a login screen, and you should enter the credentials for the Amazon account with which you created the LWA Security Profile previously. After signing in, return to your Terminal and enter n to decline linking your AWS account. After completing this process, try the generate-lwa-tokens command above again.

NOTE: If your Terminal shows an error like CliError: invalid_client, make sure that you have included the quotation marks (") around the --client_id and --client-confirmation arguments.

Add Alexa Developer Credentials to AWS SSM Parameter Store / AWS Secrets Manager

Our AWS CDK project requires access to the Alexa Developer credentials we just generated (Client ID, Client Secret, Refresh Token) in order to create and manage our Skill. To avoid hard-coding these values into our code, we can store the values in AWS Systems Manager (SSM) Parameter Store and AWS Secrets Manager, and then retrieve them programmatically when deploying our CDK project. In our case, we are using SSM Parameter Store to store the non-sensitive values in plaintext, and Secrets Manager to store the secret values in encrypted form.

The repository contains a shell script at scripts/upload-credentials.sh that can create the appropriate parameters and secrets via AWS CLI. You’ll just need to supply the credential values from the previous steps. Alternatively, instructions for creating parameters and secrets via the AWS Console or AWS CLI can each be found in the AWS Systems Manager User Guide and AWS Secrets Manager User Guide.

You will need the following resources created in your AWS account before proceeding:

Name	Service	Type
/alexa-cdk-blog/alexa-developer-vendor-id	SSM Parameter Store	String
/alexa-cdk-blog/lwa-client-id	SSM Parameter Store	String
/alexa-cdk-blog/lwa-client-secret	Secrets Manager	Plaintext / secret-string
/alexa-cdk-blog/lwa-refresh-token	Secrets Manager	Plaintext / secret-string

Code Walkthrough

Skill Package

When you programmatically create an Alexa Skill, you supply a Skill Package, which is a zip file consisting of a set of files defining your Skill. A skill package includes a manifest JSON file, and optionally a set of interaction model files, in-skill product files, and/or image assets for your skill. See the Skill Management API documentation for details regarding skill packages.

The repository contains a skill package that defines a simple Time Teller Skill at src/skill-package. If you want to use an existing Skill instead, replace the contents of src/skill-package with your skill package.

If you want to export the skill package of an existing Skill, use the ASK CLI:

Navigate to the Alexa Developer console and log in with your Amazon account.
Find the Skill you want to export and click the link under the name “Copy Skill ID”. Either make sure this stays on your clipboard or note the Skill ID for the next step.
Navigate to your local Terminal and enter the following command, replacing <your Skill ID> with your Skill ID:

# ensure you are in the root directory of the repository
cd src
npx ask smapi export-package --stage development --skill-id <your Skill ID>

NOTE: To export the skill package for a live skill, replace --stage development with --stage live.

NOTE: The CDK code in this solution will dynamically populate the manifest.apis section in skill.json. If that section is populated in your skill package, either clear it out or know that it will be replaced when the project is deployed.

Skill Backend Lambda Function

The Lambda Function code for the Time Teller Alexa Skill’s backend also resides within the CDK project at src/lambda/skill-backend. If you want to use an existing Skill instead, replace the contents of src/lambda/skill-backend with your Lambda code. Also note the following if you want to use your own Lambda code:

The CDK code in the repository assumes that the Lambda Function runtime is Python. However, you can modify for another runtime if necessary by using either the aws-lambda or aws-lambda-nodejs CDK module instead of aws-lambda-python.
If you’re using your own Python Lambda Function code, please note the following to ensure the Lambda Function definition compatibility in the sample CDK project. If your Lambda Function varies from what is below, then you may need to modify the CDK code. See the Python Lambda code in the repository for an example.
- The skill-backend/ directory should contain all of the necessary resources for your Lambda Function. For Python functions, this should include at least a file named index.py that contains your Lambda entrypoint, and a requirements.txt file containing your pip dependencies.
- For Python functions, your Lambda handler function should be called handler(). This generally looks like handler = SkillBuilder().lambda_handler() when using the Python ASK SDK.

Open-Source Alexa Skill Construct Library

As mentioned above, this solution utilizes an open-source construct library to create and manage the Alexa Skill. This construct library utilizes the L1 CfnSkill construct along with other L1 and L2 constructs to create a complete Alexa Skill with a functioning backend Lambda Function. Utilizing this construct library means that we are no longer limited by the shortcomings of only using the Alexa::ASK::Skill CloudFormation resource or L1 CfnSkill construct.

Look into the construct library code if you’re curious. There’s only one construct—Skill—and you can follow the code to see how it dodges the Lambda Permission issue.

CDK Stack

The CDK stack code is located in lib/alexa-cdk-stack.ts. Let’s dive in to understand what’s happening. We’ll look at one section at a time:

...
const PARAM_PREFIX = '/alexa-cdk-blog/'

export class AlexaCdkStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Get Alexa Developer credentials from SSM Parameter Store/Secrets Manager.
    // NOTE: Parameters and secrets must have been created in the appropriate account before running `cdk deploy` on this stack.
    //       See sample script at scripts/upload-credentials.sh for how to create appropriate resources via AWS CLI.
    const alexaVendorId = ssm.StringParameter.valueForStringParameter(this, `${PARAM_PREFIX}alexa-developer-vendor-id`);
    const lwaClientId = ssm.StringParameter.valueForStringParameter(this, `${PARAM_PREFIX}lwa-client-id`);
    const lwaClientSecret = cdk.SecretValue.secretsManager(`${PARAM_PREFIX}lwa-client-secret`);
    const lwaRefreshToken = cdk.SecretValue.secretsManager(`${PARAM_PREFIX}lwa-refresh-token`);
    ...
  }
}

First, within the stack’s constructor, after calling the constructor of the base class, we retrieve the credentials we uploaded earlier to SSM and Secrets Manager. This lets us to store our account credentials in a safe place—encrypted in the case of our lwaClientSecret and lwaRefreshToken secrets—and we avoid storing sensitive data in plaintext or source control.

...
export class AlexaCdkStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    ...
    // Create the Lambda Function for the Skill Backend
    const skillBackend = new lambdaPython.PythonFunction(this, 'SkillBackend', {
      entry: 'src/lambda/skill-backend',
      timeout: cdk.Duration.seconds(7)
    });
    ...
  }
}

Next, we create the Lambda Function containing the skill’s backend logic. In this case, we are using the aws-lambda-python module. This transparently handles every aspect of the dependency installation and packaging for us. Rather than leave the default 3-second timeout, specify a 7-second timeout to correspond with the Alexa service timeout of 8 seconds.

...

export class AlexaCdkStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    ...
    // Create the Alexa Skill
    const skill = new Skill(this, 'Skill', {
      endpointLambdaFunction: skillBackend,
      skillPackagePath: 'src/skill-package',
      alexaVendorId: alexaVendorId,
      lwaClientId: lwaClientId,
      lwaClientSecret: lwaClientSecret,
      lwaRefreshToken: lwaRefreshToken
    });
  }
}

Finally, we create our Skill! All we need to do is pass the Lambda Function with the Skill’s backend code into where the skill package is located, as well as the credentials for authenticating into our Alexa Developer account. All of the wiring for deploying the skill package and connecting the Lambda Function to the Skill is handled transparently within the construct code.

Deploy CDK project

Now that all of our code is in place, we can deploy our project and test it out!

Make sure that you have bootstrapped your AWS account for CDK. If not, you can bootstrap with the following command:

# ensure you are in the root directory of the repository
npx cdk bootstrap

Make sure that the Docker daemon is running locally. This is generally done by starting the Docker Desktop application.
- You can also use the following Terminal command to determine whether the Docker daemon is running. The command will return an error if the daemon is not running.

docker ps -q

- See more details regarding starting the Docker daemon based on your operating system via the Docker website.
Synthesize your CDK project in order to confirm that your project is building properly.

# ensure you are in the root directory of the repository
npx cdk synth

NOTE: In addition to generating the CloudFormation template for this project, this command also bundles the Lambda Function code via Docker, so it may take a few minutes to complete.

Deploy!

# ensure you are in the root directory of the repository
npx cdk deploy

- Feel free to review the IAM policies that will be created, and enter y to continue when prompted.
- If you would like to skip the security approval requirement and deploy in one step, use cdk deploy --require-approval never instead.

Check it out!

Once your project finishes deploying, take a look at your new Skill!

Navigate to the Alexa Developer console and log in with your Amazon account.
After logging in, on the main screen you should now see your new Skill listed. Click on the name to go to the “Build” screen.
Investigate the console to confirm that your Skill was created as expected.
On the left-hand navigation menu, click “Endpoint” and confirm that the ARN for your backend Lambda Function is showing in the “Default Region” field. This ARN was added dynamically by our CDK project.

Screenshot of Alexa Developer console showing location of Endpoint text box

Test the Skill to confirm that it functions properly.
1. Click on the “Test” tab and enable testing for the “Development” stage of your skill.
2. Type your Skill’s invocation name in the Alexa Simulator in order to launch the skill and invoke a response.
  - If you deployed the sample skill package and Lambda Function, the invocation name is “time teller”. If Alexa responds with the current time in UTC, it is working properly!

Bonus Points

Now that you can deploy your Alexa Skill via the AWS CDK, can you incorporate your new project into a CI/CD pipeline for automated deployments? Extra kudos if the pipeline is defined with the CDK 🙂 Follow these links for some inspiration:

Cleanup

After you are finished, delete the resources you created to avoid incurring future charges. This can be easily done by deleting the CloudFormation stack from the CloudFormation console, or by executing the following command in your Terminal, which has the same effect:

# ensure you are in the root directory of the repository
npx cdk destroy

Conclusion

You can, and should, strive for IaC and CI/CD in every project, and the powerful AWS CDK features make that easier with a set of simple yet flexible constructs. Leverage the simplicity of declarative infrastructure definitions with convenient default configurations and helper methods via the AWS CDK. This example also reveals that if there are any gaps in the built-in functionality, you can easily fill them with a custom resource construct, or one of the thousands of open-source construct libraries shared by fellow CDK developers around the world. Happy coding!

Jeff Gardner

Jeff Gardner is a Solutions Architect with Amazon Web Services (AWS). In his role, Jeff helps enterprise customers through their cloud journey, leveraging his experience with application architecture and DevOps practices. Outside of work, Jeff enjoys watching and playing sports and chasing around his three young children.

Blue/Green deployment with AWS Developer tools on Amazon EC2 using Amazon EFS to host application source code

2021-07-28 Rakesh Singh

Post Syndicated from Rakesh Singh original https://aws.amazon.com/blogs/devops/blue-green-deployment-with-aws-developer-tools-on-amazon-ec2-using-amazon-efs-to-host-application-source-code/

Many organizations building modern applications require a shared and persistent storage layer for hosting and deploying data-intensive enterprise applications, such as content management systems, media and entertainment, distributed applications like machine learning training, etc. These applications demand a centralized file share that scales to petabytes without disrupting running applications and remains concurrently accessible from potentially thousands of Amazon EC2 instances.

Simultaneously, customers want to automate the end-to-end deployment workflow and leverage continuous methodologies utilizing AWS developer tools services for performing a blue/green deployment with zero downtime. A blue/green deployment is a deployment strategy wherein you create two separate, but identical environments. One environment (blue) is running the current application version, and one environment (green) is running the new application version. The blue/green deployment strategy increases application availability by generally isolating the two application environments and ensuring that spinning up a parallel green environment won’t affect the blue environment resources. This isolation reduces deployment risk by simplifying the rollback process if a deployment fails.

Amazon Elastic File System (Amazon EFS) provides a simple, scalable, and fully-managed elastic NFS file system for use with AWS Cloud services and on-premises resources. It scales on demand, thereby eliminating the need to provision and manage capacity in order to accommodate growth. Utilize Amazon EFS to create a shared directory that stores and serves code and content for numerous applications. Your application can treat a mounted Amazon EFS volume like local storage. This means you don’t have to deploy your application code every time the environment scales up to multiple instances to distribute load.

In this blog post, I will guide you through an automated process to deploy a sample web application on Amazon EC2 instances utilizing Amazon EFS mount to host application source code, and utilizing a blue/green deployment with AWS code suite services in order to deploy the application source code with no downtime.

How this solution works

This blog post includes a CloudFormation template to provision all of the resources needed for this solution. The CloudFormation stack deploys a Hello World application on Amazon Linux 2 EC2 Instances running behind an Application Load Balancer and utilizes Amazon EFS mount point to store the application content. The AWS CodePipeline project utilizes AWS CodeCommit as the version control, AWS CodeBuild for installing dependencies and creating artifacts, and AWS CodeDeploy to conduct deployment on EC2 instances running in an Amazon EC2 Auto Scaling group.

Figure 1 below illustrates our solution architecture.

Figure 1: Sample solution architecture

The event flow in Figure 1 is as follows:

A developer commits code changes from their local repo to the CodeCommit repository. The commit triggers CodePipeline execution.
CodeBuild execution begins to compile source code, install dependencies, run custom commands, and create deployment artifact as per the instructions in the Build specification reference file.
During the build phase, CodeBuild copies the source-code artifact to Amazon EFS file system and maintains two different directories for current (green) and new (blue) deployments.
After successfully completing the build step, CodeDeploy deployment kicks in to conduct a Blue/Green deployment to a new Auto Scaling Group.
During the deployment phase, CodeDeploy mounts the EFS file system on new EC2 instances as per the CodeDeploy AppSpec file reference and conducts other deployment activities.
After successful deployment, a Lambda function triggers in order to store a deployment environment parameter in Systems Manager parameter store. The parameter stores the current EFS mount name that the application utilizes.
The AWS Lambda function updates the parameter value during every successful deployment with the current EFS location.

Prerequisites

For this walkthrough, the following are required:

An AWS account
Access to an AWS account with administrator or PowerUser (or equivalent) AWS Identity and Access Management(IAM) role policies attached
Git Command Line installed and configured in your local environment

Deploy the solution

Once you’ve assembled the prerequisites, download or clone the GitHub repo and store the files on your local machine. Utilize the commands below to clone the repo:

mkdir -p ~/blue-green-sample/
cd ~/blue-green-sample/
git clone https://github.com/aws-samples/blue-green-deployment-pipeline-for-efs

Once completed, utilize the following steps to deploy the solution in your AWS account:

Create a private Amazon Simple Storage Service (Amazon S3) bucket by using this documentation

Figure 2: AWS S3 console view when creating a bucket
Upload the cloned or downloaded GitHub repo files to the root of the S3 bucket. the S3 bucket objects structure should look similar to Figure 3:

Figure 3: AWS S3 bucket object structure
Go to the S3 bucket and select the template name solution-stack-template.yml, and then copy the object URL.
Open the CloudFormation console. Choose the appropriate AWS Region, and then choose Create Stack. Select With new resources.
Select Amazon S3 URL as the template source, paste the object URL that you copied in Step 3, and then choose Next.
On the Specify stack details page, enter a name for the stack and provide the following input parameter. Modify the default values for other parameters in order to customize the solution for your environment. You can leave everything as default for this walkthrough.

ArtifactBucket– The name of the S3 bucket that you created in the first step of the solution deployment. This is a mandatory parameter with no default value.

Figure 4: Defining the stack name and input parameters for the CloudFormation stack

Choose Next.
On the Options page, keep the default values and then choose Next.
On the Review page, confirm the details, acknowledge that CloudFormation might create IAM resources with custom names, and then choose Create Stack.
Once the stack creation is marked as CREATE_COMPLETE, the following resources are created:

A virtual private cloud (VPC) configured with two public and two private subnets.
NAT Gateway, an EIP address, and an Internet Gateway.
Route tables for private and public subnets.
Auto Scaling Group with a single EC2 Instance.
Application Load Balancer and a Target Group.
Three security groups—one each for ALB, web servers, and EFS file system.
Amazon EFS file system with a mount target for each Availability Zone.
CodePipeline project with CodeCommit repository, CodeBuild, and CodeDeploy resources.
SSM parameter to store the environment current deployment status.
Lambda function to update the SSM parameter for every successful pipeline execution.
Required IAM Roles and policies.

Note: It may take anywhere from 10-20 minutes to complete the stack creation.

Test the solution

Now that the solution stack is deployed, follow the steps below to test the solution:

Validate CodePipeline execution status

After successfully creating the CloudFormation stack, a CodePipeline execution automatically triggers to deploy the default application code version from the CodeCommit repository.

In the AWS console, choose Services and then CloudFormation. Select your stack name. On the stack Outputs tab, look for the CodePipelineURL key and click on the URL.
Validate that all steps have successfully completed. For a successful CodePipeline execution, you should see something like Figure 5. Wait for the execution to complete in case it is still in progress.

Figure 5: CodePipeline console showing execution status of all stages

Validate the Website URL

After completing the pipeline execution, hit the website URL on a browser to check if it’s working.

On the stack Outputs tab, look for the WebsiteURL key and click on the URL.
For a successful deployment, it should open a default page similar to Figure 6.

Figure 6: Sample “Hello World” application (Green deployment)

Validate the EFS share

After the website deployed successfully, we will get into the application server and validate the EFS mount point and the application source code directory.

Open the Amazon EC2 console, and then choose Instances in the left navigation pane.
Select the instance named bg-sample and choose
For Connection method, choose Session Manager, and then choose connect

After the connection is made, run the following bash commands to validate the EFS mount and the deployed content. Figure 7 shows a sample output from running the bash commands.

sudo df –h | grep efs
ls –la /efs/green
ls –la /var/www/

Figure 7: Sample output from the bash command (Green deployment)

Deploy a new revision of the application code

After verifying the application status and the deployed code on the EFS share, commit some changes to the CodeCommit repository in order to trigger a new deployment.

On the stack Outputs tab, look for the CodeCommitURL key and click on the corresponding URL.
Click on the file html.
Click on
Uncomment line 9 and comment line 10, so that the new lines look like those below after the changes:

background-color: #0188cc; 
#background-color: #90ee90;

Add Author name, Email address, and then choose Commit changes.

After you commit the code, the CodePipeline triggers and executes Source, Build, Deploy, and Lambda stages. Once the execution completes, hit the Website URL and you should see a new page like Figure 8.

Figure 8: New Application version (Blue deployment)

On the EFS side, the application directory on the new EC2 instance now points to /efs/blue as shown in Figure 9.

Figure 9: Sample output from the bash command (Blue deployment)

Solution review

Let’s review the pipeline stages details and what happens during the Blue/Green deployment:

1) Build stage

For this sample application, the CodeBuild project is configured to mount the EFS file system and utilize the buildspec.yml file present in the source code root directory to run the build. Following is the sample build spec utilized in this solution:

version: 0.2
phases:
  install:
    runtime-versions:
      php: latest   
  build:
    commands:
      - current_deployment=$(aws ssm get-parameter --name $SSM_PARAMETER --query "Parameter.Value" --region $REGION --output text)
      - echo $current_deployment
      - echo $SSM_PARAMETER
      - echo $EFS_ID $REGION
      - if [[ "$current_deployment" == "null" ]]; then echo "this is the first GREEN deployment for this project" ; dir='/efs/green' ; fi
      - if [[ "$current_deployment" == "green" ]]; then dir='/efs/blue' ; else dir='/efs/green' ; fi
      - if [ ! -d $dir ]; then  mkdir $dir >/dev/null 2>&1 ; fi
      - echo $dir
      - rsync -ar $CODEBUILD_SRC_DIR/ $dir/
artifacts:
  files:
      - '**/*'

During the build job, the following activities occur:

Installs latest php runtime version.
Reads the SSM parameter value in order to know the current deployment and decide which directory to utilize. The SSM parameter value flips between green and blue for every successful deployment.
Synchronizes the latest source code to the EFS mount point.
Creates artifacts to be utilized in subsequent stages.

Note: Utilize the default buildspec.yml as a reference and customize it further as per your requirement. See this link for more examples.

2) Deploy Stage

The solution is utilizing CodeDeploy blue/green deployment type for EC2/On-premises. The deployment environment is configured to provision a new EC2 Auto Scaling group for every new deployment in order to deploy the new application revision. CodeDeploy creates the new Auto Scaling group by copying the current one. See this link for more details on blue/green deployment configuration with CodeDeploy. During each deployment event, CodeDeploy utilizes the appspec.yml file to run the deployment steps as per the defined life cycle hooks. Following is the sample AppSpec file utilized in this solution.

version: 0.0
os: linux
hooks:
  BeforeInstall:
    - location: scripts/install_dependencies
      timeout: 180
      runas: root
  AfterInstall:
    - location: scripts/app_deployment
      timeout: 180
      runas: root
  BeforeAllowTraffic :
     - location: scripts/check_app_status
       timeout: 180
       runas: root

Note: The scripts mentioned in the AppSpec file are available in the scripts directory of the CodeCommit repository. Utilize these sample scripts as a reference and modify as per your requirement.

For this sample, the following steps are conducted during a deployment:

BeforeInstall:
- Installs required packages on the EC2 instance.
- Mounts the EFS file system.
- Creates a symbolic link to point the apache home directory /var/www/html to the appropriate EFS mount point. It also ensures that the new application version deploys to a different EFS directory without affecting the current running application.

AfterInstall:
- Stops apache web server.
- Fetches current EFS directory name from Systems Manager.
- Runs some clean up commands.
- Restarts apache web server.

BeforeAllowTraffic:
- Checks application status if running fine.
- Exits the deployment with error if the app returns a non 200 HTTP status code.

3) Lambda Stage

After completing the deploy stage, CodePipeline triggers a Lambda function in order to update the SSM parameter value with the updated EFS directory name. This parameter value alternates between “blue” and “green” to help CodePipeline identify the right EFS file system path during the next deployment.

CodeDeploy Blue/Green deployment

Let’s review the sequence of events flow during the CodeDeploy deployment:

CodeDeploy creates a new Auto Scaling group by copying the original one.
Provisions a replacement EC2 instance in the new Auto Scaling Group.
Conducts the deployment on the new instance as per the instructions in the yml file.
Sets up health checks and redirects traffic to the new instance.
Terminates the original instance along with the Auto Scaling Group.
After completing the deployment, it should appear as shown in Figure 10.

AWS CodeDeploy console view of a Blue/Green CodeDeploy deployment on Ec2

Figure 10: AWS console view of a Blue/Green CodeDeploy deployment on Ec2

Troubleshooting

To troubleshoot any service-related issues, see the following links:

More information

Now that you have tested the solution, here are some additional points worth noting:

The sample template and code utilized in this blog can work in any AWS region and are mainly intended for demonstration purposes. Utilize the sample as a reference and modify it further as per your requirement.
This solution works with single account, Region, and VPC combination.
For this sample, we have utilized AWS CodeCommit as version control, but you can also utilize any other source supported by AWS CodePipeline like Bitbucket, GitHub, or GitHub Enterprise Server

Clean up

Follow these steps to delete the components and avoid any future incurring charges:

Open the AWS CloudFormation console.
On the Stacks page in the CloudFormation console, select the stack that you created for this blog post. The stack must be currently running.
In the stack details pane, choose Delete.
Select Delete stack when prompted.
Empty and delete the S3 bucket created during deployment step 1.

Conclusion

In this blog post, you learned how to set up a complete CI/CD pipeline for conducting a blue/green deployment on EC2 instances utilizing Amazon EFS file share as mount point to host application source code. The EFS share will be the central location hosting your application content, and it will help reduce your overall deployment time by eliminating the need for deploying a new revision on every EC2 instance local storage. It also helps to preserve any dynamically generated content when the life of an EC2 instance ends.

Author bio

Rakesh Singh

Rakesh is a Senior Technical Account Manager at Amazon. He loves automation and enjoys working directly with customers to solve complex technical issues and provide architectural guidance. Outside of work, he enjoys playing soccer, singing karaoke, and watching thriller movies.

Our Journey to Continuous Delivery at Grab (Part 2)

2021-05-10 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab-part2

In the first part of this blog post, you’ve read about the improvements made to our build and staging deployment process, and how plenty of manual tasks routinely taken by engineers have been automated with Conveyor: an in-house continuous delivery solution.

This new post begins with the introduction of the hermeticity principle for our deployments, and how it improves the confidence with promoting changes to production. Changes sent to production via Conveyor’s deployment pipelines are then described in detail.

Finally, looking back at the engineering efficiency improvements around velocity and reliability over the last 2 years, we answer the big question – was the investment on a custom continuous delivery solution like Conveyor the right decision for Grab?

Improving Confidence in our Production Deployments with Hermeticity

The term deployment hermeticity is borrowed from build systems. A build system is called hermetic if builds always produce the same artefacts regardless of changes in the environment they run on. Similarly, we call our deployments hermetic if they always result in the same deployed artefacts regardless of the environment’s change or the number of times they are executed.

The behaviour of a service is rarely controlled by a single variable. The application that makes up your service is an important driver of its behaviour, but its configuration is an important contributor, for example. The behaviour for traditional microservices at Grab is dictated mainly by 3 versioned artefacts: application code, static and dynamic configuration.

Conveyor has been integrated with the systems that operate changes in each of these parameters. By tracking all 3 parameters at every deployment, Conveyor can reproducibly deploy microservices with similar behaviour: its deployments are hermetic.

Building upon this property, Conveyor can ensure that all deployments made to production have been tested before with the same combination of parameters. This is valuable to us:

An outcome of staging deployments for a specific set of parameters is a good predictor of outcomes in production deployments for the same set of parameters and thus it makes testing in staging more relevant.
Rollbacks are hermetic; we never rollback to a combination of parameters that has not been used previously.

In the past, incidents had resulted from an application rollback not compatible with the current dynamic configuration version; this was aggravating since rollbacks are expected to be a safe recovery mechanism. The introduction of hermetic deployments has largely eliminated this category of problems.

Hermeticity is maintained by registering the deployment parameters as artefacts after each successfully completed pipeline. Users must then select one of the registered deployment metadata to promote to production.

At this point, you might be wondering: why not use a single pipeline that includes both staging and production deployments? This was indeed how it started, with a single pipeline spanning multiple environments. However, engineers soon complained about it.

The most obvious reason for the complaint was that less than 20% of changes deployed in staging will make their way to production. This meant that engineers would have toil associated with each completed staging deployment since the pipeline must be manually cancelled rather than continued to production.

The other reason is that this multi-environment pipeline approach reduced flexibility when promoting changes to production. There are different ways to apply changes to a cluster. For example, lengthy pipelines that refresh instances can be used to deploy any combination of changes, while there are quicker pipelines restricted to dynamic configuration changes (such as feature flags rollouts). Regardless of the order in which the changes are made and how they are applied, Conveyor tracks the change.

Eventually, engineers promote a deployment artefact to production. However they do not need to apply changes in the same sequence with which were applied to staging. Furthermore, to prevent erroneous actions, Conveyor presents only changes that can be applied with the requested pipeline (and sometimes, no changes are available). Not being forced into a specific method of deploying changes is one of added benefits of hermetic deployments.

Returning to Our Journey Towards Engineering Efficiency

If you can recall, the first part of this blog post series ended with a description of staging deployment. Our deployment to production starts with a verification that we uphold our hermeticity principle, as explained above.

Our production deployment pipelines can run for several hours for large clusters with rolling releases (few run for days), so we start by acquiring locks to ensure there are no concurrent deployments for any given cluster.

Before making any changes to the environment, we automatically generate release notes, giving engineers a chance to abort if the wrong set of changes are sent to production.

The pipeline next waits for a deployment slot. Early on, engineers adopted deployment windows that coincide with office hours, such that if anything goes wrong, there is always someone on hand to help. Prior to the introduction of Conveyor, however, engineers would manually ask a Slack bot for approval. This interaction is now automated, and the only remaining action left is for the engineer to approve that the deployment can proceed via a single click, in line with our hands-off deployment principle.

When the canary is in production, Conveyor automates monitoring it. This process is similar to the one already discussed in the first part of this blog post: Engineers can configure a set of alerts that Conveyor will keep track of. As soon as any one of the alerts is triggered, Conveyor automatically rolls back the service.

If no alert is raised for the duration of the monitoring period, Conveyor waits again for a deployment slot. It then publishes the release notes for that deployment and completes the deployments for the cluster. After the lock is released and the deployment registered, the pipeline finally comes to its successful completion.

Benefits of Our Journey Towards Engineering Efficiency

All these improvements made over the last 2 years have reduced the effort spent by engineers on deployment while also reducing the failure rate of our deployments.

If you are an engineer working on DevOps in your organisation, you know how hard it can be to measure the impact you made on your organisation. To estimate the time saved by our pipelines, we can model the activities that were previously done manually with a rudimentary weighted graph. In this graph, each edge carries a probability of the activity being performed (100% when unspecified), while each vertex carries the time taken for that activity.

Focusing on our regular staging deployments only, such a graph would look like this:

The overall amount of effort automated by the staging pipelines () is represented in the graph above. It can be converted into the equation below:

This equation shows that for each staging deployment, around 16 minutes of work have been saved. Similarly, for regular production deployments, we find that 67 minutes of work were saved for each deployment:

Moreover, efficiency was not the only benefit brought by the use of deployment pipelines for our traditional microservices. Surprisingly perhaps, the rate of failures related to production changes is progressively reducing while the amount of production changes that were made with Conveyor increased across the organisation (starting at 1.5% of failures per deployments, and finishing at 0.3% on average over the last 3 months for the period of data collected):

Keep Calm and Automate

Since the first draft for this post was written, we’ve made many more improvements to our pipelines. We’ve begun automating Database Migrations; we’ve extended our set of hermetic variables to Amazon Machine Image (AMI) updates; and we’re working towards supporting container deployments.

Through automation, all of Conveyor’s deployment pipelines have contributed to save more than 5,000 man-days of efforts in 2020 alone, across all supported teams. That’s around 20 man-years worth of effort, which is around 3 times the capacity of the team working on the project! Investments in our automation pipelines have more than paid for themselves, and the gains go up every year as more workflows are automated and more teams are onboarded.

If Conveyor has saved efforts for engineering teams, has it then helped to improve velocity? I had opened the first part of this blog post with figures on the deployment funnel for microservice teams at Grab, towards the end of 2018. So where do the figures stand today for these teams?

In the span of 2 years, the average number of build and staging deployment performed each day has not varied much. However, in the last 3 months of 2020, engineers have sent twice more changes to production than they did for the same period in 2018.

Perhaps the biggest recognition received by the team working on the project, was from Grab’s engineers themselves. In the 2020 internal NPS survey for engineering experience at Grab, Conveyor received the highest score of any tools (built in-house or not).

All these improvements in efficiency for our engineers would never have been possible without the hard work of all team members involved in the project, past and present: Tanun Chalermsinsuwan, Aufar Gilbran, Deepak Ramakrishnaiah, Repon Kumar Roy (Kowshik), Su Han, Voislav Dimitrijevikj, Stanley Goh, Htet Aung Shine, Evan Sebastian, Qijia Wang, Oscar Ng, Jacob Sunny, Subhodip Mandal and many others who have contributed and collaborated with them.

Join Us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How Grab Leveraged Performance Marketing Automation to Improve Conversion Rates by 30%

2021-03-22 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/learn-how-grab-leveraged-performance-marketing-automation

Grab, Southeast Asia’s leading superapp, is a hyperlocal three-sided marketplace that operates across hundreds of cities in Southeast Asia. Grab started out as a taxi hailing company back in 2012 and in less than a decade, the business has evolved tremendously and now offers a diverse range of services for consumers’ everyday needs.

To fuel our business growth in newer service offerings such as GrabFood, GrabMart and GrabExpress, user acquisition efforts play a pivotal role in ensuring we create a sustainable Grab ecosystem that balances the marketplace dynamics between our consumers, driver-partners and merchant-partners.

Part of our user growth strategy is centred around our efforts in running direct-response app campaigns to increase trials on our superapp offerings. Executing these campaigns brings about a set of unique challenges against the diverse cultural backdrop present in Southeast Asia, challenging the team to stay hyperlocal in our strategies while driving user volumes at scale. To address these unique challenges, Grab’s performance marketing team is constantly seeking ways to leverage automation and innovate on our operations, improving our marketing efficiency and effectiveness.

Managing Grab’s Ever-expanding Business, Geographical Coverage and New User Acquisition

Grab’s ever-expanding services, extensive geographical coverage and hyperlocal strategies result in an extremely dynamic, yet complex ad account structure. This also means that whenever there is a new business vertical launch or hyperlocal campaign, the team would spend valuable hours rolling out a large volume of new ads across our accounts in the region.

Sample Google Ads account structure — *A sample of our Google Ads account structure.*

The granular structure of our Google Ads account provided us with flexibility to execute hyperlocal strategies, but this also resulted in thousands of ad groups that had to be individually maintained.

In 2019, Grab’s growth was simply outpacing our team’s resources and we finally hit a bottleneck. This challenged the team to take a step back and make the decision to pursue a fully automated solution built on the following principles for long term sustainability:

Building ad-tech solutions in-house instead of acquiring off-the-shelf solutions

Grab’s unique business model calls for several tailor-made features, none of which the existing ad tech solutions were able to provide.
Shifting our mindset to focus on the infinite game

In order to sustain the exponential volume in the ads we run, we had to seek the path of automation.

For our very first automation project, we decided to look into automating creative refresh and upload for our Google Ads account. With thousands of ad groups running multiple creatives each, this had become a growing problem for the team. Overtime, manually monitoring these creatives and refreshing them on a regular basis had become impossible.

The Automation Fundamentals

Grab’s superapp nature means that any automation solution fundamentally needs to be robust:

Performance-driven – to maintain and improve conversion efficiency over time
Flexibility – to fit needs across business verticals and hyperlocal execution
Inclusivity – to account for future service launches and marketing tech (e.g. product feeds and more)
Scalability – to account for future geography/campaign coverage

With these in mind, we incorporated them in our requirements for the custom creative automation tool we planned to build.

Performance-driven – while many advertising platforms, such as Google’s App Campaigns, have built-in algorithms to prevent low-performing creatives from being served, the fundamental bottleneck lies in the speed in which these low-performing creatives can be replaced with new assets to improve performance. Thus, solving this bottleneck would become the primary goal of our tool.
Flexibility – to accommodate our broad range of services, geographies and marketing objectives, a mapping logic would be required to make sure the right creatives are added into the right campaigns.

To solve this, we relied on a standardised creative naming convention, using key attributes in the file name to map an asset to a specific campaign and ad group based on:
- Market
- City
- Service type
- Language
- Creative theme
- Asset type
- Campaign optimisation goal
Inclusivity – to address coverage of future service offerings and interoperability with existing ad-tech vendors, we designed and built our tool conforming to many industry API and platform standards.
Scalability – to ensure full coverage of future geographies/campaigns, the in-house solution’s frontend and backend had to be robust enough to handle volume. Working hand in glove with Google, the solution was built by leveraging multiple APIs including Google Ads and Youtube to host and replace low-performing assets across our ad groups. The solution was then deployed on AWS’ serverless compute engine.

Enter CARA

CARA is an automation tool that scans for any low-performing creatives and replaces them with new assets from our creative library:

CARA Workflow — *A sneak peek of how CARA works*

In a controlled experimental launch, we saw nearly 2,000 underperforming assets automatically replaced across more than 8,000 active ad groups, translating to an 18-30% increase in clickthrough and conversion rates.

Subset of results from CARA experimental launch — *A subset of results from CARA’s experimental launch*

Through automation, Grab’s performance marketing team has been able to significantly improve clickthrough and conversion rates while saving valuable man-hours. We have also established a scalable foundation for future growth. The best part? We are just getting started.

Authored on behalf of the performance marketing team @ Grab. Special thanks to the CRM data analytics team, particularly Milhad Miah and Vaibhav Vij for making this a reality.

Join Us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Integrating AWS Device Farm with your CI/CD pipeline to run cross-browser Selenium tests

2021-03-04 Mahesh Biradar

Post Syndicated from Mahesh Biradar original https://aws.amazon.com/blogs/devops/integrating-aws-device-farm-with-ci-cd-pipeline-to-run-cross-browser-selenium-tests/

Continuously building, testing, and deploying your web application helps you release new features sooner and with fewer bugs. In this blog, you will create a continuous integration and continuous delivery (CI/CD) pipeline for a web app using AWS CodeStar services and AWS Device Farm’s desktop browser testing service. AWS CodeStar is a suite of services that help you quickly develop and build your web apps on AWS.

AWS Device Farm’s desktop browser testing service helps developers improve the quality of their web apps by executing their Selenium tests on different desktop browsers hosted in AWS. For each test executed on the service, Device Farm generates action logs, web driver logs, video recordings, to help you quickly identify issues with your app. The service offers pay-as-you-go pricing so you only pay for the time your tests are executing on the browsers with no upfront commitments or additional costs. Furthermore, Device Farm provides a default concurrency of 50 Selenium sessions so you can run your tests in parallel and speed up the execution of your test suites.

After you’ve completed the steps in this blog, you’ll have a working pipeline that will build your web app on every code commit and test it on different versions of desktop browsers hosted in AWS.

Solution overview

In this solution, you use AWS CodeStar to create a sample web application and a CI/CD pipeline. The following diagram illustrates our solution architecture.

Figure 1: Deployment pipeline architecture

Prerequisites

Before you deploy the solution, complete the following prerequisite steps:

On the AWS Management Console, search for AWS CodeStar.
Choose Getting Started.
Choose Create Project.
Choose a sample project.

For this post, we choose a Python web app deployed on Amazon Elastic Compute Cloud (Amazon EC2) servers.

For Templates, select Python (Django).
Choose Next.

Figure 2: Choose a template in CodeStar

For Project name, enter Demo CICD Selenium.
Leave the remaining settings at their default.

You can choose from a list of existing key pairs. If you don’t have a key pair, you can create one.

Choose Next.
Choose Create project.

Figure 3: Verify the project configuration

AWS CodeStar creates project resources for you as listed in the following table.

Service	Resource Created
AWS CodePipeline project	demo-cicd-selen-Pipeline
AWS CodeCommit repository	Demo-CICD-Selenium
AWS CodeBuild project	demo-cicd-selen
AWS CodeDeploy application	demo-cicd-selen
AWS CodeDeploy deployment group	demo-cicd-selen-Env
Amazon EC2 server	Tag: Environment = demo-cicd-selen-WebApp
IAM role	AWSCodeStarServiceRole, CodeStarWorker*

Your EC2 instance needs to have access to run AWS Device Farm to run the Selenium test scripts. You can use service roles to achieve that.

Attach policy AWSDeviceFarmFullAccess to the IAM role CodeStarWorker-demo-cicd-selen-WebApp.

You’re now ready to create an AWS Cloud9 environment.

Check the Pipelines tab of your AWS CodeStar project; it should show success.

On the IDE tab, under Cloud 9 environments, choose Create environment.
For Environment name, enter Demo-CICD-Selenium.
Choose Create environment.
Wait for the environment to be complete and then choose Open IDE.
In the IDE, follow the instructions to set up your Git and make sure it’s up to date with your repo.

The following screenshot shows the verification that the environment is set up.

Figure 4: Verify AWS Cloud9 setup

You can now verify the deployment.

On the Amazon EC2 console, choose Instances.
Choose demo-cicd-selen-WebApp.
Locate its public IP and choose open address.

Figure 5: Locating IP address of the instance

A webpage should open. If it doesn’t, check your VPN and firewall settings.

Now that you have a working pipeline, let’s move on to creating a Device Farm project for browser testing.

Creating a Device Farm project

To create your browser testing project, complete the following steps:

On the Device Farm console, choose Desktop browser testing project.
Choose Create a new project.
For Project name, enter Demo cicd selenium.
Choose Create project.

Figure 6: Creating AWS Device Farm project

Note down the project ARN; we use this in the Selenium script for the remote web driver.

Figure 7: Project ARN for AWS Device Farm project

Testing the solution

This solution uses the following script to run browser testing. We call this script in the validate service lifecycle hook of the CodeDeploy project.

Open your AWS Cloud9 IDE (which you made as a prerequisite).
Create a folder tests under the project root directory.
Add the sample Selenium script browser-test-sel.py under tests folder with the following content (replace <sample_url> with the url of your web application refer pre-requisite step 18):

import boto3
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

devicefarm_client = boto3.client("devicefarm", region_name="us-west-2")
testgrid_url_response = devicefarm_client.create_test_grid_url(
    projectArn="arn:aws:devicefarm:us-west-2:<your project ARN>",
    expiresInSeconds=300)

driver = webdriver.Remote(testgrid_url_response["url"],
                          webdriver.DesiredCapabilities.FIREFOX)
try:
    driver.implicitly_wait(30)
    driver.maximize_window()
    driver.get("<sample_url>")
    if driver.find_element_by_id("Layer_1"):
        print("graphics generated in full screen")
    assert driver.find_element_by_id("Layer_1")
    driver.set_window_position(0, 0) and driver.set_window_size(1000, 400)
    driver.get("<sample_url>")
    tower = driver.find_element_by_id("tower1")
    if tower.is_displayed():
        print("graphics generated after resizing")
    else:
        print("graphics not generated at this window size")
       # this is where you can fail the script with error if you expect the graphics to load. And pipeline will terminate
except Exception as e:
    print(e)
finally:
    driver.quit()

This script launches a website (in Firefox) created by you in the prerequisite step using the AWS CodeStar template and verifies if a graphic element is loaded.

The following screenshot shows a full-screen application with graphics loaded.

Figure 8: Full screen application with graphics loaded

The following screenshot shows the resized window with no graphics loaded.

Figure 9: Resized window application with No graphics loaded

Create the file validate_service in the scripts folder under the root directory with the following content:

#!/bin/bash
if [ "$DEPLOYMENT_GROUP_NAME" == "demo-cicd-selen-Env" ]
then
cd /home/ec2-user
source environment/bin/activate
python tests/browser-test-sel.py
fi

This script is part of CodeDeploy scripts and determines whether to stop the pipeline or continue based on the output from the browser testing script from the preceding step.

Modify the file appspec.yml under the root directory, add the tests files and ValidateService hook , the file should look like following:

version: 0.0
os: linux
files:
 - source: /ec2django/
   destination: /home/ec2-user/ec2django
 - source: /helloworld/
   destination: /home/ec2-user/helloworld
 - source: /manage.py
   destination: /home/ec2-user
 - source: /supervisord.conf
   destination: /home/ec2-user
 - source: /requirements.txt
   destination: /home/ec2-user
 - source: /requirements/
   destination: /home/ec2-user/requirements
 - source: /tests/
   destination: /home/ec2-user/tests

permissions:
  - object: /home/ec2-user/manage.py
    owner: ec2-user
    mode: 644
    type:
      - file
  - object: /home/ec2-user/supervisord.conf
    owner: ec2-user
    mode: 644
    type:
      - file
hooks:
  AfterInstall:
    - location: scripts/install_dependencies
      timeout: 300
      runas: root
    - location: scripts/codestar_remote_access
      timeout: 300
      runas: root
    - location: scripts/start_server
      timeout: 300
      runas: root

  ApplicationStop:
    - location: scripts/stop_server
      timeout: 300
      runas: root

  ValidateService:
    - location: scripts/validate_service
      timeout: 600
      runas: root

This file is used by AWS CodeDeploy service to perform the deployment and validation steps.

Modify the artifacts section in the buildspec.yml file. The section should look like the following:

artifacts:
  files:
    - 'template.yml'
    - 'ec2django/**/*'
    - 'helloworld/**/*'
    - 'scripts/**/*'
    - 'tests/**/*'
    - 'appspec.yml'
    - 'manage.py'
    - 'requirements/**/*'
    - 'requirements.txt'
    - 'supervisord.conf'
    - 'template-configuration.json'

This file is used by AWS CodeBuild service to package the code

Modify the file Common.txt in the requirements folder under the root directory, the file should look like the following:

# dependencies common to all environments 
Django=2.1.15 
selenium==3.141.0 
boto3 >= 1.10.44
pytest

Save All the changes, your folder structure should look like the following:

── README.md
├── appspec.yml*
├── buildspec.yml*
├── db.sqlite3
├── ec2django
├── helloworld
├── manage.py
├── requirements
│   ├── common.txt*
│   ├── dev.txt
│   ├── prod.txt
├── requirements.txt
├── scripts
│   ├── codestar_remote_access
│   ├── install_dependencies
│   ├── start_server
│   ├── stop_server
│   └── validate_service**
├── supervisord.conf
├── template-configuration.json
├── template.yml
├── tests
│   └── browser-test-sel.py**

**newly added files
*modified file

Running the tests

The Device Farm desktop browsing project is now integrated with your pipeline. All you need to do now is commit the code, and CodePipeline takes care of the rest.

On Cloud9 terminal, go to project root directory.
Run git add . to stage all changed files for commit.
Run git commit -m “<commit message>” to commit the changes.
Run git push to push the changes to repository, this should trigger the Pipeline.
Check the Pipelines tab of your AWS CodeStar project; it should show success.
Go to AWS Device Farm console and click on Desktop browser testing projects.
Click on your Project Demo cicd selenium.

You can verify the running of your Selenium test cases using the recorded run steps shown on the Device Farm console, the video of the run, and the logs, all of which can be downloaded on the Device Farm console, or using the AWS SDK and AWS Command Line Interface (AWS CLI).

The following screenshot shows the project run details on the console.

Figure 10: Viewing AWS Device Farm project run details

To Test the Selenium script locally you can run the following commands.

1. Create a Python virtual environment for your Django project. This virtual environment allows you to isolate this project and install any packages you need without affecting the system Python installation. At the terminal, go to project root directory and type the following command:

$ python3 -m venv ./venv

2. Activate the virtual environment:

$ source ./venv/bin/activate

3. Install development Python dependencies for this project:

$ pip install -r requirements/dev.txt

4. Run Selenium Script:

$ python tests/browser-test-sel.py

Testing the failure scenario (optional)

To test the failure scenario, you can modify the sample script browser-test-sel.py at the else statement.

The following code shows the lines to change:

else:
print("graphics was not generated at this form size")
# this is where you can fail the script with error if you expect the graphics to load. And pipeline will terminate

The following is the updated code:

else:
exit(1)
# this is where you can fail the script with error if you expect the graphics to load. And pipeline will terminate

Commit the change, and the pipeline should fail and stop the deployment.

Conclusion

Integrating Device Farm with CI/CD pipelines allows you to control deployment based on browser testing results. Failing the Selenium test on validation failures can stop and roll back the deployment, and a successful testing can continue the pipeline to deploy the solution to the final stage. Device Farm offers a one-stop solution for testing your native and web applications on desktop browsers and real mobile devices.

Mahesh Biradar is a Solutions Architect at AWS. He is a DevOps enthusiast and enjoys helping customers implement cost-effective architectures that scale..

Automate Amazon EC2 instance isolation by using tags

2021-03-01 Jose Obando

Post Syndicated from Jose Obando original https://aws.amazon.com/blogs/security/automate-amazon-ec2-instance-isolation-by-using-tags/

Containment is a crucial part of an overall Incident Response Strategy, as this practice allows time for responders to perform forensics, eradication and recovery during an Incident. There are many different approaches to containment. In this post, we will be focusing on isolation—the ability to keep multiple targets separated so that each target only sees and affects itself—as a containment strategy.

I’ll show you how to automate isolation of an Amazon Elastic Compute Cloud (Amazon EC2) instance by using an AWS Lambda function that’s triggered by tag changes on the instance, as reported by Amazon CloudWatch Events.

CloudWatch Event Rules deliver a near real-time stream of system events that describe changes in Amazon Web Services (AWS) resources. See also Amazon EventBridge.

Preparing for an incident is important as outlined in the Security Pillar of the AWS Well-Architected Framework.

Out of the 7 Design Principles for Security in the Cloud, as per the Well-Architected Framework, this solution will cover the following:

Enable traceability: Monitor, alert, and audit actions and changes to your environment in real time. Integrate log and metric collection with systems to automatically investigate and take action.
Automate security best practices: Automated software-based security mechanisms can improve your ability to securely scale more rapidly and cost-effectively. Create secure architectures, including through the implementation of controls that can be defined and managed by AWS as code in version-controlled templates.
Prepare for security events: Prepare for an incident by implementing incident management and investigation policy and processes that align to your organizational requirements. Run incident response simulations and use tools with automation to help increase your speed for detection, investigation, and recovery.

After detecting an event in the Detection phase and analyzing in the Analysis phase, you can automate the process of logically isolating an instance from a Virtual Private Cloud (VPC) in Amazon Web Services (AWS).

In this blog post, I describe how to automate EC2 instance isolation by using the tagging feature that a responder can use to identify instances that need to be isolated. A Lambda function then uses AWS API calls to isolate the instances by performing the actions described in the following sections.

Use cases

Your organization can use automated EC2 instance isolation for scenarios like these:

A security analyst wants to automate EC2 instance isolation in order to respond to security events in a timely manner.
A security manager wants to provide their first responders with a way to quickly react to security incidents without providing too much access to higher security features.

High-level architecture and design

The example solution in this blog post uses a CloudWatch Events rule to trigger a Lambda function. The CloudWatch Events rule is triggered when a tag is applied to an EC2 instance. The Lambda code triggers further actions based on the contents of the event. Note that the CloudFormation template includes the appropriate permissions to run the function.

The event flow is shown in Figure 1 and works as follows:

The EC2 instance is tagged.
The CloudWatch Events rule filters the event.
The Lambda function is invoked.
If the criteria are met, the isolation workflow begins.

When the Lambda function is invoked and the criteria are met, these actions are performed:

Checks for IAM instance profile associations.
If the instance is associated to a role, the Lambda function disassociates that role.
Applies the isolation role that you defined during CloudFormation stack creation.
Checks the VPC where the EC2 instance resides.
- If there is no isolation security group in the VPC (if the VPC is new, for example), the function creates one.
Applies the isolation security group.

Note: If you had a security group with an open (0.0.0.0/0) outbound rule, and you apply this Isolation security group, your existing SSH connections to the instance are immediately dropped. On the other hand, if you have a narrower inbound rule that initially allows the SSH connection, the existing connection will not be broken by changing the group. This is known as Connection Tracking.

Figure 1: High-level diagram showing event flow

For the deployment method, we will be using an AWS CloudFormation Template. AWS CloudFormation gives you an easy way to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycles, by treating infrastructure as code.

The AWS CloudFormation template that I provide here deploys the following resources:

An EC2 instance role for isolation – this is attached to the EC2 Instance to prevent further communication with other AWS Services thus limiting the attack surface to your overall infrastructure.
An Amazon CloudWatch Events rule – this is used to detect changes to an AWS EC2 resource, in this case a “tag change event”. We will use this as a trigger to our Lambda function.
An AWS Identity and Access Management (IAM) role for Lambda functions – this is what the Lambda function will use to execute the workflow.
A Lambda function for automation – this function is where all the decision logic sits, once triggered it will follow a set of steps described in the following section.
Lambda function permissions – this is used by the Lambda function to execute.
An IAM instance profile – this is a container for an IAM role that you can use to pass role information to an EC2 instance.

Supporting functions within the Lambda function

Let’s dive deeper into each supporting function inside the Lambda code.

The following function identifies the virtual private cloud (VPC) ID for a given instance. This is needed to identify which security groups are present in the VPC.

def identifyInstanceVpcId(instanceId):
    instanceReservations = ec2Client.describe_instances(InstanceIds=[instanceId])['Reservations']
    for instanceReservation in instanceReservations:
        instancesDescription = instanceReservation['Instances']
        for instance in instancesDescription:
            return instance['VpcId']

The following function modifies the security group of an EC2 instance.

def modifyInstanceAttribute(instanceId,securityGroupId):
    response = ec2Client.modify_instance_attribute(
        Groups=[securityGroupId],
        InstanceId=instanceId)

The following function creates a security group on a VPC that blocks all egress access to the security group.

def createSecurityGroup(groupName, descriptionString, vpcId):
    resource = boto3.resource('ec2')
    securityGroupId = resource.create_security_group(GroupName=groupName, Description=descriptionString, VpcId=vpcId)
    securityGroupId.revoke_egress(IpPermissions= [{'IpProtocol': '-1','IpRanges': [{'CidrIp': '0.0.0.0/0'}],'Ipv6Ranges': [],'PrefixListIds': [],'UserIdGroupPairs': []}])
    return securityGroupId.

Deploy the solution

To deploy the solution provided in this blog post, first download the CloudFormation template, and then set up a CloudFormation stack that specifies the tags that are used to trigger the automated process.

Download the CloudFormation template

To get started, download the CloudFormation template from Amazon S3. Alternatively, you can launch the CloudFormation template by selecting the following Launch Stack button:

Deploy the CloudFormation stack

Start by uploading the CloudFormation template to your AWS account.

To upload the template

From the AWS Management Console, open the CloudFormation console.
Choose Create Stack.
Choose With new resources (standard).
Choose Upload a template file.
Select Choose File, and then select the YAML file that you just downloaded.

Figure 2: CloudFormation stack creation

Specify stack details

You can leave the default values for the stack as long as there aren’t any resources provisioned already with the same name, such as an IAM role. For example, if left with default values an IAM role named “SecurityIsolation-IAMRole” will be created. Otherwise, the naming convention is fully customizable from this screen and you can enter your choice of name for the CloudFormation stack, and modify the parameters as you see fit. Figure 3 shows the parameters that you can set.

The Evaluation Parameters section defines the tag key and value that will initiate the automated response. Keep in mind that these values are case-sensitive.

Figure 3: CloudFormation stack parameters

Choose Next until you reach the final screen, shown in Figure 4, where you acknowledge that an IAM role is created and you trust the source of this template. Select the check box next to the statement I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create Stack.

Figure 4: CloudFormation IAM notification

After you complete these steps, the following resources will be provisioned, as shown in Figure 5:

EC2IsolationRole
EC2TagChangeEvent
IAMRoleForLambdaFunction
IsolationLambdaFunction
IsolationLambdaFunctionInvokePermissions
RootInstanceProfile

Figure 5: CloudFormation created resources

Testing

To start your automation, tag an EC2 instance using the tag defined during the CloudFormation setup. If you’re using the Amazon EC2 console, you can apply tags to resources by using the Tags tab on the relevant resource screen, or you can use the Tags screen, the AWS CLI or an AWS SDK. A detailed walkthrough for each approach can be found in the Amazon EC2 Documentation page.

Reverting Changes

If you need to remove the restrictions applied by this workflow, complete the following steps.

From the EC2 dashboard, in the Instances section, check the box next to the instance you want to modify.

Figure 6: Select the instance to modify
In the top right, select Actions, choose Instance settings, and then choose Modify IAM role.

Figure 7: Choose Actions > Instance settings > Modify IAM role
Under IAM role, choose the IAM role to attach to your instance, and then select Save.

Figure 8: Choose the IAM role to attach
Select Actions, choose Networking, and then choose Change security groups.

Figure 9: Choose Actions > Networking > Change security groups
Under Associated security groups, select Remove and add a different security group with the access you want to grant to this instance.

Summary

Using the CloudFormation template provided in this blog post, a Security Operations Center analyst could have only tagging privileges and isolate an EC2 instance based on this tag. Or a security service such as Amazon GuardDuty could trigger a lambda to apply the tag as part of a workflow. This means the Security Operations Center analyst wouldn’t need administrative privileges over the EC2 service.

This solution creates an isolation security group for any new VPCs that don’t have one already. The security group would still follow the naming convention defined during the CloudFormation stack launch, but won’t be part of the provisioned resources. If you decide to delete the stack, manual cleanup would be required to remove these security groups.

This solution terminates established inbound Secure Shell (SSH) sessions that are associated to the instance, and isolates the instance from new connections either inbound or outbound. For outbound connections that are already established (for example, reverse shell), you either need to shut down the network interface card (NIC) at the operating system (OS) level, restart the instance network stack at the OS level, terminate the established connections, or apply a network access control list (network ACL).

For more information, see the following documentation:

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Growth Engineering at Netflix — Automated Imagery Generation

2021-02-09 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/growth-engineering-at-netflix-automated-imagery-generation-5a105fd51569

Growth Engineering at Netflix — Automated Imagery Generation

by Eric Eiswerth

Background

There’s a good chance you’ve probably visited the Netflix homepage. In the Growth Engineering team, we refer to this as the top of the signup funnel. For more background on the signup funnel and Growth Engineering’s role in the signup funnel, please read our initial post on the topic: Growth Engineering at Netflix — Accelerating Innovation. The primary focus of this post will be the top of the signup funnel. In particular, the Netflix homepage:

As discussed in our previous post, Growth Engineering owns the business logic and protocols that allow our UI partners to build lightweight and flexible applications for almost any platform. In some cases, like the homepage, this even involves providing appropriate imagery (e.g., the background image shown above). In this post, we’ll take a deep dive into the journey of content-based imagery on the Netflix homepage.

Motivation

At Netflix we do one thing — entertainment — and we aim to do it really well. We live and breathe TV shows and films, and we want everyone to be able to enjoy them too. That’s why we aspire to have best in class stories, across genres and believe people should have access to new voices, cultures and perspectives. The member-focused teams at Netflix are responsible for making sure the member experience is relevant and personalized, ensuring that this content is shown to the right people at the right time. But what about non-members; those who are simply interested in signing up for Netflix, how should we highlight our content and convey our value propositions to them?

The Solution

The main mechanism for highlighting our content in the signup flow is through content-based imagery. Before designing a solution it’s important to understand the main product requirements for such a feature:

The content needs to be new, relevant, and regional (not all countries have the same catalogue).
The artwork needs to appeal to a broader audience. The non-member homepage serves a very broad audience and is not personalized to the extent of the member experience.
The imagery needs to be localized.
We need to be able to easily determine what imagery is present for a given platform, region, and language.
The homepage needs to load in a reasonable amount of time, even in poor network conditions.

Unpacking Product Requirements

Given the scale we require and the product requirements listed above, there are a number of technical requirements:

A list of titles for the asset, in some order.
Ensure the titles are appropriate for a broad audience, which means all titles need to be tagged with metadata.
Localized images for each of the titles.
Different assets for different device types and screen sizes.
Server-generated assets, since client-side generation would require the retrieval of many individual images, which would increase latency and time-to-render.
To reduce latency, assets should be generated in an offline fashion and not in real time.
The assets need to be compressed, without reducing quality significantly.
The assets will need to be stored somewhere and we’ll need to generate URLs for each of them.
We’ll need to figure out how to provide the correct asset URL for a given request.
We’ll need to build a search index so that the assets can be searchable.

Given this set of requirements, we can effectively break this work down into 3 functional buckets:

The Design

For our design, we decided to build 3 separate microservices, mapping to the aforementioned functional buckets. Let’s take a look at each of these services in turn.

Asset Generation

The Asset Generation Service is responsible for generating groups of assets. We call these groups of assets, asset groups. Each request will generate a single asset group that will contain one or more assets. To support the demands of our stakeholders we designed a Domain Specific Language (DSL) that we call an asset generation recipe. An asset generation request contains a recipe. Below is an example of a simple recipe:

{
  "titleIds": [12345, 23456, 34567, …],
  "countries": [“US”],
  "type": “perspective”, // this is the design of the asset
  "rows": 10, // the number of rows and columns can control density
  "cols": 15,
  "padding": 10, // padding between individual images
  "columnOffsets": [0, 0, 0, 0…], // the y-offset for each column
  "rowOffsets": [0, -100, 0, -100, …], // the x-offset for each row
  "size": [1920, 1080] // size in pixels
}

This recipe can then be issued via an HTTP POST request to the Asset Generation Service. The recipe can then be translated into ImageMagick commands that can do the heavy lifting. At a high level, the following diagram captures the necessary steps required to build an asset.

Generating a single localized asset is a big achievement, but we still need to store the asset somewhere and have the ability to search for it. This requires an asset storage solution.

Asset Storage

We refer to asset storage and management simply as asset management. We felt it would be beneficial to create a separate microservice for asset management for 2 reasons. First, asset generation is CPU intensive and bursty. We can leverage high performance VMs in AWS to generate the assets. We can scale up when generation is occurring and scale down when there is no batch in the queue. However, it would be cost-inefficient to leverage this same hardware for lightweight and more consistent traffic patterns that an asset management service requires.

Let’s take a look at the internals of the Asset Management Service.

At this point we’ve laid out all the details in order to generate a content-based asset and have it stored as part of an asset group, which is persisted and indexed. The next thing to consider is, how do we retrieve an asset in real time and surface it on the Netflix homepage?

If you recall in our previous blog post, Growth Engineering owns a service called the Orchestration Service. It is a mid-tier service that emits a custom JSON data structure that contains fields that are consumed by the UI. The UI can then use these fields to control the presentation in the UI layer. There are two approaches for adding fields to the Orchestration Service’s response. First, the fields can be coded by hand. Second, fields can be added via configuration via a service we call the Customization Service. Since assets will need to be periodically refreshed and we want this process to be entirely automated, it makes sense to pursue the configuration-based approach. To accomplish this, the Asset Management Service needs to translate an asset group into a rule definition for the Customization Service.

Customization Service

Let’s review the Orchestration Service and introduce the Customization Service. The Orchestration Service emits fields in response to upstream requests. For the homepage, there are typically only a small number of fields provided by the Orchestration Service. The following fields are supplied by application code. For example:

{
  “fields”: {
    “email” : {
      “type”: “StringField”,
      “value”: “”
    },
    “nextAction”: {
      “type”: “Action”,
      “withFields” [“email”]
    }
  }
}

The Orchestration Service also supports fields supplied by configuration. We call these adaptive fields. Adaptive fields are provided by the Customization Service. The Customization Service is a rules engine that emits the adaptive fields. For example, a rule to provide the background image for the homepage in the en-US locale would look as follows:

{
  “country”: “US”,
  “language”: “en”,
  “platform”: “browser”,
  “resolution”: “high”
}

The corresponding payload for such a rule might look as follows:

{
  “backgroundImage”: “https://cdn.netflix.com/bgimageurl.jpg”
}

Bringing this all together, the response from the Orchestration Service would now look as follows:

{
  “fields”: {
    “email” : {
      “type”: “StringField”,
      “value”: “”
    },
    “nextAction”: {
      “type”: “Action”,
      “withFields” [“email”]
    }
  },
  “adaptiveFields”: {
    “backgroundImage”: “https://cdn.netflix.com/bgimageurl.jpg”
  }
}

At this point, we are now able to generate an asset, persist it, search it, and generate customization rules for it. The generated rules then enable us to return a particular asset for a particular request. Let’s put it all together and review the system interaction diagram.

We now have all the pieces in place to automatically generate artwork and have that artwork appear on the Netflix homepage for a given request. At least one open question remains, how can we scale asset generation?

Scaling Asset Generation

Arguably, there are a number of approaches that could be used to scale asset generation. We decided to opt for an all-or-nothing approach. Meaning, all assets for a given recipe need to be generated as a single asset group. This enables smooth rollback in case of any errors. Additionally, asset generation is CPU intensive and each recipe can produce 1000s of assets as a result of the number of platform, region, and language permutations. Even with high performance VMs, generating 1000s of assets can take a long time. As a result, we needed to find a way to distribute asset generation across multiple VMs. Here’s what the final architecture looked like.

Briefly, let’s review the steps:

The batch process is initiated by a cron job. The job executes a script that contains an asset generation recipe.
The Asset Generation Service receives the request and creates asset generation tasks that can be distributed across any number of Asset Generation Worker nodes. One of the nodes is elected as the leader via Zookeeper. Its job is to coordinate asset generation across the other workers and ensure all assets get generated.
Once the primary worker node has all the assets, it creates an asset group in the Asset Management Service. The Asset Management Service persists, indexes, and uploads the assets to the CDN.
Finally, the Asset Management Service creates rules from the asset group and pushes the rules to the Customization Service. Once the data is published in the Customization Service, the Orchestration Service can supply the correct URLs in its JSON response by invoking the Customization Service with a request context that matches a given set of rules.

Conclusion

Automated asset generation has proven to be an extremely valuable investment. It is low-maintenance, high-leverage, and has allowed us to experiment with a variety of different types of assets on different platforms and on different parts of the product. This project was technically challenging and highly rewarding, both to the engineers involved in the project, and to the business. The Netflix homepage has come a long way over the last several years.

We’re hiring! Join Growth Engineering and help us build the future of Netflix.

Growth Engineering at Netflix — Automated Imagery Generation was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to deploy the AWS Solution for Security Hub Automated Response and Remediation

2020-11-20 Ramesh Venkataraman

Post Syndicated from Ramesh Venkataraman original https://aws.amazon.com/blogs/security/how-to-deploy-the-aws-solution-for-security-hub-automated-response-and-remediation/

In this blog post I show you how to deploy the Amazon Web Services (AWS) Solution for Security Hub Automated Response and Remediation. The first installment of this series was about how to create playbooks using Amazon CloudWatch Events, AWS Lambda functions, and AWS Security Hub custom actions that you can run manually based on triggers from Security Hub in a specific account. That solution requires an analyst to directly trigger an action using Security Hub custom actions and doesn’t work for customers who want to set up fully automated remediation based on findings across one or more accounts from their Security Hub master account.

The solution described in this post automates the cross-account response and remediation lifecycle from executing the remediation action to resolving the findings in Security Hub and notifying users of the remediation via Amazon Simple Notification Service (Amazon SNS). You can also deploy these automated playbooks as custom actions in Security Hub, which allows analysts to run them on-demand against specific findings. You can deploy these remediations as custom actions or as fully automated remediations.

Currently, the solution includes 10 playbooks aligned to the controls in the Center for Internet Security (CIS) AWS Foundations Benchmark standard in Security Hub, but playbooks for other standards such as AWS Foundational Security Best Practices (FSBP) will be added in the future.

Solution overview

Figure 1 shows the flow of events in the solution described in the following text.

Figure 1: Flow of events

Detect

Security Hub gives you a comprehensive view of your security alerts and security posture across your AWS accounts and automatically detects deviations from defined security standards and best practices.

Security Hub also collects findings from various AWS services and supported third-party partner products to consolidate security detection data across your accounts.

Ingest

All of the findings from Security Hub are automatically sent to CloudWatch Events and Amazon EventBridge and you can set up CloudWatch Events and EventBridge rules to be invoked on specific findings. You can also send findings to CloudWatch Events and EventBridge on demand via Security Hub custom actions.

Remediate

The CloudWatch Event and EventBridge rules can have AWS Lambda functions, AWS Systems Manager automation documents, or AWS Step Functions workflows as the targets of the rules. This solution uses automation documents and Lambda functions as response and remediation playbooks. Using cross-account AWS Identity and Access Management (IAM) roles, the playbook performs the tasks to remediate the findings using the AWS API when a rule is invoked.

Log

The playbook logs the results to the Amazon CloudWatch log group for the solution, sends a notification to an Amazon Simple Notification Service (Amazon SNS) topic, and updates the Security Hub finding. An audit trail of actions taken is maintained in the finding notes. The finding is updated as RESOLVED after the remediation is run. The security finding notes are updated to reflect the remediation performed.

Here are the steps to deploy the solution from this GitHub project.

In the Security Hub master account, you deploy the AWS CloudFormation template, which creates an AWS Service Catalog product along with some other resources. For a full set of what resources are deployed as part of an AWS CloudFormation stack deployment, you can find the full set of deployed resources in the Resources section of the deployed AWS CloudFormation stack. The solution uses the AWS Service Catalog to have the remediations available as a product that can be deployed after granting the users the required permissions to launch the product.
Add an IAM role that has administrator access to the AWS Service Catalog portfolio.
Deploy the CIS playbook from the AWS Service Catalog product list using the IAM role you added in the previous step.
Deploy the AWS Security Hub Automated Response and Remediation template in the master account in addition to the member accounts. This template establishes AssumeRole permissions to allow the playbook Lambda functions to perform remediations. Use AWS CloudFormation StackSets in the master account to have a centralized deployment approach across the master account and multiple member accounts.

Deployment steps for automated response and remediation

This section reviews the steps to implement the solution, including screenshots of the solution launched from an AWS account.

Launch AWS CloudFormation stack on the master account

As part of this AWS CloudFormation stack deployment, you create custom actions to configure Security Hub to send findings to CloudWatch Events. Lambda functions are used to provide remediation in response to actions sent to CloudWatch Events.

Note: In this solution, you create custom actions for the CIS standards. There will be more custom actions added for other security standards in the future.

To launch the AWS CloudFormation stack

Deploy the AWS CloudFormation template in the Security Hub master account. In your AWS console, select CloudFormation and choose Create new stack and enter the S3 URL.
Select Next to move to the Specify stack details tab, and then enter a Stack name as shown in Figure 2. In this example, I named the stack SO0111-SHARR, but you can use any name you want.

Figure 2: Creating a CloudFormation stack
Creating the stack automatically launches it, creating 21 new resources using AWS CloudFormation, as shown in Figure 3.

Figure 3: Resources launched with AWS CloudFormation
An Amazon SNS topic is automatically created from the AWS CloudFormation stack.
When you create a subscription, you’re prompted to enter an endpoint for receiving email notifications from Amazon SNS as shown in Figure 4. To subscribe to that topic that was created using CloudFormation, you must confirm the subscription from the email address you used to receive notifications.

Figure 4: Subscribing to Amazon SNS topic

Enable Security Hub

You should already have enabled Security Hub and AWS Config services on your master account and the associated member accounts. If you haven’t, you can refer to the documentation for setting up Security Hub on your master and member accounts. Figure 5 shows an AWS account that doesn’t have Security Hub enabled.

Figure 5: Enabling Security Hub for first time

AWS Service Catalog product deployment

In this section, you use the AWS Service Catalog to deploy Service Catalog products.

To use the AWS Service Catalog for product deployment

In the same master account, add roles that have administrator access and can deploy AWS Service Catalog products. To do this, from Services in the AWS Management Console, choose AWS Service Catalog. In AWS Service Catalog, select Administration, and then navigate to Portfolio details and select Groups, roles, and users as shown in Figure 6.

Figure 6: AWS Service Catalog product
After adding the role, you can see the products available for that role. You can switch roles on the console to assume the role that you granted access to for the product you added from the AWS Service Catalog. Select the three dots near the product name, and then select Launch product to launch the product, as shown in Figure 7.

Figure 7: Launch the product
While launching the product, you can choose from the parameters to either enable or disable the automated remediation. Even if you do not enable fully automated remediation, you can still invoke a remediation action in the Security Hub console using a custom action. By default, it’s disabled, as highlighted in Figure 8.

Figure 8: Enable or disable automated remediation
After launching the product, it can take from 3 to 5 minutes to deploy. When the product is deployed, it creates a new CloudFormation stack with a status of CREATE_COMPLETE as part of the provisioned product in the AWS CloudFormation console.

AssumeRole Lambda functions

Deploy the template that establishes AssumeRole permissions to allow the playbook Lambda functions to perform remediations. You must deploy this template in the master account in addition to any member accounts. Choose CloudFormation and create a new stack. In Specify stack details, go to Parameters and specify the Master account number as shown in Figure 9.

Figure 9: Deploy AssumeRole Lambda function

Test the automated remediation

Now that you’ve completed the steps to deploy the solution, you can test it to be sure that it works as expected.

To test the automated remediation

To test the solution, verify that there are 10 actions listed in Custom actions tab in the Security Hub master account. From the Security Hub master account, open the Security Hub console and select Settings and then Custom actions. You should see 10 actions, as shown in Figure 10.

Figure 10: Custom actions deployed
Make sure you have member accounts available for testing the solution. If not, you can add member accounts to the master account as described in Adding and inviting member accounts.
For testing purposes, you can use CIS 1.5 standard, which is to require that the IAM password policy requires at least one uppercase letter. Check the existing settings by navigating to IAM, and then to Account Settings. Under Password policy, you should see that there is no password policy set, as shown in Figure 11.

Figure 11: Password policy not set
To check the security settings, go to the Security Hub console and select Security standards. Choose CIS AWS Foundations Benchmark v1.2.0. Select CIS 1.5 from the list to see the Findings. You will see the Status as Failed. This means that the password policy to require at least one uppercase letter hasn’t been applied to either the master or the member account, as shown in Figure 12.

Figure 12: CIS 1.5 finding
Select CIS 1.5 – 1.11 from Actions on the top right dropdown of the Findings section from the previous step. You should see a notification with the heading Successfully sent findings to Amazon CloudWatch Events as shown in Figure 13.

Figure 13: Sending findings to CloudWatch Events
Return to Findings by selecting Security standards and then choosing CIS AWS Foundations Benchmark v1.2.0. Select CIS 1.5 to review Findings and verify that the Workflow status of CIS 1.5 is RESOLVED, as shown in Figure 14.

Figure 14: Resolved findings
After the remediation runs, you can verify that the Password policy is set on the master and the member accounts. To verify that the password policy is set, navigate to IAM, and then to Account Settings. Under Password policy, you should see that the account uses a password policy, as shown in Figure 15.

Figure 15: Password policy set
To check the CloudWatch logs for the Lambda function, in the console, go to Services, and then select Lambda and choose the Lambda function and within the Lambda function, select View logs in CloudWatch. You can see the details of the function being run, including updating the password policy on both the master account and the member account, as shown in Figure 16.

Figure 16: Lambda function log

Conclusion

In this post, you deployed the AWS Solution for Security Hub Automated Response and Remediation using Lambda and CloudWatch Events rules to remediate non-compliant CIS-related controls. With this solution, you can ensure that users in member accounts stay compliant with the CIS AWS Foundations Benchmark by automatically invoking guardrails whenever services move out of compliance. New or updated playbooks will be added to the existing AWS Service Catalog portfolio as they’re developed. You can choose when to take advantage of these new or updated playbooks.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security Hub forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.