Tag Archives: Continuous Delivery

Blue/Green deployments using AWS CDK Pipelines and AWS CodeDeploy

2023-10-10 Luiz Decaro

Post Syndicated from Luiz Decaro original https://aws.amazon.com/blogs/devops/blue-green-deployments-using-aws-cdk-pipelines-and-aws-codedeploy/

Customers often ask for help with implementing Blue/Green deployments to Amazon Elastic Container Service (Amazon ECS) using AWS CodeDeploy. Their use cases usually involve cross-Region and cross-account deployment scenarios. These requirements are challenging enough on their own, but in addition to those, there are specific design decisions that need to be considered when using CodeDeploy. These include how to configure CodeDeploy, when and how to create CodeDeploy resources (such as Application and Deployment Group), and how to write code that can be used to deploy to any combination of account and Region.

Today, I will discuss those design decisions in detail and how to use CDK Pipelines to implement a self-mutating pipeline that deploys services to Amazon ECS in cross-account and cross-Region scenarios. At the end of this blog post, I also introduce a demo application, available in Java, that follows best practices for developing and deploying cloud infrastructure using AWS Cloud Development Kit (AWS CDK).

The Pipeline

CDK Pipelines is an opinionated construct library used for building pipelines with different deployment engines. It abstracts implementation details that developers or infrastructure engineers need to solve when implementing a cross-Region or cross-account pipeline. For example, in cross-Region scenarios, AWS CloudFormation needs artifacts to be replicated to the target Region. For that reason, AWS Key Management Service (AWS KMS) keys, an Amazon Simple Storage Service (Amazon S3) bucket, and policies need to be created for the secondary Region. This enables artifacts to be moved from one Region to another. In cross-account scenarios, CodeDeploy requires a cross-account role with access to the KMS key used to encrypt configuration files. This is the sort of detail that our customers want to avoid dealing with manually.

AWS CodeDeploy is a deployment service that automates application deployment across different scenarios. It deploys to Amazon EC2 instances, On-Premises instances, serverless Lambda functions, or Amazon ECS services. It integrates with AWS Identity and Access Management (AWS IAM), to implement access control to deploy or re-deploy old versions of an application. In the Blue/Green deployment type, it is possible to automate the rollback of a deployment using Amazon CloudWatch Alarms.

CDK Pipelines was designed to automate AWS CloudFormation deployments. Using AWS CDK, these CloudFormation deployments may include deploying application software to instances or containers. However, some customers prefer using CodeDeploy to deploy application software. In this blog post, CDK Pipelines will deploy using CodeDeploy instead of CloudFormation.

A pipeline build with CDK Pipelines that deploys to Amazon ECS using AWS CodeDeploy. It contains at least 5 stages: Source, Build, UpdatePipeline, Assets and at least one Deployment stage.

Design Considerations

In this post, I’m considering the use of CDK Pipelines to implement different use cases for deploying a service to any combination of accounts (single-account & cross-account) and regions (single-Region & cross-Region) using CodeDeploy. More specifically, there are four problems that need to be solved:

CodeDeploy Configuration

The most popular options for implementing a Blue/Green deployment type using CodeDeploy are using CloudFormation Hooks or using a CodeDeploy construct. I decided to operate CodeDeploy using its configuration files. This is a flexible design that doesn’t rely on using custom resources, which is another technique customers have used to solve this problem. On each run, a pipeline pushes a container to a repository on Amazon Elastic Container Registry (ECR) and creates a tag. CodeDeploy needs that information to deploy the container.

I recommend creating a pipeline action to scan the AWS CDK cloud assembly and retrieve the repository and tag information. The same action can create the CodeDeploy configuration files. Three configuration files are required to configure CodeDeploy: appspec.yaml, taskdef.json and imageDetail.json. This pipeline action should be executed before the CodeDeploy deployment action. I recommend creating template files for appspec.yaml and taskdef.json. The following script can be used to implement the pipeline action:

##
#!/bin/sh
#
# Action Configure AWS CodeDeploy
# It customizes the files template-appspec.yaml and template-taskdef.json to the environment
#
# Account = The target Account Id
# AppName = Name of the application
# StageName = Name of the stage
# Region = Name of the region (us-east-1, us-east-2)
# PipelineId = Id of the pipeline
# ServiceName = Name of the service. It will be used to define the role and the task definition name
#
# Primary output directory is codedeploy/. All the 3 files created (appspec.json, imageDetail.json and 
# taskDef.json) will be located inside the codedeploy/ directory
#
##
Account=$1
Region=$2
AppName=$3
StageName=$4
PipelineId=$5
ServiceName=$6
repo_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages[] | .destinations[] | .repositoryName' | head -1) 
tag_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages | to_entries[0].key')  
echo ${repo_name} 
echo ${tag_name} 
printf '{"ImageURI":"%s"}' "$Account.dkr.ecr.$Region.amazonaws.com/${repo_name}:${tag_name}" > codedeploy/imageDetail.json                     
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-appspec.yaml > codedeploy/appspec.yaml 
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-taskdef.json | sed 's#TASK_EXEC_ROLE#arn:aws:iam::'$Account':role/'$ServiceName'#g' | sed 's#fargate-task-definition#'$ServiceName'#g' > codedeploy/taskdef.json 
cat codedeploy/appspec.yaml
cat codedeploy/taskdef.json
cat codedeploy/imageDetail.json

Using a Toolchain

A good strategy is to encapsulate the pipeline inside a Toolchain to abstract how to deploy to different accounts and regions. This helps decoupling clients from the details such as how the pipeline is created, how CodeDeploy is configured, and how cross-account and cross-Region deployments are implemented. To create the pipeline, deploy a Toolchain stack. Out-of-the-box, it allows different environments to be added as needed. Depending on the requirements, the pipeline may be customized to reflect the different stages or waves that different components might require. For more information, please refer to our best practices on how to automate safe, hands-off deployments and its reference implementation.

In detail, the Toolchain stack follows the builder pattern used throughout the CDK for Java. This is a convenience that allows complex objects to be created using a single statement:

 Toolchain.Builder.create(app, Constants.APP_NAME+"Toolchain")
        .stackProperties(StackProps.builder()
                .env(Environment.builder()
                        .account(Demo.TOOLCHAIN_ACCOUNT)
                        .region(Demo.TOOLCHAIN_REGION)
                        .build())
                .build())
        .setGitRepo(Demo.CODECOMMIT_REPO)
        .setGitBranch(Demo.CODECOMMIT_BRANCH)
        .addStage(
                "UAT",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                             
        .build();

In the statement above, the continuous deployment pipeline is created in the TOOLCHAIN_ACCOUNT and TOOLCHAIN_REGION. It implements a stage that builds the source code and creates the Java archive (JAR) using Apache Maven. The pipeline then creates a Docker image containing the JAR file.

The UAT stage will deploy the service to the SERVICE_ACCOUNT and SERVICE_REGION using the deployment configuration CANARY_10_PERCENT_5_MINUTES. This means 10 percent of the traffic is shifted in the first increment and the remaining 90 percent is deployed 5 minutes later.

To create additional deployment stages, you need a stage name, a CodeDeploy deployment configuration and an environment where it should deploy the service. As mentioned, the pipeline is, by default, a self-mutating pipeline. For example, to add a Prod stage, update the code that creates the Toolchain object and submit this change to the code repository. The pipeline will run and update itself adding a Prod stage after the UAT stage. Next, I show in detail the statement used to add a new Prod stage. The new stage deploys to the same account and Region as in the UAT environment:

... 
        .addStage(
                "Prod",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                                                      
        .build();

In the statement above, the Prod stage will deploy new versions of the service using a CodeDeploy deployment configuration CANARY_10_PERCENT_5_MINUTES. It means that 10 percent of traffic is shifted in the first increment of 5 minutes. Then, it shifts the rest of the traffic to the new version of the application. Please refer to Organizing Your AWS Environment Using Multiple Accounts whitepaper for best-practices on how to isolate and manage your business applications.

Some customers might find this approach interesting and decide to provide this as an abstraction to their application development teams. In this case, I advise creating a construct that builds such a pipeline. Using a construct would allow for further customization. Examples are stages that promote quality assurance or deploy the service in a disaster recovery scenario.

The implementation creates a stack for the toolchain and another stack for each deployment stage. As an example, consider a toolchain created with a single deployment stage named UAT. After running successfully, the DemoToolchain and DemoService-UAT stacks should be created as in the next image:

Two stacks are needed to create a Pipeline that deploys to a single environment. One stack deploys the Toolchain with the Pipeline and another stack deploys the Service compute infrastructure and CodeDeploy Application and DeploymentGroup. In this example, for an application named Demo that deploys to an environment named UAT, the stacks deployed are: DemoToolchain and DemoService-UAT

CodeDeploy Application and Deployment Group

CodeDeploy configuration requires an application and a deployment group. Depending on the use case, you need to create these in the same or in a different account from the toolchain (pipeline). The pipeline includes the CodeDeploy deployment action that performs the blue/green deployment. My recommendation is to create the CodeDeploy application and deployment group as part of the Service stack. This approach allows to align the lifecycle of CodeDeploy application and deployment group with the related Service stack instance.

CodePipeline allows to create a CodeDeploy deployment action that references a non-existing CodeDeploy application and deployment group. This allows us to implement the following approach:

Toolchain stack deploys the pipeline with CodeDeploy deployment action referencing a non-existing CodeDeploy application and deployment group
When the pipeline executes, it first deploys the Service stack that creates the related CodeDeploy application and deployment group
The next pipeline action executes the CodeDeploy deployment action. When the pipeline executes the CodeDeploy deployment action, the related CodeDeploy application and deployment will already exist.

Below is the pipeline code that references the (initially non-existing) CodeDeploy application and deployment group.

private IEcsDeploymentGroup referenceCodeDeployDeploymentGroup(
        final Environment env, 
        final String serviceName, 
        final IEcsDeploymentConfig ecsDeploymentConfig, 
        final String stageName) {

    IEcsApplication codeDeployApp = EcsApplication.fromEcsApplicationArn(
            this,
            Constants.APP_NAME + "EcsCodeDeployApp-"+stageName,
            Arn.format(ArnComponents.builder()
                    .arnFormat(ArnFormat.COLON_RESOURCE_NAME)
                    .partition("aws")
                    .region(env.getRegion())
                    .service("codedeploy")
                    .account(env.getAccount())
                    .resource("application")
                    .resourceName(serviceName)
                    .build()));

    IEcsDeploymentGroup deploymentGroup = EcsDeploymentGroup.fromEcsDeploymentGroupAttributes(
            this,
            Constants.APP_NAME + "-EcsCodeDeployDG-"+stageName,
            EcsDeploymentGroupAttributes.builder()
                    .deploymentGroupName(serviceName)
                    .application(codeDeployApp)
                    .deploymentConfig(ecsDeploymentConfig)
                    .build());

    return deploymentGroup;
}

To make this work, you should use the same application name and deployment group name values when creating the CodeDeploy deployment action in the pipeline and when creating the CodeDeploy application and deployment group in the Service stack (where the Amazon ECS infrastructure is deployed). This approach is necessary to avoid a circular dependency error when trying to create the CodeDeploy application and deployment group inside the Service stack and reference these objects to configure the CodeDeploy deployment action inside the pipeline. Below is the code that uses Service stack construct ID to name the CodeDeploy application and deployment group. I set the Service stack construct ID to the same name I used when creating the CodeDeploy deployment action in the pipeline.

   // configure AWS CodeDeploy Application and DeploymentGroup
   EcsApplication app = EcsApplication.Builder.create(this, "BlueGreenApplication")
           .applicationName(id)
           .build();

   EcsDeploymentGroup.Builder.create(this, "BlueGreenDeploymentGroup")
           .deploymentGroupName(id)
           .application(app)
           .service(albService.getService())
           .role(createCodeDeployExecutionRole(id))
           .blueGreenDeploymentConfig(EcsBlueGreenDeploymentConfig.builder()
                   .blueTargetGroup(albService.getTargetGroup())
                   .greenTargetGroup(tgGreen)
                   .listener(albService.getListener())
                   .testListener(listenerGreen)
                   .terminationWaitTime(Duration.minutes(15))
                   .build())
           .deploymentConfig(deploymentConfig)
           .build();

CDK Pipelines roles and permissions

CDK Pipelines creates roles and permissions the pipeline uses to execute deployments in different scenarios of regions and accounts. When using CodeDeploy in cross-account scenarios, CDK Pipelines deploys a cross-account support stack that creates a pipeline action role for the CodeDeploy action. This cross-account support stack is defined in a JSON file that needs to be published to the AWS CDK assets bucket in the target account. If the pipeline has the self-mutation feature on (default), the UpdatePipeline stage will do a cdk deploy to deploy changes to the pipeline. In cross-account scenarios, this deployment also involves deploying/updating the cross-account support stack. For this, the SelfMutate action in UpdatePipeline stage needs to assume CDK file-publishing and a deploy roles in the remote account.

The IAM role associated with the AWS CodeBuild project that runs the UpdatePipeline stage does not have these permissions by default. CDK Pipelines cannot grant these permissions automatically, because the information about the permissions that the cross-account stack needs is only available after the AWS CDK app finishes synthesizing. At that point, the permissions that the pipeline has are already locked-in. Hence, for cross-account scenarios, the toolchain should extend the permissions of the pipeline’s UpdatePipeline stage to include the file-publishing and deploy roles.

In cross-account environments it is possible to manually add these permissions to the UpdatePipeline stage. To accomplish that, the Toolchain stack may be used to hide this sort of implementation detail. In the end, a method like the one below can be used to add these missing permissions. For each different mapping of stage and environment in the pipeline it validates if the target account is different than the account where the pipeline is deployed. When the criteria is met, it should grant permission to the UpdatePipeline stage to assume CDK bootstrap roles (tagged using key aws-cdk:bootstrap-role) in the target account (with the tag value as file-publishing or deploy). The example below shows how to add permissions to the UpdatePipeline stage:

private void grantUpdatePipelineCrossAccoutPermissions(Map<String, Environment> stageNameEnvironment) {

    if (!stageNameEnvironment.isEmpty()) {

        this.pipeline.buildPipeline();
        for (String stage : stageNameEnvironment.keySet()) {

            HashMap<String, String[]> condition = new HashMap<>();
            condition.put(
                    "iam:ResourceTag/aws-cdk:bootstrap-role",
                    new String[] {"file-publishing", "deploy"});
            pipeline.getSelfMutationProject()
                    .getRole()
                    .addToPrincipalPolicy(PolicyStatement.Builder.create()
                            .actions(Arrays.asList("sts:AssumeRole"))
                            .effect(Effect.ALLOW)
                            .resources(Arrays.asList("arn:*:iam::"
                                    + stageNameEnvironment.get(stage).getAccount() + ":role/*"))
                            .conditions(new HashMap<String, Object>() {{
                                    put("ForAnyValue:StringEquals", condition);
                            }})
                            .build());
        }
    }
}

The Deployment Stage

Let’s consider a pipeline that has a single deployment stage, UAT. The UAT stage deploys a DemoService. For that, it requires four actions: DemoService-UAT (Prepare and Deploy), ConfigureBlueGreenDeploy and Deploy.

The
DemoService-UAT.Deploy action will create the ECS resources and the CodeDeploy application and deployment group. The
ConfigureBlueGreenDeploy action will read the AWS CDK
cloud assembly. It uses the configuration files to identify the Amazon Elastic Container Registry (Amazon ECR) repository and the container image tag pushed. The pipeline will send this information to the
Deploy action. The
Deploy action starts the deployment using CodeDeploy.

Solution Overview

As a convenience, I created an application, written in Java, that solves all these challenges and can be used as an example. The application deployment follows the same 5 steps for all deployment scenarios of account and Region, and this includes the scenarios represented in the following design:

A pipeline created by a Toolchain should be able to deploy to any combination of accounts and regions. This includes four scenarios: single-account and single-Region, single-account and cross-Region, cross-account and single-Region and cross-account and cross-Region

Conclusion

In this post, I identified, explained and solved challenges associated with the creation of a pipeline that deploys a service to Amazon ECS using CodeDeploy in different combinations of accounts and regions. I also introduced a demo application that implements these recommendations. The sample code can be extended to implement more elaborate scenarios. These scenarios might include automated testing, automated deployment rollbacks, or disaster recovery. I wish you success in your transformative journey.

Securing GitOps pipelines

2023-03-01 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/securing-gitops-pipeline

Introduction

Grab’s real-time data platform team, Coban, has been managing infrastructure resources via Infrastructure-as-code (IaC). Through the IaC approach, Terraform is used to maintain infrastructure consistency, automation, and ease of deployment of our streaming infrastructure, notably:

Flink pipelines
Kafka topics
Kafka Connect connectors

With Grab’s exponential growth, there needs to be a better way to scale infrastructure automatically. Moving towards GitOps processes benefits us in many ways:

Versioned and immutable: With our source code being stored in Git repositories, the desired state of infrastructure is stored in an environment that enforces immutability, versioning, and retention of version history, which helps with auditing and traceability.
Faster deployment: By automating the process of deploying resources after code is merged, we eliminate manual steps and improve overall engineering productivity while maintaining consistency.
Easier rollbacks: It’s as simple as making a revert for a Git commit as compared to creating a merge request (MR) and commenting Atlantis commands, which add extra steps and contribute to a higher mean-time-to-resolve (MTTR) for incidents.

Background

Originally, Coban implemented automation on Terraform resources using Atlantis, an application that operates based on user comments on MRs.

We have come a long way with Atlantis. It has helped us to automate our workflows and enable self-service capabilities for our engineers. However, there were a few limitations in our setup, which we wanted to improve:

Course grained: There is no way to restrict the kind of Terraform resources users can create, which introduces security issues. For example, if a user is one of the Code owners, they can create another IAM role with Admin privileges with approval from their own team anywhere in the repository.
Limited automation: Users are still required to make comments in their MR such as atlantis apply. This requires the learning of Atlantis commands and is prone to human errors.
Limited capability: Having to rely entirely on Terraform and Hashicorp Configuration Language (HCL) functions to validate user input comes with limitations. For example, the ability to validate an input variable based on the value of another has been a requested feature for a long time.
Not adhering to Don’t Repeat Yourself (DRY) principle: Users need to create an entire Terraform project with boilerplate codes such as Terraform environment, local variables, and Terraform provider configurations to create a simple resource such as a Kafka topic.

Solution

We have developed an in-house GitOps solution named Khone. Its name was inspired by the Khone Phapheng Waterfall. We have evaluated some of the best and most widely used GitOps products available but chose not to go with any as the majority of them aim to support Kubernetes native or custom resources, and we needed infrastructure provisioning that is beyond Kubernetes. With our approach, we have full control of the entire user flow and its implementation, and thus we benefit from:

Security: The ability to secure the pipeline with many customised scripts and workflows.
Simple user experience (UX): Simplified user flow and prevents human errors with automation.
DRY: Minimise boilerplate codes. Users only need to create a single Terraform resource and not an entire Terraform project.

With all types of streaming infrastructure resources that we support, be it Kafka topics or Flink pipelines, we have identified they all have common properties such as namespace, environment, or cluster name such as Kafka cluster and Kubernetes cluster. As such, using those values as file paths help us to easily validate users input and de-couple them from the resource specific configuration properties in their HCL source code. Moreover, it helps to remove redundant information to maintain consistency. If the piece of information is in the file path, it won’t be elsewhere in resource definition.

With this approach, we can utilise our pipeline scripts, which are written in Python and perform validations on the types of resources and resource names using Regular Expressions (Regex) without relying on HCL functions. Furthermore, we helped prevent human errors and improved developers’ efficiency by deriving these properties and reducing boilerplate codes by automatically parsing out other necessary configurations such as Kafka brokers endpoint from the cluster name and environment.

Pipeline stages

Khone’s pipeline implementation is designed with three stages. Each stage has different duties and responsibilities in verifying user input and securely creating the resources.

Initialisation stage

At this stage, we categorise the changes into Deleted, Created or Changed resources and filter out unsupported resource types. We also prevent users from creating unintended resources by validating them based on resource path and inspecting the HCL source code in their Terraform module. This stage also prepares artefacts for subsequent stages.

*Fig. 5 Terraform changes detected by Khone*

Terraform stage

This is a downstream pipeline that runs either the Terraform plan or Terraform apply command depending on the state of the MR, which can either be pending review or merged. Individual jobs run in parallel for each resource change, which helps with performance and reduces the overall pipeline run time.

For each individual job, we implemented multiple security checkpoints such as:

Code inspection: We use the python-hcl2 library to read HCL content of Terraform resources to perform validation, restrict the types of Terraform resources users can create, and ensure that resources have the intended configurations. We also validate whitelisted Terraform module source endpoint based on the declared resource type. This enables us to inherit the flexibility of Python as a programming language and perform validations more dynamically rather than relying on HCL functions.
Resource validation: We validate configurations based on resource path to ensure users are following the correct and intended directory structure.
Linting and formatting: Perform HCL code linting and formatting using Terraform CLI to ensure code consistency.

Furthermore, our Terraform module independently validates parameters by verifying the working directory instead of relying on user input, acting as an additional layer of defence for validation.

path = one(regexall(join("/",
[
    "^*",
    "(?P<repository>khone|khone-dev)",
    "resources",
    "(?P<namespace>[^/]*)",
    "(?P<resource_type>[^/]*)",
    "(?P<env>[^/]*)",
    "(?P<cluster_name>[^/]*)",
    "(?P<resource_name>[^/]*)$"
]), path.cwd))

Metric stage

In this stage, we consolidate previous jobs’ status and publish our pipeline metrics such as success or error rate.

For our metrics, we identified actual users by omitting users from Coban. This helps us measure success metrics more consistently as we could isolate metrics from test continuous integration/continuous deployment (CI/CD) pipelines.

For the second half of 2022, we achieved a 100% uptime for Khone pipelines.

*Fig. 6 Khone’s success metrics for the second half of 2022*

Preventing pipeline config tampering

By default, with each repository on GitLab that has CI/CD pipelines enabled, owners or administrators would need to have a pipeline config file at the root directory of the repository with the name .gitlab-ci.yml. Other scripts may also be stored somewhere within the repository.

With this setup, whenever a user creates an MR, if the pipeline config file is modified as part of the MR, the modified version of the config file will be immediately reflected in the pipeline’s run. Users can exploit this by running arbitrary code on the privileged GitLab runner.

In order to prevent this, we utilise GitLab’s remote pipeline config functionality. We have created another private repository, khone-admin, and stored our pipeline config there.

In Fig. 7, our configuration is set to a file called khone-gitlab-ci.yml residing in the khone-admin repository under snd group.

Preventing pipeline scripts tampering

We had scripts that ran before the MR and they were approved and merged to perform preliminary checks or validations. They were also used to run the Terraform plan command. Users could modify these existing scripts to perform malicious actions. For example, they could bypass all validations and directly run the Terraform apply command to create unintended resources.

This can be prevented by storing all of our scripts in the khone-admin repository and cloning them in each stage of our pipeline using the before_script clause.

default:
  before_script:
    - rm -rf khone_admin
    - git clone --depth 1 --single-branch https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git khone_admin

Even though this adds an overhead to each of our pipeline jobs and increases run time, the amount is insignificant as we have optimised the process by using shallow cloning. The Git clone command included in the above script with depth=1 and single-branch flag has reduced the time it takes to clone the scripts down to only 0.59 seconds.

Testing our pipeline

With all the security measures implemented for Khone, this raises a question of how did we test the pipeline? We have done this by setting up an additional repository called khone-dev.

Pipeline config

Within this khone-dev repository, we have set up a remote pipeline config file following this format:

<File Name>@<Repository Ref>:<Branch Name>

*Fig. 9 Khone-dev’s remote pipeline config*

In Fig. 9, our configuration is set to a file called khone-gitlab-ci.yml residing in the khone-admin repository under the snd group and under a branch named ci-test. With this approach, we can test our pipeline config without having to merge it to master branch that affects the main Khone repository. As a security measure, we only allow users within a certain GitLab group to push changes to this branch.

Pipeline scripts

Following the same method for pipeline scripts, instead of cloning from the master branch in the khone-admin repository, we have implemented a logic to clone them from the branch matching our lightweight directory access protocol (LDAP) user account if it exists. We utilised the GITLAB_USER_LOGIN environment variable that is injected by GitLab to each individual CI job to get the respective LDAP account to perform this logic.

default:
  before_script:
    - rm -rf khone_admin
    - |
      if git ls-remote --exit-code --heads "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" "$GITLAB_USER_LOGIN" > /dev/null; then
        echo "Cloning khone-admin from dev branch ${GITLAB_USER_LOGIN}"
        git clone --depth 1 --branch "$GITLAB_USER_LOGIN" --single-branch "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" khone_admin
      else
        echo "Dev branch ${GITLAB_USER_LOGIN} not found, cloning from master instead"
        git clone --depth 1 --single-branch "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" khone_admin
      fi

What’s next?

With security being our main focus for our Khone GitOps pipeline, we plan to abide by the principle of least privilege and implement separate GitLab runners for different types of resources and assign them with just enough IAM roles and policies, and minimal network security group rules to access our Kafka or Kubernetes clusters.

Furthermore, we also plan to maintain high standards and stability by including unit tests in our CI scripts to ensure that every change is well-tested before being deployed.

References

Special thanks to Fabrice Harbulot for kicking off this project and building a strong foundation for it.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How we reduced our CI YAML files from 1800 lines to 50 lines

2022-04-19 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/how-we-reduced-our-ci-yaml

This article illustrates how the Cauldron Machine Learning (ML) Platform team uses GitLab parent-child pipelines to dynamically generate GitLab CI files to solve several limitations of GitLab for large repositories, namely:

Limitations to the number of includes (100 by default).
Simplifying the GitLab CI file from 1800 lines to 50 lines.
Reducing the need for nested gitlab-ci yml files.

Introduction

Cauldron is the Machine Learning (ML) Platform team at Grab. The Cauldron team provides tools for ML practitioners to manage the end to end lifecycle of ML models, from training to deployment. GitLab and its tooling are an integral part of our stack, for continuous delivery of machine learning.

One of our core products is MerLin Pipelines. Each team has a dedicated repo to maintain the code for their ML pipelines. Each pipeline has its own subfolder. We rely heavily on GitLab rules to detect specific changes to trigger deployments for the different stages of different pipelines (for example, model serving with Catwalk, and so on).

Background

Approach 1: Nested child files

Our initial approach was to rely heavily on static code generation to generate the child gitlab-ci.yml files in individual stages. See Figure 1 for an example directory structure. These nested yml files are pre-generated by our cli and committed to the repository.

Figure 1: Example directory structure with nested gitlab-ci.yml files.

Child gitlab-ci.yml files are added by using the include keyword.

Figure 2: Example root .gitlab-ci.yml file, and include clauses.

Figure 3: Example child `.gitlab-ci.yml` file for a given stage (Deploy Model) in a pipeline (pipeline 1).

As teams add more pipelines and stages, we soon hit a limitation in this approach:

There was a soft limit in the number of includes that could be in the base .gitlab-ci.yml file.

It became evident that this approach would not scale to our use-cases.

Approach 2: Dynamically generating a big CI file

Our next attempt to solve this problem was to try to inject and inline the nested child gitlab-ci.yml contents into the root gitlab-ci.yml file, so that we no longer needed to rely on the in-built GitLab “include” clause.

To achieve it, we wrote a utility that parsed a raw gitlab-ci file, walked the tree to retrieve all “included” child gitlab-ci files, and to replace the includes to generate a final big gitlab-ci.yml file.

Figure 4 illustrates the resulting file is generated from Figure 3.

Figure 4: “Fat” YAML file generated through this approach, assumes the original raw file of Figure 3.

This approach solved our issues temporarily. Unfortunately, we ended up with GitLab files that were up to 1800 lines long. There is also a soft limit to the size of gitlab-ci.yml files. It became evident that we would eventually hit the limits of this approach.

Solution

Our initial attempt at using static code generation put us partially there. We were able to pre-generate and infer the stage and pipeline names from the information available to us. Code generation was definitely needed, but upfront generation of code had some key limitations, as shown above. We needed a way to improve on this, to somehow generate GitLab stages on the fly. After some research, we stumbled upon Dynamic Child Pipelines.

Quoting the official website:

Instead of running a child pipeline from a static YAML file, you can define a job that runs your own script to generate a YAML file, which is then used to trigger a child pipeline.

This technique can be very powerful in generating pipelines targeting content that changed or to build a matrix of targets and architectures.

We were already on the right track. We just needed to combine code generation with child pipelines, to dynamically generate the necessary stages on the fly.

Architecture details

Figure 5: Flow diagram of how we use dynamic yaml generation. The user raises a merge request in a branch, and subsequently merges the branch to master.

Implementation

The user Git flow can be seen in Figure 5, where the user modifies or adds some files in their respective Git team repo. As a refresher, a typical repo structure consists of pipelines and stages (see Figure 1). We would need to extract the information necessary from the branch environment in Figure 5, and have a stage to programmatically generate the proper stages (for example, Figure 3).

In short, our requirements can be summarized as:

Detecting the files being changed in the Git branch.
Extracting the information needed from the files that have changed.
Passing this to be templated into the necessary stages.

Let’s take a very simple example, where a user is modifying a file in stage_1 in pipeline_1 in Figure 1. Our desired output would be:

Figure 6: Desired output that should be dynamically generated.

Our template would be in the form of:

$Figure 7: Example template, and information needed. Let’s call it template\_file.yml.$
Figure 7: Example template, and information needed. Let’s call it template_file.yml.

First, we need to detect the files being modified in the branch. We achieve this with native git diff commands, checking against the base of the branch to track what files are being modified in the merge request. The output (let’s call it diff.txt) would be in the form of:

M        pipelines/pipeline_1/stage_1/modelserving.yaml

Figure 8: Example diff.txt generated from git diff.

We must extract the yellow and green information from the line, corresponding to pipeline_name and stage_name.

Figure 9: Information that needs to be extracted from the file.

We take a very simple approach here, by introducing a concept called stop patterns.

Stop patterns are defined as a comma separated list of variable names, and the words to stop at. The colon (:) denotes how many levels before the stop word to stop.

For example, the stop pattern:

pipeline_name:pipelines

tells the parser to look for the folder pipelines and stop before that, extracting pipeline_1 from the example above tagged to the variable name pipeline_name.

The stop pattern with two colons (::):

stage_name::pipelines

tells the parser to stop two levels before the folder pipelines, and extract stage_1 as stage_name.

Our cli tool allows the stop patterns to be comma separated, so the final command would be:

cauldron_repo_util diff.txt template_file.yml
pipeline_name:pipelines,stage_name::pipelines > generated.yml

We elected to write the util in Rust due to its high performance, and its rich templating libraries (for example, Tera) and decent cli libraries (clap).

Combining all these together, we are able to extract the information needed from git diff, and use stop patterns to extract the necessary information to be passed into the template. Stop patterns are flexible enough to support different types of folder structures.

Figure 10: Example Rust code snippet for parsing the Git diff file.

When triggering pipelines in the master branch (see right side of Figure 5), the flow is the same, with a small caveat that we must retrieve the same diff.txt file from the source branch. We achieve this by using the rich GitLab API, retrieving the pipeline artifacts and using the same util above to generate the necessary GitLab steps dynamically.

Impact

After implementing this change, our biggest success was reducing one of the biggest ML pipeline Git repositories from 1800 lines to 50 lines. This approach keeps the size of the .gitlab-ci.yaml file constant at 50 lines, and ensures that it scales with however many pipelines are added.

Our users, the machine learning practitioners, also find it more productive as they no longer need to worry about GitLab yaml files.

Learnings and conclusion

With some creativity, and the flexibility of GitLab Child Pipelines, we were able to invest some engineering effort into making the configuration re-usable, adhering to DRY principles.

Special thanks to the Cauldron ML Platform team.

What’s next

We might open source our solution.

References

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Safe Updates of Client Applications at Netflix

2021-10-07 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/safe-updates-of-client-applications-at-netflix-1d01c71a930c

By Minal Mishra

Quality of a client application is of paramount importance to global digital products, as it is the primary way customers interact with a brand. At Netflix, we have significant investments in ensuring new versions of our applications are well tested. However, Netflix is available for streaming on thousands of types of devices and it is powered by hundreds of micro-services which are deployed independently, making it extremely challenging to comprehensively test internally. Hence, it became important to supplement our release decisions with strong evidence received from the field during the update process.

Our team was formed to mine health signals from the field to quickly evaluate new versions of the client applications. As we invested in systems to enable this vision, it led to increased development velocity, which arguably led to better development practices and quality of the applications. The goal of this blog post is to highlight the investment areas for this vision and the challenges we are facing today.

Client Applications

We deal with two classes of client application updates. The first is where an application package is downloaded from the service or a CDN. An example of this is Netflix’s video player or the TV UI javascript package. The second is one where an application is hosted behind an app store, for example mobile phones or even game consoles. We have more flexibility to control the distribution of the application in the former than in the latter case.

Deployment Strategies

We are all familiar with the advantages of releasing frequently and in smaller chunks. It helps bring a healthy balance to the velocity and quality equation. The challenge for clients is that each instance of the application runs on a Netflix member’s device and signals are derived from a firehose of events being sent by devices across the globe. Depending on the type of client, we need to determine the right strategy to sample consumer devices, and provide a system that can enable various client engineering teams to look for their signals. Hence, the sampling strategy is different if it is a mobile application versus a smart TV. In contrast, a server application runs on servers which are typically identical and a routing abstraction can serve sampled traffic to new versions. And the signals to evaluate a new version are derived from comparatively few thousands of homogenous servers instead of millions of heterogeneous devices.

**Staged rollouts** of apps mimic the different phases of moon

A widely adopted technique for client applications is gradually rolling out a new version of software rather than making the release available to all users instantly, also known as staged or phased rollout. There are two main benefits to this approach.

First, if something were to fail catastrophically, the release can be paused for triage, limiting the number of customers impacted.
Second, backend services or infrastructure can be scaled intelligently as adoption ramps up.

Application version adoption over time for a **staged rollout**

This chart represents a counter metric, which exhibits version adoption over the duration of a staged rollout. There is a gradual increase in the percentage of devices switching to N+1 version. In the past, during this period client engineering teams would visually monitor their metric dashboards to evaluate signals as more consumers migrated to a new version of their application.

Client-side error rate during the **staged rollout**

The chart of client-side error rate during the same time period as the version migration is shown here. We observe that the metric for the new version N+1 stabilizes as the rollout ramps up and reaches closer to 100%, whereas the metric for the current version N becomes noisy over the same time duration. Trying to compare any metric during this time period can be a futile effort, as obvious in this case where there was no customer impacting shift in the error rate but we cannot interpret that from the chart. Typically, teams time-shift one metric over the other to visually detect metric deviations, but time can still be a confounder. Staged rollouts have a lot of benefits, but there is a significant opportunity cost to wait before the new version reaches a critical level of adoption.

AB Tests/Client Canaries

So we brought the science of controlled testing into this decision framework by using what has been utilized for feature evaluations. The main goal of A/B testing is to design a robust experiment that is going to yield repeatable results and enable us to make sound decisions about whether or not to launch a product feature (read more about A/B tests at Netflix here). In the application update use case, we recommend an extreme version of A/B testing: we test the entire application. The new version may include a user facing feature which is designed to be A/B tested and resides behind a feature flag. However, most times it is adding new obvious improvements, simple bug fixes, performance enhancements, productizing outcomes from previous A/B tests, logging etc that are being shipped in the application. If we apply A/B tests methodology (or client canaries as we like to call them to differentiate from traditional feature based A/B tests) the allocation would look identical for both the versions at any time.

Client Canary and Control allocation along with the client-side error rate metric

This chart is showing the new and the baseline version allocations growing over time. Although, majority of users are already on the baseline version we are randomly “allocating” a fraction of those users to be the control group of our experiment. This ensures there is no sampling mismatch between the treatment and the control group. It is easier to visually compare the client side error rate for both versions and even apply statistical inference to change the conversation from “we think” there is a shift in metrics to “we know”.

But there are differences between feature related A/B tests at Netflix and the incremental product changes used for Client Canaries. The main distinctions are: a shorter runtime, multiple executions of analysis sometimes concurrent with allocation, and use of data to support the null hypothesis. The runtime is predetermined, which in a way, is the stopping rule for client canaries. Unlike feature A/B tests at Netflix, we limit our evidence collection to a few hours, so we can release updates within a working day. We continuously analyze metrics to find egregious regressions sooner rather than once all the evidence has been collected.

The three key phases of any A/B tests can be split into Allocation, Metric Collection and Analysis. We use orchestration to connect and manage client applications through the A/B test lifecycle, thereby reducing the cognitive load of deploying them frequently.

Allocation

Sampling is the first stage once your new application has been packaged, tested and published. As time is of the essence here, we rely on dynamic allocation and allocate devices which come to the service during the canary time period based on pre-configured rules. We leverage the allocation service used for all experimentation at Netflix for this purpose.

However, for applications gated behind an external app store (example mobile apps), we only have access to staged rollout solutions provided by the app stores. We can control the percentage of users receiving updated apps, which can increase over time. In order to mimic the client canary solution, we built a synthetic allocation service to perform sampling post-installation of the app updates. This service tries to allocate a device to the control group that typically matches the profile of a device seen in the treatment group, which was allocated by the app store’s staged rollout solution. This ensures we are controlling for key variables which have the potential to impact the analysis.

Metrics

Metrics are a foundational component for client canaries and A/B tests as they give us the necessary insight required to make decisions. And for our use case, metrics need to be computed in real time from millions of user events being sent to our service. Operating at Netflix’s scale, we have to process the event streams on a scalable platform like Mantis and store the time-series data in Apache Druid. To be further cost-efficient with the time-series data we store the metrics for a sliding time window of a few weeks and compress it to a 1 minute time granularity.

The other challenge is to enable client application engineers to contribute to metrics and dimensions as they are aware of what can be a valuable insight. To do this, our real-time metric data pipeline provides the right abstractions to remove the complexity of a distributed stream processing system and also enables these contributions to be used in offline computations for feature A/B test evaluations.The former reduces the barrier to entry and the latter provides additional motivation for client engineers to contribute. Additionally, this gets us closer to consistent metric definitions in both realtime and offline systems.

As we accept contributions, we have to have the right checks in place to ensure the data pipeline is reliable and robust. Changes in user events, stream processing jobs or even in the platform can impact metrics, and so it is imperative that we actively monitor the data pipeline and ingestion.

Analysis

Historically, we have relied on conventional statistical tests built into Kayenta to detect metric shifts for the release of new versions of applications. It has served us well over the last few years, however at Netflix we are always looking to improve. Some reasons to explore alternate solutions:

Under the hood ACA uses a fixed horizon statistical hypothesis test which is subject to peeking due to frequent analysis execution during the canary time period. And without a correction, this can erode our false positive guarantees, and the correction itself is a function of the number of peeks — which is not known in advance. This often leads to more false errors in the outcomes.
Due to limited time for the canary, rare event metrics such as errors can often be missing from control or treatment and hence might not get evaluated.
Our intuition suggests any form of metric compression, like aggregating to 1 minute granularity, leads to a loss in power for the analysis, and the tradeoff is that we need more time to confidently detect the metric shifts.

We are actively working on a promising solution to tackle some of these limitations and hope to share more in future.

Orchestration

Orchestration reduces the cognitive load of frequently setting up, executing, analyzing and making decisions for client application canaries. To manage A/B test lifecycle, we decided to build a node.js powered extensible backend service to serve the javascript competency of client engineers while complementing the continuous deployment platform Spinnaker. The drawbacks of most orchestration solutions is the lack of version control and testing. So the main design tenets for this service along with reusability and extensibility are testability and traceability.

Conclusion

Today, most client applications at Netflix use the client canary model to continuously update their applications. We have seen a significant increase in adoption of this methodology over the past 4 years as shown in this cumulative graph of client canary counts.

Year-over-year increase in Client Canaries at Netflix

Time constraints, the need for speed and quality have created several challenges in the client application’s frequent update domain that our team at Netflix aims to solve. We covered some metric related ones in a previous post describing “How Netflix uses Druid for Real-time Insights to Ensure a High-Quality Experience”. We intend to share more in the future diving into the challenges and solutions in the Allocation, Analysis and Orchestration space.

Safe Updates of Client Applications at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Keeping up with your dependencies: building a feedback loop for shared libraries

2021-06-26 Joerg Woehrle

Post Syndicated from Joerg Woehrle original https://aws.amazon.com/blogs/devops/keeping-up-with-your-dependencies-building-a-feedback-loop-for-shared-libraries/

In a microservices world, it’s common to share as little as possible between services. This enables teams to work independently of each other, helps to reduce wait times and decreases coupling between services.

However, it’s also a common scenario that libraries for cross-cutting-concerns (such as security or logging) are developed one time and offered to other teams for consumption. Although it’s vital to offer an opt-out of those libraries (namely, use your own code to address the cross-cutting-concern, such as when there is no version for a given language), shared libraries also provide the benefit of better governance and time savings.

To avoid these pitfalls when sharing artifacts, two points are important:

For consumers of shared libraries, it’s important to stay up to date with new releases in order to benefit from security, performance, and feature improvements.
For producers of shared libraries, it’s important to get quick feedback in case of an involuntarily added breaking change.

Based on those two factors, we’re looking for the following solution:

A frictionless and automated way to update consumer’s code to the latest release version of a given library
Immediate feedback to the library producer in case of a breaking change (the new version of the library breaks the build of a downstream system)

In this blog post I develop a solution that takes care of both those problems. I use Amazon EventBridge to be notified on new releases of a library in AWS CodeArtifact. I use an AWS Lambda function along with an AWS Fargate task to automatically create a pull request (PR) with the new release version on AWS CodeCommit. Finally, I use AWS CodeBuild to kick off a build of the PR and notify the library producer via EventBridge and Amazon Simple Notification Service (Amazon SNS) in case of a failure.

Overview of solution

Let’s start with a short introduction on the services I use for this solution:

CodeArtifact – A fully managed artifact repository service that makes it easy for organizations of any size to securely store, publish, and share software packages used in their software development process. CodeArtifact works with commonly used package managers and build tools like Maven, Gradle, npm, yarn, twine, and pip.
CodeBuild – A fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy.
CodeCommit – A fully-managed source control service that hosts secure Git-based repositories.
EventBridge – A serverless event bus that makes it easy to connect applications together using data from your own applications, integrated software as a service (SaaS) applications, and AWS services. EventBridge makes it easy to build event-driven applications because it takes care of event ingestion and delivery, security, authorization, and error handling.
Fargate – A serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design.
Lambda – Lets you run code without provisioning or managing servers. You pay only for the compute time you consume.
Amazon SNS – A fully managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication.

The resulting flow through the system looks like the following diagram.

Architecture Diagram

In my example, I look at two independent teams working in two different AWS accounts. Team A is the provider of the shared library, and Team B is the consumer.

Let’s do a high-level walkthrough of the involved steps and components:

A new library version is released by Team A and pushed to CodeArtifact.
CodeArtifact creates an event when the new version is published.
I send this event to the default event bus in Team B’s AWS account.
An EventBridge rule in Team B’s account triggers a Lambda function for further processing.
The function filters SNAPSHOT releases (in Maven a SNAPSHOT represents an artifact still under development that doesn’t have a final release yet) and runs an Amazon ECS Fargate task for non-SNAPSHOT versions.
The Fargate task checks out the source that uses the shared library, updates the library’s version in the pom.xml, and creates a pull request to integrate the change into the mainline of the code repository.
The pull request creation results in an event being published.
An EventBridge rule triggers the CodeBuild project of the downstream artifact.
The result of the build is published as an event.
If the build fails, this failure is propagated back to the event bus of Team A.
The failure is forwarded to an SNS topic that notifies the subscribers of the failure.

Amazon EventBridge

A central component of the solution is Amazon EventBridge. I use EventBridge to receive and react on events emitted by the various AWS services in the solution (e.g., whenever a new version of an artifact gets uploaded to CodeArtifact, when a PR is created within CodeCommit or when a build fails in CodeBuild). Let’s have a high-level look on some of the central concepts of EventBridge:

Event Bus – An event bus is a pipeline that receives events. There is a default event bus in each account which receives events from AWS services. One can send events to an event bus via the PutEvents API.
Event – An event indicates a change in e.g., an AWS environment, a SaaS partner service or application or one of your applications.
Rule – A rule matches incoming events on an event bus and sends them to targets for processing. To react on a particular event, one creates a rule which matches this event. To learn more about the rule concept check out Rules on the EventBridge documentation.
Target – When an event matches the event pattern defined in a rule it is send to a target. There are currently more than 20 target types available in EventBridge. In this blog post I use the targets provided for: an event bus in a different account, a Lambda function, a CodeBuild project and an SNS topic. For a detailed list on available targets see Amazon EventBridge targets.

Solution Details:

In this section I walk through the most important parts of the solution. The complete code can be found on GitHub. For a detailed view on the resources created in each account please refer to the GitHub repository.

I use the AWS Cloud Development Kit (CDK) to create my infrastructure. For some of the resource types I create, no higher-level constructs are available yet (at the time of writing, I used AWS CDK version 1.108.1). This is why I sometimes use low-level AWS CloudFormation constructs or even use the provided escape hatches to use AWS CloudFormation constructs directly.

The code for the shared library producer and consumer is written in Java and uses Apache Maven for dependency management. However, the same concepts apply to e.g., Node.js and npm.

Notify another account of new releases

To send events from EventBridge to another account, the receiving account needs to specify an EventBusPolicy. The AWS CDK code on the consumer account looks like the following code:

new events.CfnEventBusPolicy(this, 'EventBusPolicy', {
    statementId: 'AllowCrossAccount',
    action: 'events:PutEvents',
    principal: consumerAccount
});

With that the producer account has the permission to publish events into the event bus of the consumer account.

I’m interested in CodeArtifact events that are published on the release of a new artifact. I first create a Rule which matches those events. Next, I add a target to the rule which targets the event bus of account B. As of this writing there is no CDK construct available to directly add another account as a target. That is why I use the underlying CloudFormation CfnRule to do that. This is called an escape hatch in CDK. For more information about escape hatches, see Escape hatches.

const onLibraryReleaseRule = new events.Rule(this, 'LibraryReleaseRule', {
  eventPattern: {
    source: [ 'aws.codeartifact' ],
    detailType: [ 'CodeArtifact Package Version State Change' ],
    detail: {
      domainOwner: [ this.account ],
      domainName: [ codeArtifactDomain.domainName ],
      repositoryName: [ codeArtifactRepo.repositoryName ],
      packageVersionState: [ 'Published' ],
      packageFormat: [ 'maven' ]
    }
  }
});
/* there is currently no CDK construct provided to add an event bus in another account as a target. 
That's why we use the underlying CfnRule directly */
const cfnRule = onLibraryReleaseRule.node.defaultChild as events.CfnRule;
cfnRule.targets = [ {arn: `arn:aws:events:${this.region}:${consumerAccount}:event-bus/default`, id: 'ConsumerAccount'} ];

For more information about event formats, see CodeArtifact event format and example.

Act on new releases in the consumer account

I established the connection between the events produced by Account A and Account B: The events now are available in Account B’s event bus. To use them, I add a rule which matches this event in Account B:

const onLibraryReleaseRule = new events.Rule(this, 'LibraryReleaseRule', {
  eventPattern: {
    source: [ 'aws.codeartifact' ],
    detailType: [ 'CodeArtifact Package Version State Change' ],
    detail: {
      domainOwner: [ producerAccount ],
      packageVersionState: [ 'Published' ],
      packageFormat: [ 'maven' ]
    }
  }
});

Add a Lambda function target

Now that I created a rule to trigger anytime a new package version is published, I will now add an EventBridge target which triggers my runTaskLambda Lambda Function. The below CDK code shows how I add our Lambda function as a target to the onLibraryRelease rule. Notice how I extract information from the event’s payload and pass it into the Lambda function’s invocation event.

onLibraryReleaseRule.addTarget(
    new targets.LambdaFunction( runTaskLambda,{
      event: events.RuleTargetInput.fromObject({
        groupId: events.EventField.fromPath('$.detail.packageNamespace'),
        artifactId: events.EventField.fromPath('$.detail.packageName'),
        version: events.EventField.fromPath('$.detail.packageVersion'),
        repoUrl: codeCommitRepo.repositoryCloneUrlHttp,
        region: this.region
      })
    }));

Filter SNAPSHOT versions

Because I’m not interested in Maven SNAPSHOT versions (such as 1.0.1-SNAPSHOT), I have to find a way to filter those and only act upon non-SNAPSHOT versions. Even though content-based filtering on event patterns is supported by Amazon EventBridge, filtering on suffixes is not supported as of this writing. This is why the Lambda function filters SNAPSHOT versions and only acts upon real, non-SNAPSHOT, releases. For those, I start a custom Amazon ECS Fargate task by using the AWS JavaScript SDK. My function passes some environment overrides to the Fargate task in order to have the required information about the artifact available at runtime.

In the following function code, I pass all required information to create a pull request into the environment of the Fargate task:

const AWS = require('aws-sdk');

const ECS = new AWS.ECS();
exports.handler = async (event) => {
    console.log(`Received event: ${JSON.stringify(event)}`)
    const artifactVersion = event.version;
    const artifactId = event.artifactId;
    if ( artifactVersion.indexOf('SNAPSHOT') > -1 ) {
        console.log(`Skipping SNAPSHOT version ${artifactVersion}`)
    } else {
        console.log(`Triggering task to create pull request for version ${artifactVersion} of artifact ${artifactId}`);
        const params = {
            launchType: 'FARGATE',
            taskDefinition: process.env.TASK_DEFINITION_ARN,
            cluster: process.env.CLUSTER_ARN,
            networkConfiguration: {
                awsvpcConfiguration: {
                    subnets: process.env.TASK_SUBNETS.split(',')
                }
            },
            overrides: {
                containerOverrides: [ {
                    name: process.env.CONTAINER_NAME,
                    environment: [
                        {name: 'REPO_URL', value: process.env.REPO_URL},
                        {name: 'REPO_NAME', value: process.env.REPO_NAME},
                        {name: 'REPO_REGION', value: process.env.REPO_REGION},
                        {name: 'ARTIFACT_VERSION', value: artifactVersion},
                        {name: 'ARTIFACT_ID', value: artifactId}
                    ]
                } ]
            }
        };
        await ECS.runTask(params).promise();
    }
};

Create the pull request

With the environment set, I can use a simple bash script inside the container to create a new Git branch, update the pom.xml with the new dependency version, push the branch to CodeCommit, and use the AWS Command Line Interface (AWS CLI) to create the pull request. The Docker entrypoint looks like the following code:

#!/usr/bin/env bash
set -e

# clone the repository and create a new branch for the change
git clone --depth 1 $REPO_URL repo && cd repo
branch="library_update_$(date +"%Y-%m-%d_%H-%M-%S")"
git checkout -b "$branch"

# replace whatever version is currently used by the new version of the library
sed -i "s/<shared\.library\.version>.*<\/shared\.library\.version>/<shared\.library\.version>${ARTIFACT_VERSION}<\/shared\.library\.version>/g" pom.xml

# stage, commit and push the change
git add pom.xml
git -c "user.name=ECS Pull Request Creator" -c "[email protected]" commit -m "Update version of ${ARTIFACT_ID} to ${ARTIFACT_VERSION}"
git push --set-upstream origin "$branch"

# create pull request
aws codecommit create-pull-request --title "Update version of ${ARTIFACT_ID} to ${ARTIFACT_VERSION}" --targets repositoryName="$REPO_NAME",sourceReference="$branch",destinationReference=main --region "$REPO_REGION"

After a successful run, I can check the CodeCommit UI for the created pull request. The following screenshot shows the changes introduced by one of my pull requests during testing:

Screenshot of the Pull Request in AWS CodeCommit

Now that I have the pull request in place, I want to verify that the dependency update does not break my consumer code. I do this by triggering a CodeBuild project with the help of EventBridge.

Build the pull request

The ingredients I use are the same as with the CodeArtifact event. I create a rule that matches the event emitted by CodeCommit (limiting it to branches that match the prefix used by our Fargate task). Afterwards I add a target to the rule to start the CodeBuild project:

const onPullRequestCreatedRule = new events.Rule(this, 'PullRequestCreatedRule', {
  eventPattern: {
    source: [ 'aws.codecommit' ],
    detailType: [ 'CodeCommit Pull Request State Change' ],
    resources: [ codeCommitRepo.repositoryArn ],
    detail: {
      event: [ 'pullRequestCreated' ],
      sourceReference: [ {
        prefix: 'refs/heads/library_update_'
      } ],
      destinationReference: [ 'refs/heads/main' ]
    }
  }
});
onPullRequestCreatedRule.addTarget( new targets.CodeBuildProject(codeBuild, {
  event: events.RuleTargetInput.fromObject( {
    projectName: codeBuild.projectName,
    sourceVersion: events.EventField.fromPath('$.detail.sourceReference')
  })
}));

This triggers the build whenever a new pull request is created with a branch prefix of refs/head/library_update_.
You can easily add the build results as a comment back to CodeCommit. For more information, see Validating AWS CodeCommit Pull Requests with AWS CodeBuild and AWS Lambda.

My last step is to notify an SNS topic in in case of a failing build. The SNS topic is a resource in Account A. To target a resource in a different account I need to forward the event to this account’s event bus. From there I then target the SNS topic.

First, I forward the failed build event from Account B into the default event bus of Account A:

const onFailedBuildRule = new events.Rule(this, 'BrokenBuildRule', {
  eventPattern: {
    detailType: [ 'CodeBuild Build State Change' ],
    source: [ 'aws.codebuild' ],
    detail: {
      'build-status': [ 'FAILED' ]
    }
  }
});
const producerAccountTarget = new targets.EventBus(events.EventBus.fromEventBusArn(this, 'cross-account-event-bus', `arn:aws:events:${this.region}:${producerAccount}:event-bus/default`))
onFailedBuildRule.addTarget(producerAccountTarget);

Then I target the SNS topic in Account A to be notified of failures:

const onFailedBuildRule = new events.Rule(this, 'BrokenBuildRule', {
  eventPattern: {
    detailType: [ 'CodeBuild Build State Change' ],
    source: [ 'aws.codebuild' ],
    account: [ consumerAccount ],
    detail: {
      'build-status': [ 'FAILED' ]
    }
  }
});
onFailedBuildRule.addTarget(new targets.SnsTopic(notificationTopic));

See it in action

I use the cdk-assume-role-credential-plugin to deploy to both accounts, producer and consumer, with a single CDK command issued to the producer account. To do this I create roles for cross account access from the producer account in the consumer account as described here. I also make sure that the accounts are bootstrapped for CDK as described here. After that I run the following steps:

Deploy the Stacks:
cd cdk && cdk deploy --context region=<YOUR_REGION> --context producerAccount=<PRODUCER_ACCOUNT_NO> --context consumerAccount==<CONSUMER_ACCOUNT_NO> --all && cd -
After a successful deployment CDK prints a set of export commands. I set my environment from those Outputs:
❯ export CODEARTIFACT_ACCOUNT=<MY_PRODUCER_ACCOUNT>
❯ export CODEARTIFACT_DOMAIN=<MY_CODEARTIFACT_DOMAIN>
❯ export CODEARTIFACT_REGION=<MY_REGION>
❯ export CODECOMMIT_URL=<MY_CODECOMMIT_URL>
Setup Maven to authenticate to CodeArtifact
export CODEARTIFACT_TOKEN=$(aws codeartifact get-authorization-token --domain $CODEARTIFACT_DOMAIN --domain-owner $CODEARTIFACT_ACCOUNT --query authorizationToken --output text)
Release the first version of the shared library to CodeArtifact:
cd library_producer/library && mvn --settings ./settings.xml deploy && cd -
From a console which is authenticated/authorized for CodeCommit in the Consumer Account
1. Setup git to work with CodeCommit
2. Push the code of the library consumer to CodeCommit:
  cd library_consumer/library && git init && git add . && git commit -m "Add consumer to codecommit" && git remote add codecommit $CODECOMMIT_URL && git push --set-upstream codecommit main && cd -
Release a new version of the shared library:
cd library_producer/library && sed -i '' 's/<version>1.0.0/<version>1.0.1/' pom.xml && mvn --settings settings.xml deploy && cd -
After 1-3 minutes a Pull Request is created in the CodeCommit repo in the Consumer Account and a build is run to verify this PR:
In case of a build failure, you can create a subscription to the SNS topic in Account A to act upon the broken build.

Clean up

In case you followed along with this blog post and want to prevent incurring costs you have to delete the created resources. Run cdk destroy --context region=<YOUR_REGION> --context producerAccount=<PRODUCER_ACCOUNT_NO> --context consumerAccount==<CONSUMER_ACCOUNT_NO> --all to delete the CloudFormation stacks.

Conclusion

In this post, I automated the manual task of updating a shared library dependency version. I used a workflow that not only updates the dependency version, but also notifies the library producer in case the new artifact introduces a regression (for example, an API incompatibility with an older version). By using Amazon EventBridge I’ve created a loosely coupled solution which can be used as a basis for a feedback loop between library creators and consumers.

What next?

To improve the solution, I suggest to look into possibilities of error handling for the Fargate task. What happens if the git operation fails? How do we signal such a failure? You might want to replace the AWS Fargate portion with a Lambda-only solution and use AWS Step Functions for better error handling.

As a next step, I could think of a solution that automates updates for libraries stored in Maven Central. Wouldn’t it be nice to never miss the release of a new Spring Boot version? A Fargate task run on a schedule and the following code should get you going:

curl -sS 'https://search.maven.org/solrsearch/select?q=g:org.springframework.boot%20a:spring-boot-starter&start=0&rows=1&wt=json' | jq -r '.response.docs[ 0 ].latestVersion'

Happy Building!

Author bio

Joerg is a Solutions Architect at AWS and works with manufacturing customers in Germany. As a former Developer, DevOps Engineer and SRE he enjoys building and automating things.

Our Journey to Continuous Delivery at Grab (Part 2)

2021-05-10 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab-part2

In the first part of this blog post, you’ve read about the improvements made to our build and staging deployment process, and how plenty of manual tasks routinely taken by engineers have been automated with Conveyor: an in-house continuous delivery solution.

This new post begins with the introduction of the hermeticity principle for our deployments, and how it improves the confidence with promoting changes to production. Changes sent to production via Conveyor’s deployment pipelines are then described in detail.

Finally, looking back at the engineering efficiency improvements around velocity and reliability over the last 2 years, we answer the big question – was the investment on a custom continuous delivery solution like Conveyor the right decision for Grab?

Improving Confidence in our Production Deployments with Hermeticity

The term deployment hermeticity is borrowed from build systems. A build system is called hermetic if builds always produce the same artefacts regardless of changes in the environment they run on. Similarly, we call our deployments hermetic if they always result in the same deployed artefacts regardless of the environment’s change or the number of times they are executed.

The behaviour of a service is rarely controlled by a single variable. The application that makes up your service is an important driver of its behaviour, but its configuration is an important contributor, for example. The behaviour for traditional microservices at Grab is dictated mainly by 3 versioned artefacts: application code, static and dynamic configuration.

Conveyor has been integrated with the systems that operate changes in each of these parameters. By tracking all 3 parameters at every deployment, Conveyor can reproducibly deploy microservices with similar behaviour: its deployments are hermetic.

Building upon this property, Conveyor can ensure that all deployments made to production have been tested before with the same combination of parameters. This is valuable to us:

An outcome of staging deployments for a specific set of parameters is a good predictor of outcomes in production deployments for the same set of parameters and thus it makes testing in staging more relevant.
Rollbacks are hermetic; we never rollback to a combination of parameters that has not been used previously.

In the past, incidents had resulted from an application rollback not compatible with the current dynamic configuration version; this was aggravating since rollbacks are expected to be a safe recovery mechanism. The introduction of hermetic deployments has largely eliminated this category of problems.

Hermeticity is maintained by registering the deployment parameters as artefacts after each successfully completed pipeline. Users must then select one of the registered deployment metadata to promote to production.

At this point, you might be wondering: why not use a single pipeline that includes both staging and production deployments? This was indeed how it started, with a single pipeline spanning multiple environments. However, engineers soon complained about it.

The most obvious reason for the complaint was that less than 20% of changes deployed in staging will make their way to production. This meant that engineers would have toil associated with each completed staging deployment since the pipeline must be manually cancelled rather than continued to production.

The other reason is that this multi-environment pipeline approach reduced flexibility when promoting changes to production. There are different ways to apply changes to a cluster. For example, lengthy pipelines that refresh instances can be used to deploy any combination of changes, while there are quicker pipelines restricted to dynamic configuration changes (such as feature flags rollouts). Regardless of the order in which the changes are made and how they are applied, Conveyor tracks the change.

Eventually, engineers promote a deployment artefact to production. However they do not need to apply changes in the same sequence with which were applied to staging. Furthermore, to prevent erroneous actions, Conveyor presents only changes that can be applied with the requested pipeline (and sometimes, no changes are available). Not being forced into a specific method of deploying changes is one of added benefits of hermetic deployments.

Returning to Our Journey Towards Engineering Efficiency

If you can recall, the first part of this blog post series ended with a description of staging deployment. Our deployment to production starts with a verification that we uphold our hermeticity principle, as explained above.

Our production deployment pipelines can run for several hours for large clusters with rolling releases (few run for days), so we start by acquiring locks to ensure there are no concurrent deployments for any given cluster.

Before making any changes to the environment, we automatically generate release notes, giving engineers a chance to abort if the wrong set of changes are sent to production.

The pipeline next waits for a deployment slot. Early on, engineers adopted deployment windows that coincide with office hours, such that if anything goes wrong, there is always someone on hand to help. Prior to the introduction of Conveyor, however, engineers would manually ask a Slack bot for approval. This interaction is now automated, and the only remaining action left is for the engineer to approve that the deployment can proceed via a single click, in line with our hands-off deployment principle.

When the canary is in production, Conveyor automates monitoring it. This process is similar to the one already discussed in the first part of this blog post: Engineers can configure a set of alerts that Conveyor will keep track of. As soon as any one of the alerts is triggered, Conveyor automatically rolls back the service.

If no alert is raised for the duration of the monitoring period, Conveyor waits again for a deployment slot. It then publishes the release notes for that deployment and completes the deployments for the cluster. After the lock is released and the deployment registered, the pipeline finally comes to its successful completion.

Benefits of Our Journey Towards Engineering Efficiency

All these improvements made over the last 2 years have reduced the effort spent by engineers on deployment while also reducing the failure rate of our deployments.

If you are an engineer working on DevOps in your organisation, you know how hard it can be to measure the impact you made on your organisation. To estimate the time saved by our pipelines, we can model the activities that were previously done manually with a rudimentary weighted graph. In this graph, each edge carries a probability of the activity being performed (100% when unspecified), while each vertex carries the time taken for that activity.

Focusing on our regular staging deployments only, such a graph would look like this:

The overall amount of effort automated by the staging pipelines () is represented in the graph above. It can be converted into the equation below:

This equation shows that for each staging deployment, around 16 minutes of work have been saved. Similarly, for regular production deployments, we find that 67 minutes of work were saved for each deployment:

Moreover, efficiency was not the only benefit brought by the use of deployment pipelines for our traditional microservices. Surprisingly perhaps, the rate of failures related to production changes is progressively reducing while the amount of production changes that were made with Conveyor increased across the organisation (starting at 1.5% of failures per deployments, and finishing at 0.3% on average over the last 3 months for the period of data collected):

Keep Calm and Automate

Since the first draft for this post was written, we’ve made many more improvements to our pipelines. We’ve begun automating Database Migrations; we’ve extended our set of hermetic variables to Amazon Machine Image (AMI) updates; and we’re working towards supporting container deployments.

Through automation, all of Conveyor’s deployment pipelines have contributed to save more than 5,000 man-days of efforts in 2020 alone, across all supported teams. That’s around 20 man-years worth of effort, which is around 3 times the capacity of the team working on the project! Investments in our automation pipelines have more than paid for themselves, and the gains go up every year as more workflows are automated and more teams are onboarded.

If Conveyor has saved efforts for engineering teams, has it then helped to improve velocity? I had opened the first part of this blog post with figures on the deployment funnel for microservice teams at Grab, towards the end of 2018. So where do the figures stand today for these teams?

In the span of 2 years, the average number of build and staging deployment performed each day has not varied much. However, in the last 3 months of 2020, engineers have sent twice more changes to production than they did for the same period in 2018.

Perhaps the biggest recognition received by the team working on the project, was from Grab’s engineers themselves. In the 2020 internal NPS survey for engineering experience at Grab, Conveyor received the highest score of any tools (built in-house or not).

All these improvements in efficiency for our engineers would never have been possible without the hard work of all team members involved in the project, past and present: Tanun Chalermsinsuwan, Aufar Gilbran, Deepak Ramakrishnaiah, Repon Kumar Roy (Kowshik), Su Han, Voislav Dimitrijevikj, Stanley Goh, Htet Aung Shine, Evan Sebastian, Qijia Wang, Oscar Ng, Jacob Sunny, Subhodip Mandal and many others who have contributed and collaborated with them.

Join Us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Integrate GitHub monorepo with AWS CodePipeline to run project-specific CI/CD pipelines

2021-04-27 Vivek Kumar

Post Syndicated from Vivek Kumar original https://aws.amazon.com/blogs/devops/integrate-github-monorepo-with-aws-codepipeline-to-run-project-specific-ci-cd-pipelines/

AWS CodePipeline is a continuous delivery service that enables you to model, visualize, and automate the steps required to release your software. With CodePipeline, you model the full release process for building your code, deploying to pre-production environments, testing your application, and releasing it to production. CodePipeline then builds, tests, and deploys your application according to the defined workflow either in manual mode or automatically every time a code change occurs. A lot of organizations use GitHub as their source code repository. Some organizations choose to embed multiple applications or services in a single GitHub repository separated by folders. This method of organizing your source code in a repository is called a monorepo.

This post demonstrates how to customize GitHub events that invoke a monorepo service-specific pipeline by reading the GitHub event payload using AWS Lambda.

Solution overview

With the default setup in CodePipeline, a release pipeline is invoked whenever a change in the source code repository is detected. When using GitHub as the source for a pipeline, CodePipeline uses a webhook to detect changes in a remote branch and starts the pipeline. When using a monorepo style project with GitHub, it doesn’t matter which folder in the repository you change the code, CodePipeline gets an event at the repository level. If you have a continuous integration and continuous deployment (CI/CD) pipeline for each of the applications and services in a repository, all pipelines detect the change in any of the folders every time. The following diagram illustrates this scenario.

GitHub monorepo folder structure

This post demonstrates how to customize GitHub events that invoke a monorepo service-specific pipeline by reading the GitHub event payload using Lambda. This solution has the following benefits:

Add customizations to start pipelines based on external factors – You can use custom code to evaluate whether a pipeline should be triggered. This allows for further customization beyond polling a source repository or relying on a push event. For example, you can create custom logic to automatically reschedule deployments on holidays to the next available workday.
Have multiple pipelines with a single source – You can trigger selected pipelines when multiple pipelines are listening to a single GitHub repository. This lets you group small and highly related but independently shipped artifacts such as small microservices without creating thousands of GitHub repos.
Avoid reacting to unimportant files – You can avoid triggering a pipeline when changing files that don’t affect the application functionality (such as documentation, readme, PDF, and .gitignore files).

In this post, we’re not debating the advantages or disadvantages of a monorepo versus a single repo, or when to create monorepos or single repos for each application or project.

Sample architecture

This post focuses on controlling running pipelines in CodePipeline. CodePipeline can have multiple stages like test, approval, and deploy. Our sample architecture considers a simple pipeline with two stages: source and build.

Github monorepo - CodePipeline Sample Architecture

This solution is made up of following parts:

An Amazon API Gateway endpoint (3) is backed by a Lambda function (5) to receive and authenticate GitHub webhook push events (2)
The same function evaluates incoming GitHub push events and starts the pipeline on a match
An Amazon Simple Storage Service (Amazon S3) bucket (4) stores the CodePipeline-specific configuration files
The pipeline contains a build stage with AWS CodeBuild

Normally, after you create a CI/CD pipeline, it automatically triggers a pipeline to release the latest version of your source code. From then on, every time you make a change in your source code, the pipeline is triggered. You can also manually run the last revision through a pipeline by choosing Release change on the CodePipeline console. This architecture uses the manual mode to run the pipeline. GitHub push events and branch changes are evaluated by the Lambda function to avoid commits that change unimportant files from starting the pipeline.

Creating an API Gateway endpoint

We need a single API Gateway endpoint backed by a Lambda function with the responsibility of authenticating and validating incoming requests from GitHub. You can authenticate requests using HMAC security or GitHub Apps. API Gateway only needs one POST method to consume GitHub push events, as shown in the following screenshot.

Creating an API Gateway endpoint

Creating the Lambda function

This Lambda function is responsible for authenticating and evaluating the GitHub events. As part of the evaluation process, the function can parse through the GitHub events payload, determine which files are changed, added, or deleted, and perform the appropriate action:

Start a single pipeline, depending on which folder is changed in GitHub
Start multiple pipelines
Ignore the changes if non-relevant files are changed

You can store the project configuration details in Amazon S3. Lambda can read this configuration to decide what needs to be done when a particular folder is matched from a GitHub event. The following code is an example configuration:

{

"GitHubRepo": "SampleRepo",

"GitHubBranch": "main",

"ChangeMatchExpressions": "ProjectA/.*",

"IgnoreFiles": "*.pdf;*.md",

"CodePipelineName": "ProjectA - CodePipeline"

}

For more complex use cases, you can store the configuration file in Amazon DynamoDB.

The following is the sample Lambda function code in Python 3.7 using Boto3:

def lambda_handler(event, context):

    import json
    modifiedFiles = event["commits"][0]["modified"]
    #full path
    for filePath in modifiedFiles:
        # Extract folder name
        folderName = (filePath[:filePath.find("/")])
        break

    #start the pipeline
    if len(folderName)>0:
        # Codepipeline name is foldername-job. 
        # We can read the configuration from S3 as well. 
        returnCode = start_code_pipeline(folderName + '-job')

    return {
        'statusCode': 200,
        'body': json.dumps('Modified project in repo:' + folderName)
    }
    

def start_code_pipeline(pipelineName):
    client = codepipeline_client()
    response = client.start_pipeline_execution(name=pipelineName)
    return True

cpclient = None
def codepipeline_client():
    import boto3
    global cpclient
    if not cpclient:
        cpclient = boto3.client('codepipeline')
    return cpclient

Creating a GitHub webhook

GitHub provides webhooks to allow external services to be notified on certain events. For this use case, we create a webhook for a push event. This generates a POST request to the URL (API Gateway URL) specified for any files committed and pushed to the repository. The following screenshot shows our webhook configuration.

Creating a GitHub webhook2

Conclusion

In our sample architecture, two pipelines monitor the same GitHub source code repository. A Lambda function decides which pipeline to run based on the GitHub events. The same function can have logic to ignore unimportant files, for example any readme or PDF files.

Using API Gateway, Lambda, and Amazon S3 in combination serves as a general example to introduce custom logic to invoke pipelines. You can expand this solution for increasingly complex processing logic.

About the Author

Vivek Kumar

Vivek is a Solutions Architect at AWS based out of New York. He works with customers providing technical assistance and architectural guidance on various AWS services. He brings more than 23 years of experience in software engineering and architecture roles for various large-scale enterprises.

Gaurav-Sharma

Gaurav is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

Nitin-Aggarwal

Nitin is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

Integrating AWS Device Farm with your CI/CD pipeline to run cross-browser Selenium tests

2021-03-04 Mahesh Biradar

Post Syndicated from Mahesh Biradar original https://aws.amazon.com/blogs/devops/integrating-aws-device-farm-with-ci-cd-pipeline-to-run-cross-browser-selenium-tests/

Continuously building, testing, and deploying your web application helps you release new features sooner and with fewer bugs. In this blog, you will create a continuous integration and continuous delivery (CI/CD) pipeline for a web app using AWS CodeStar services and AWS Device Farm’s desktop browser testing service. AWS CodeStar is a suite of services that help you quickly develop and build your web apps on AWS.

AWS Device Farm’s desktop browser testing service helps developers improve the quality of their web apps by executing their Selenium tests on different desktop browsers hosted in AWS. For each test executed on the service, Device Farm generates action logs, web driver logs, video recordings, to help you quickly identify issues with your app. The service offers pay-as-you-go pricing so you only pay for the time your tests are executing on the browsers with no upfront commitments or additional costs. Furthermore, Device Farm provides a default concurrency of 50 Selenium sessions so you can run your tests in parallel and speed up the execution of your test suites.

After you’ve completed the steps in this blog, you’ll have a working pipeline that will build your web app on every code commit and test it on different versions of desktop browsers hosted in AWS.

Solution overview

In this solution, you use AWS CodeStar to create a sample web application and a CI/CD pipeline. The following diagram illustrates our solution architecture.

Figure 1: Deployment pipeline architecture

Prerequisites

Before you deploy the solution, complete the following prerequisite steps:

On the AWS Management Console, search for AWS CodeStar.
Choose Getting Started.
Choose Create Project.
Choose a sample project.

For this post, we choose a Python web app deployed on Amazon Elastic Compute Cloud (Amazon EC2) servers.

For Templates, select Python (Django).
Choose Next.

Figure 2: Choose a template in CodeStar

For Project name, enter Demo CICD Selenium.
Leave the remaining settings at their default.

You can choose from a list of existing key pairs. If you don’t have a key pair, you can create one.

Choose Next.
Choose Create project.

Figure 3: Verify the project configuration

AWS CodeStar creates project resources for you as listed in the following table.

Service	Resource Created
AWS CodePipeline project	demo-cicd-selen-Pipeline
AWS CodeCommit repository	Demo-CICD-Selenium
AWS CodeBuild project	demo-cicd-selen
AWS CodeDeploy application	demo-cicd-selen
AWS CodeDeploy deployment group	demo-cicd-selen-Env
Amazon EC2 server	Tag: Environment = demo-cicd-selen-WebApp
IAM role	AWSCodeStarServiceRole, CodeStarWorker*

Your EC2 instance needs to have access to run AWS Device Farm to run the Selenium test scripts. You can use service roles to achieve that.

Attach policy AWSDeviceFarmFullAccess to the IAM role CodeStarWorker-demo-cicd-selen-WebApp.

You’re now ready to create an AWS Cloud9 environment.

Check the Pipelines tab of your AWS CodeStar project; it should show success.

On the IDE tab, under Cloud 9 environments, choose Create environment.
For Environment name, enter Demo-CICD-Selenium.
Choose Create environment.
Wait for the environment to be complete and then choose Open IDE.
In the IDE, follow the instructions to set up your Git and make sure it’s up to date with your repo.

The following screenshot shows the verification that the environment is set up.

Figure 4: Verify AWS Cloud9 setup

You can now verify the deployment.

On the Amazon EC2 console, choose Instances.
Choose demo-cicd-selen-WebApp.
Locate its public IP and choose open address.

Figure 5: Locating IP address of the instance

A webpage should open. If it doesn’t, check your VPN and firewall settings.

Now that you have a working pipeline, let’s move on to creating a Device Farm project for browser testing.

Creating a Device Farm project

To create your browser testing project, complete the following steps:

On the Device Farm console, choose Desktop browser testing project.
Choose Create a new project.
For Project name, enter Demo cicd selenium.
Choose Create project.

Figure 6: Creating AWS Device Farm project

Note down the project ARN; we use this in the Selenium script for the remote web driver.

Figure 7: Project ARN for AWS Device Farm project

Testing the solution

This solution uses the following script to run browser testing. We call this script in the validate service lifecycle hook of the CodeDeploy project.

Open your AWS Cloud9 IDE (which you made as a prerequisite).
Create a folder tests under the project root directory.
Add the sample Selenium script browser-test-sel.py under tests folder with the following content (replace <sample_url> with the url of your web application refer pre-requisite step 18):

import boto3
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

devicefarm_client = boto3.client("devicefarm", region_name="us-west-2")
testgrid_url_response = devicefarm_client.create_test_grid_url(
    projectArn="arn:aws:devicefarm:us-west-2:<your project ARN>",
    expiresInSeconds=300)

driver = webdriver.Remote(testgrid_url_response["url"],
                          webdriver.DesiredCapabilities.FIREFOX)
try:
    driver.implicitly_wait(30)
    driver.maximize_window()
    driver.get("<sample_url>")
    if driver.find_element_by_id("Layer_1"):
        print("graphics generated in full screen")
    assert driver.find_element_by_id("Layer_1")
    driver.set_window_position(0, 0) and driver.set_window_size(1000, 400)
    driver.get("<sample_url>")
    tower = driver.find_element_by_id("tower1")
    if tower.is_displayed():
        print("graphics generated after resizing")
    else:
        print("graphics not generated at this window size")
       # this is where you can fail the script with error if you expect the graphics to load. And pipeline will terminate
except Exception as e:
    print(e)
finally:
    driver.quit()

This script launches a website (in Firefox) created by you in the prerequisite step using the AWS CodeStar template and verifies if a graphic element is loaded.

The following screenshot shows a full-screen application with graphics loaded.

Figure 8: Full screen application with graphics loaded

The following screenshot shows the resized window with no graphics loaded.

Figure 9: Resized window application with No graphics loaded

Create the file validate_service in the scripts folder under the root directory with the following content:

#!/bin/bash
if [ "$DEPLOYMENT_GROUP_NAME" == "demo-cicd-selen-Env" ]
then
cd /home/ec2-user
source environment/bin/activate
python tests/browser-test-sel.py
fi

This script is part of CodeDeploy scripts and determines whether to stop the pipeline or continue based on the output from the browser testing script from the preceding step.

Modify the file appspec.yml under the root directory, add the tests files and ValidateService hook , the file should look like following:

version: 0.0
os: linux
files:
 - source: /ec2django/
   destination: /home/ec2-user/ec2django
 - source: /helloworld/
   destination: /home/ec2-user/helloworld
 - source: /manage.py
   destination: /home/ec2-user
 - source: /supervisord.conf
   destination: /home/ec2-user
 - source: /requirements.txt
   destination: /home/ec2-user
 - source: /requirements/
   destination: /home/ec2-user/requirements
 - source: /tests/
   destination: /home/ec2-user/tests

permissions:
  - object: /home/ec2-user/manage.py
    owner: ec2-user
    mode: 644
    type:
      - file
  - object: /home/ec2-user/supervisord.conf
    owner: ec2-user
    mode: 644
    type:
      - file
hooks:
  AfterInstall:
    - location: scripts/install_dependencies
      timeout: 300
      runas: root
    - location: scripts/codestar_remote_access
      timeout: 300
      runas: root
    - location: scripts/start_server
      timeout: 300
      runas: root

  ApplicationStop:
    - location: scripts/stop_server
      timeout: 300
      runas: root

  ValidateService:
    - location: scripts/validate_service
      timeout: 600
      runas: root

This file is used by AWS CodeDeploy service to perform the deployment and validation steps.

Modify the artifacts section in the buildspec.yml file. The section should look like the following:

artifacts:
  files:
    - 'template.yml'
    - 'ec2django/**/*'
    - 'helloworld/**/*'
    - 'scripts/**/*'
    - 'tests/**/*'
    - 'appspec.yml'
    - 'manage.py'
    - 'requirements/**/*'
    - 'requirements.txt'
    - 'supervisord.conf'
    - 'template-configuration.json'

This file is used by AWS CodeBuild service to package the code

Modify the file Common.txt in the requirements folder under the root directory, the file should look like the following:

# dependencies common to all environments 
Django=2.1.15 
selenium==3.141.0 
boto3 >= 1.10.44
pytest

Save All the changes, your folder structure should look like the following:

── README.md
├── appspec.yml*
├── buildspec.yml*
├── db.sqlite3
├── ec2django
├── helloworld
├── manage.py
├── requirements
│   ├── common.txt*
│   ├── dev.txt
│   ├── prod.txt
├── requirements.txt
├── scripts
│   ├── codestar_remote_access
│   ├── install_dependencies
│   ├── start_server
│   ├── stop_server
│   └── validate_service**
├── supervisord.conf
├── template-configuration.json
├── template.yml
├── tests
│   └── browser-test-sel.py**

**newly added files
*modified file

Running the tests

The Device Farm desktop browsing project is now integrated with your pipeline. All you need to do now is commit the code, and CodePipeline takes care of the rest.

On Cloud9 terminal, go to project root directory.
Run git add . to stage all changed files for commit.
Run git commit -m “<commit message>” to commit the changes.
Run git push to push the changes to repository, this should trigger the Pipeline.
Check the Pipelines tab of your AWS CodeStar project; it should show success.
Go to AWS Device Farm console and click on Desktop browser testing projects.
Click on your Project Demo cicd selenium.

You can verify the running of your Selenium test cases using the recorded run steps shown on the Device Farm console, the video of the run, and the logs, all of which can be downloaded on the Device Farm console, or using the AWS SDK and AWS Command Line Interface (AWS CLI).

The following screenshot shows the project run details on the console.

Figure 10: Viewing AWS Device Farm project run details

To Test the Selenium script locally you can run the following commands.

1. Create a Python virtual environment for your Django project. This virtual environment allows you to isolate this project and install any packages you need without affecting the system Python installation. At the terminal, go to project root directory and type the following command:

$ python3 -m venv ./venv

2. Activate the virtual environment:

$ source ./venv/bin/activate

3. Install development Python dependencies for this project:

$ pip install -r requirements/dev.txt

4. Run Selenium Script:

$ python tests/browser-test-sel.py

Testing the failure scenario (optional)

To test the failure scenario, you can modify the sample script browser-test-sel.py at the else statement.

The following code shows the lines to change:

else:
print("graphics was not generated at this form size")
# this is where you can fail the script with error if you expect the graphics to load. And pipeline will terminate

The following is the updated code:

else:
exit(1)
# this is where you can fail the script with error if you expect the graphics to load. And pipeline will terminate

Commit the change, and the pipeline should fail and stop the deployment.

Conclusion

Integrating Device Farm with CI/CD pipelines allows you to control deployment based on browser testing results. Failing the Selenium test on validation failures can stop and roll back the deployment, and a successful testing can continue the pipeline to deploy the solution to the final stage. Device Farm offers a one-stop solution for testing your native and web applications on desktop browsers and real mobile devices.

Mahesh Biradar is a Solutions Architect at AWS. He is a DevOps enthusiast and enjoys helping customers implement cost-effective architectures that scale..

Our Journey to Continuous Delivery at Grab (Part 1)

2020-09-23 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab

This blog post is a two-part presentation of the effort that went into improving the continuous delivery processes for backend services at Grab in the past two years. In the first part, we take stock of where we started two years ago and describe the software and tools we created while introducing some of the integrations we’ve done to automate our software delivery in our staging environment.

Continuous Delivery is the principle of delivering software often, every day.

As a backend engineer at Grab, nothing matters more than the ability to innovate quickly and safely. Around the end of 2018, Grab’s transportation and deliveries backend architecture consisted of roughly 270 services (the majority being microservices). The deployment process was lengthy, required careful inputs and clear communication. The care needed to push changes in production and the risk associated with manual operations led to the introduction of a Slack bot to coordinate deployments. The bot ensures that deployments occur only during off-peak and within work hours:

Once the build was completed, engineers who desired to deploy their software to the Staging environment would copy release versions from the build logs, and paste them in a Jenkins job’s parameter. Tests needed to be manually triggered from another dedicated Jenkins job.

Prior to production deployments, engineers would generate their release notes via a script and update them manually in a wiki document. Deployments would be scheduled through interactions with a Slack bot that controls release notes and deployment windows. Production deployments were made once again by pasting the correct parameters into two dedicated Jenkins jobs, one for the canary (a.k.a. one-box) deployment and the other for the full deployment, spread one hour apart. During the monitoring phase, engineers would continuously observe metrics reported on our dashboards.

In spite of the fragmented process and risky manual operations impacting our velocity and stability, around 614 builds were running each business day and changes were deployed on our staging environment at an average rate of 300 new code releases per business day, while production changes averaged a rate of 28 new code releases per business day.

*Our Deployment Funnel, Towards the End of 2018*

These figures meant that, on average, it took 10 business days between each service update in production, and only 10% of the staging deployments were eventually promoted to production.

Automating Continuous Deployments at Grab

With an increased focus on Engineering efficiency, in 2018 we started an internal initiative to address frictions in deployments that became known as Conveyor. To build Conveyor with a small team of engineers, we had to rely on an already mature platform which exhibited properties that are desirable to us to achieve our mission.

Hands-off deployments

Deployments should be an afterthought. Engineers should be as removed from the process as possible, and whenever possible, decisions should be taken early, during the code review process. The machine will do the heavy lifting, and only when it can’t decide for itself, should the engineer be involved. Notifications can be leveraged to ensure that engineers are only informed when something goes wrong and a human decision is required.

Confidence in Deployments

Grab’s focus on gathering internal Engineering NPS feedback helped us collect valuable metrics. One of the metrics we cared about was our engineers’ confidence in their production deployments. A team’s entire deployment process to production could last for more than a day and may extend up to a week for teams with large infrastructures running critical services. The possibility of losing progress in deployments when individual steps may last for hours is detrimental to the improvement of Engineering efficiency in the organisation. The deployment automation platform is the bedrock of that confidence. If the platform itself fails regularly or does provide a path of upgrade that is transparent to end-users, any features built on top of it would suffer from these downtimes and ultimately erode confidence in deployments.

Tailored To Most But Extensible For The Few

Our backend engineering teams are working on diverse stacks, and so are their deployment processes. Right from the start, we wanted our product to benefit the largest population of engineers that had adopted the same process, so as to maximize returns on our investments. To ease adoption, we decided to tailor a deployment pipeline such that:

It would model the exact sequence of manual processes followed by this population of engineers.
Switching to use that pipeline should require as little work as possible by service teams.

However, in cases where this model would not fit a team’s specific process, our deployment platform should be open and extensible and support new customizations even when they are not originally supported by the product’s ecosystem.

Cloud-Agnosticity

While we were going to target a specific process and team, to ensure that our solution would stand the test of time, we needed to ensure that our solution would support the variety of environments currently used in production. This variety was also likely to increase, and we wanted a platform that would mature together with the rest of our ecosystem.

Overview Of Conveyor

Setting Sail With Spinnaker

Conveyor is based on Spinnaker, an open-source, multi-cloud continuous delivery platform. We’ve chosen Spinnaker over other platforms because it is a mature deployment platform with no single point of failure, supports complex workflows (referred to as pipelines in Spinnaker), and already supports a large array of cloud providers. Since Spinnaker is open-source and extensible, it allowed us to add the features we needed for the specificity of our ecosystem.

To further ease adoption within our organization, we built a tailored user interface and created our own domain-specific language (DSL) to manage its pipelines as code.

Outline of Conveyor's Architecture — *Outline of Conveyor’s Architecture*

Onboarding To A Simpler Interface

Spinnaker comes with its own interface, it has all the features an engineer would want from an advanced continuous delivery system. However, Spinnaker interface is vastly different from Jenkins and makes for a steep learning curve.

To reduce our barrier to adoption, we decided early on to create a simple interface for our users. In this interface, deployment pipelines take the center stage of our application. Pipelines are objects managed by Spinnaker, they model the different steps in the workflow of each deployment. Each pipeline is made up of stages that can be assembled like lego-bricks to form the final pipeline. An instance of a pipeline is called an execution.

Conveyor dashboard. Sensitive information like authors and service names are redacted. — *Conveyor Dashboard*

With this interface, each engineer can focus on what matters to them immediately: the pipeline they have started, or those started by other teammates working on the same services as they are. Conveyor also provides a search bar (on the top) and filters (on the left) that work in concert to explore all pipelines executed at Grab.

We adopted a consistent set of colours to model all information in our interface:

blue: represent stages that are currently running;
red: stages that have failed or important information;
yellow: stages that require human interaction;
and finally, in green: stages that were successfully completed.

Conveyor also provides a task and notifications area, where all stages requiring human intervention are listed in one location. Manual interactions are often no more than just YES or NO questions:

Conveyor tasks. Sensitive information like author/service names is redacted. — *Conveyor Tasks*

Finally, in addition to supporting automated deployments, we greatly simplified the start of manual deployments. Instead of being required to copy/paste information, each parameter can be selected on the interface from a set of predefined items, sorted chronologically, and presented with contextual information to help engineers in their decision.

Several parameters are required for our deployments and their values are selected from the UI to ensure correctness.

Simplified manual deployments — *Simplified Manual Deployments*

Ease Of Adoption With Our Pipeline-As-Code DSL

Ease of adoption for the team is not simply about the learning curve of the new tools. We needed to make it easy for teams to configure their services to deploy with Conveyor. Since we focused on automating tasks that were already performed manually, we needed only to configure the layer that would enable the integration.

We set on creating a pipeline-as-code implementation when none were widely being developed in the Spinnaker community. It’s interesting to see that two years on, this idea has grown in parallel in the community, with the birth of other pipeline-as-code implementations. Our pipeline-as-code is referred to as the Pipeline DSL, and its configuration is located inside each team’s repository. Artificer is the name of our Pipeline DSL interpreter and it runs with every change inside our monorepository:

Pipelines are being updated at every commit if necessary.

Creating a conveyor.jsonnet file inside with the service’s directory of our monorepository with the few lines below is all that’s required for Artificer to do its work and get the benefits of automation provided by Conveyor’s pipeline:

local default = import 'default.libsonnet';
[
 {
 name: "service-name",
 group: [
 "group-name",
 ]
 }
]

Sample minimal conveyor.jsonnet configuration to onboard services.

In this file, engineers simply specify the name of their service and the group that a user should belong to, to have deployment rights for the service.

Once the build is completed, teams can log in to Conveyor and start manual deployments of their services with our pipelines. Three pipelines are provided by default: the integration pipeline used for tests and developments, the staging pipeline used for pre-production tests, and the production pipeline for production deployment.

Thanks to the simplicity of this minimal configuration file, we were able to generate these configuration files for all existing services of our monorepository. This resulted in the automatic onboarding of a large number of teams and was a major contributing factor to the adoption of Conveyor throughout our organisation.

Our Journey To Engineering Efficiency (for backend services)

The sections below relate some of the improvements in engineering efficiency we’ve delivered since Conveyor’s inception. They were not made precisely in this order but for readability, they have been mapped to each step of the software development lifecycle.

Automate Deployments at Build Time

Continuous delivery begins with a pushed code commit in our trunk-based development flow. Whenever a developer pushes changes onto their development branch or onto the trunk, a continuous integration job is triggered on Jenkins. The products of this job (binaries, docker images, etc) are all uploaded into our artefact repositories. We’ve made two additions to our continuous integration process.

The first modification happens at the step “Upload & Register artefacts”. At this step, each artefact created is now registered in Conveyor with its associated metadata. When and if an engineer needs to trigger a deployment manually, Conveyor can display the list of versions to choose from, eliminating the need for error-prone manual inputs:

Each selectable version shows contextual information: title, author, version and link to the code change where it originated. During registration, the commit time is also recorded and used to order entries chronologically in the interface. To ensure this integration is not a single point of failure for deployments, manual input is still available optionally.

The second modification implements one of the essential feature continuous delivery: your deployments should happen often, automatically. Engineers are now given the possibility to start automatic deployments once continuous integration has successfully completed, by simply modifying their project’s continuous integration settings:

 "AfterBuild": [
  {
      "AutoDeploy": {
      "OnDiff": false,
      "OnLand": true
    }
    "TYPE": "conveyor"
  }
 ],

Sample settings needed to trigger auto-deployments. ‘Diff’ refers to code review submissions, and ‘Land’ refers to merged code changes.

Staging Pipeline

Before deploying a new artefact to a service in production, changes are validated on the staging environment. During the staging deployment, we verify that canary (one-box) deployments and full deployments with automated smoke and functional tests suites.

We start by acquiring a deployment lock for this service and this environment. This prevents another deployment of the same service on the same environment to happen concurrently, other deployments will be waiting in a FIFO queue until the lock is released.

The stage “Compute Changeset” ensures that the deployment is not a rollback. It verifies that the new version deployed does not correspond to a rollback by comparing the ancestry of the commits provided during the artefact registration at build time: since we automate deployments after the build process has completed, cases of rollback may occur when two changes are created in quick succession and the latest build completes earlier than the older one.

After the stage “Deploy Canary” has completed, smoke test run. There are three kinds of tests executed at different stages of the pipeline: smoke, functional and security tests. Smoke tests directly reach the canary instance’s endpoint, by-passing load-balancers. If the smoke tests fail, the canary is immediately rolled back and this deployment is terminated.

All tests are generated from the same builds as the artefact being tested and their versions must match during testing. To ensure that the right version of the test run and distinguish between the different kind of tests to perform, we provide additional metadata that will be passed by Conveyor to the tests system, known internally as Gandalf:

local default = import 'default.libsonnet';
[
  {
    name: "service-name",
    group: [
    "group-name",
    ],
    gandalf_smoke_tests: [
    {
        path: "repo.internal/path/to/my/smoke/tests"
      }
      ],
      gandalf_functional_tests: [
      {
        path: "repo.internal/path/to/my/functional/tests"
      }
      gandalf_security_tests: [
      {
        path: "repo.internal/path/to/my/security/tests"
      }
      ]
    }
]

Sample conveyor.jsonnet configuration with integration tests added.

Additionally, in parallel to the execution of the smoke tests, the canary is also being monitored from the moment its deployment has completed and for a predetermined duration. We leverage our integration with Datadog to allow engineers to select the alerts to monitor. If an alert is triggered during the monitoring period, and while the tests are executed, the canary is again rolled back, and the pipeline is terminated. Engineers can specify the alerts by adding them to the conveyor.jsonnet configuration file together with the monitoring duration:

local default = import 'default.libsonnet';
[
 {
   name: "service-name",
   group: [
   "group-name",
   ],
    gandalf_smoke_tests: [
    {
      path: "repo.internal/path/to/my/smoke/tests"
   }
   ],
   gandalf_functional_tests: [
   {
   path: "repo.internal/path/to/my/functional/tests"
  }
     gandalf_security_tests: [
     {
     path: "repo.internal/path/to/my/security/tests"
     }
     ],
     monitor: {
     stg: {
     duration_seconds: 300,
     alarms: [
     {
   type: "datadog",
   alert_id: 12345678
   },
   {
   type: "datadog",
   alert_id: 23456789
      }
      ]
      }
    }
  }
]

Sample conveyor.jsonnet configuration with alerts in staging added.

When the smoke tests and monitor pass and the deployment of new artefacts is completed, the pipeline execution triggers functional and security tests. Unlike smoke tests, functional & security tests run only after that step, as they communicate with the cluster through load-balancers, impersonating other services.

Before releasing the lock, release notes are generated to inform engineers of the delta of changes between the version they just released and the one currently running in production. Once the lock is released, the stage “Check Policies” verifies that the parameters and variable of the deployment obeys a specific set of criteria, for example: if its service metadata is up-to-date in our service inventory, or if the base image used during deployment is sufficiently recent.

Here’s how the policy stage, the engine, and the providers interact with each other:

In Spinnaker, each event of a pipeline’s execution updates the pipeline’s state in the database. The current state of the pipeline can be fetched by its API as a single JSON document, describing all information related to its execution: including its parameters, the contextual information related to each stage or even the response from the various interfacing components. The role of our “Policy Check” stage is to query this JSON representation of the pipeline, to extract and transform the variables which are forwarded to our policy engine for validation. Our policy engine gathers judgements passed by different policy providers. If the validation by the policy engine fails, the deployment is not rolled back this time; however, promotion to production is not possible and the pipeline is immediately terminated.

The journey through staging deployment finally ends with the stage “Register Deployment”. This stage registers that a successful deployment was made in our staging environment as an artefact. Similarly to the policy check above, certain parameters of the deployment are picked up and consolidated into this document. We use this kind of artefact as proof for upcoming production deployment.

Continuing Our Journey to Engineering Efficiency

With the advancements made in continuous integration and deployment to staging, Conveyor has reduced the efforts needed by our engineers to just three clicks in its interface, when automated deployment is used. Even when the deployment is triggered manually, Conveyor gives the assurance that the parameters selected are valid and it does away with copy/pasting and human interactions across heterogeneous tools.

In the sequel to this blog post, we’ll dive into the improvements that we’ve made to our production deployments and introduce a crucial concept that led to the creation of our proof for successful staging deployment. Finally, we’ll cover the impact that Conveyor had on the continuous delivery of our backend services, by comparing our deployment velocity when we started two years ago versus where we are today.

All these improvements in efficiency for our engineers would never have been possible without the hard work of all team members involved in the project, past and present: Evan Sebastian, Tanun Chalermsinsuwan, Aufar Gilbran, Deepak Ramakrishnaiah, Repon Kumar Roy (Kowshik), Su Han, Voislav Dimitrijevikj, Qijia Wang, Oscar Ng, Jacob Sunny, Subhodip Mandal, and many others who have contributed and collaborated with them.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Cross-account and cross-region deployment using GitHub actions and AWS CDK

2020-09-15 DAMODAR SHENVI WAGLE

Post Syndicated from DAMODAR SHENVI WAGLE original https://aws.amazon.com/blogs/devops/cross-account-and-cross-region-deployment-using-github-actions-and-aws-cdk/

GitHub Actions is a feature on GitHub’s popular development platform that helps you automate your software development workflows in the same place you store code and collaborate on pull requests and issues. You can write individual tasks called actions, and combine them to create a custom workflow. Workflows are custom automated processes that you can set up in your repository to build, test, package, release, or deploy any code project on GitHub.

A cross-account deployment strategy is a CI/CD pattern or model in AWS. In this pattern, you have a designated AWS account called tools, where all CI/CD pipelines reside. Deployment is carried out by these pipelines across other AWS accounts, which may correspond to dev, staging, or prod. For more information about a cross-account strategy in reference to CI/CD pipelines on AWS, see Building a Secure Cross-Account Continuous Delivery Pipeline.

In this post, we show you how to use GitHub Actions to deploy an AWS Lambda-based API to an AWS account and Region using the cross-account deployment strategy.

Using GitHub Actions may have associated costs in addition to the cost associated with the AWS resources you create. For more information, see About billing for GitHub Actions.

Prerequisites

Before proceeding any further, you need to identify and designate two AWS accounts required for the solution to work:

Tools – Where you create an AWS Identity and Access Management (IAM) user for GitHub Actions to use to carry out deployment.
Target – Where deployment occurs. You can call this as your dev/stage/prod environment.

You also need to create two AWS account profiles in ~/.aws/credentials for the tools and target accounts, if you don’t already have them. These profiles need to have sufficient permissions to run an AWS Cloud Development Kit (AWS CDK) stack. They should be your private profiles and only be used during the course of this use case. So, it should be fine if you want to use admin privileges. Don’t share the profile details, especially if it has admin privileges. I recommend removing the profile when you’re finished with this walkthrough. For more information about creating an AWS account profile, see Configuring the AWS CLI.

Solution overview

You start by building the necessary resources in the tools account (an IAM user with permissions to assume a specific IAM role from the target account to carry out deployment). For simplicity, we refer to this IAM role as the cross-account role, as specified in the architecture diagram.

You also create the cross-account role in the target account that trusts the IAM user in the tools account and provides the required permissions for AWS CDK to bootstrap and initiate creating an AWS CloudFormation deployment stack in the target account. GitHub Actions uses the tools account IAM user credentials to the assume the cross-account role to carry out deployment.

In addition, you create an AWS CloudFormation execution role in the target account, which AWS CloudFormation service assumes in the target account. This role has permissions to create your API resources, such as a Lambda function and Amazon API Gateway, in the target account. This role is passed to AWS CloudFormation service via AWS CDK.

You then configure your tools account IAM user credentials in your Git secrets and define the GitHub Actions workflow, which triggers upon pushing code to a specific branch of the repo. The workflow then assumes the cross-account role and initiates deployment.

The following diagram illustrates the solution architecture and shows AWS resources across the tools and target accounts.

Architecture diagram

Creating an IAM user

You start by creating an IAM user called git-action-deployment-user in the tools account. The user needs to have only programmatic access.

Clone the GitHub repo aws-cross-account-cicd-git-actions-prereq and navigate to folder tools-account. Here you find the JSON parameter file src/cdk-stack-param.json, which contains the parameter CROSS_ACCOUNT_ROLE_ARN, which represents the ARN for the cross-account role we create in the next step in the target account. In the ARN, replace <target-account-id> with the actual account ID for your designated AWS target account.
Run deploy.sh by passing the name of the tools AWS account profile you created earlier. The script compiles the code, builds a package, and uses the AWS CDK CLI to bootstrap and deploy the stack. See the following code:

cd aws-cross-account-cicd-git-actions-prereq/tools-account/
./deploy.sh "<AWS-TOOLS-ACCOUNT-PROFILE-NAME>"

You should now see two stacks in the tools account: CDKToolkit and cf-GitActionDeploymentUserStack. AWS CDK creates the CDKToolkit stack when we bootstrap the AWS CDK app. This creates an Amazon Simple Storage Service (Amazon S3) bucket needed to hold deployment assets such as a CloudFormation template and Lambda code package. cf-GitActionDeploymentUserStack creates the IAM user with permission to assume git-action-cross-account-role (which you create in the next step). On the Outputs tab of the stack, you can find the user access key and the AWS Secrets Manager ARN that holds the user secret. To retrieve the secret, you need to go to Secrets Manager. Record the secret to use later.

Stack that creates IAM user with its secret stored in secrets manager

Creating a cross-account IAM role

In this step, you create two IAM roles in the target account: git-action-cross-account-role and git-action-cf-execution-role.

git-action-cross-account-role provides required deployment-specific permissions to the IAM user you created in the last step. The IAM user in the tools account can assume this role and perform the following tasks:

Upload deployment assets such as the CloudFormation template and Lambda code package to a designated S3 bucket via AWS CDK
Create a CloudFormation stack that deploys API Gateway and Lambda using AWS CDK

AWS CDK passes git-action-cf-execution-role to AWS CloudFormation to create, update, and delete the CloudFormation stack. It has permissions to create API Gateway and Lambda resources in the target account.

To deploy these two roles using AWS CDK, complete the following steps:

In the already cloned repo from the previous step, navigate to the folder target-account. This folder contains the JSON parameter file cdk-stack-param.json, which contains the parameter TOOLS_ACCOUNT_USER_ARN, which represents the ARN for the IAM user you previously created in the tools account. In the ARN, replace <tools-account-id> with the actual account ID for your designated AWS tools account.
Run deploy.sh by passing the name of the target AWS account profile you created earlier. The script compiles the code, builds the package, and uses the AWS CDK CLI to bootstrap and deploy the stack. See the following code:

cd ../target-account/
./deploy.sh "<AWS-TARGET-ACCOUNT-PROFILE-NAME>"

You should now see two stacks in your target account: CDKToolkit and cf-CrossAccountRolesStack. AWS CDK creates the CDKToolkit stack when we bootstrap the AWS CDK app. This creates an S3 bucket to hold deployment assets such as the CloudFormation template and Lambda code package. The cf-CrossAccountRolesStack creates the two IAM roles we discussed at the beginning of this step. The IAM role git-action-cross-account-role now has the IAM user added to its trust policy. On the Outputs tab of the stack, you can find these roles’ ARNs. Record these ARNs as you conclude this step.

Stack that creates IAM roles to carry out cross account deployment

Configuring secrets

One of the GitHub actions we use is aws-actions/configure-aws-credentials@v1. This action configures AWS credentials and Region environment variables for use in the GitHub Actions workflow. The AWS CDK CLI detects the environment variables to determine the credentials and Region to use for deployment.

For our cross-account deployment use case, aws-actions/configure-aws-credentials@v1 takes three pieces of sensitive information besides the Region: AWS_ACCESS_KEY_ID, AWS_ACCESS_KEY_SECRET, and CROSS_ACCOUNT_ROLE_TO_ASSUME. Secrets are recommended for storing sensitive pieces of information in the GitHub repo. It keeps the information in an encrypted format. For more information about referencing secrets in the workflow, see Creating and storing encrypted secrets.

Before we continue, you need your own empty GitHub repo to complete this step. Use an existing repo if you have one, or create a new repo. You configure secrets in this repo. In the next section, you check in the code provided by the post to deploy a Lambda-based API CDK stack into this repo.

On the GitHub console, navigate to your repo settings and choose the Secrets tab.
Add a new secret with name as TOOLS_ACCOUNT_ACCESS_KEY_ID.
Copy the access key ID from the output OutGitActionDeploymentUserAccessKey of the stack GitActionDeploymentUserStack in tools account.
Enter the ID in the Value field.
Repeat this step to add two more secrets:

- TOOLS_ACCOUNT_SECRET_ACCESS_KEY (value retrieved from the AWS Secrets Manager in tools account)
- CROSS_ACCOUNT_ROLE (value copied from the output OutCrossAccountRoleArn of the stack cf-CrossAccountRolesStack in target account)

You should now have three secrets as shown below.

All required git secrets

Deploying with GitHub Actions

As the final step, first clone your empty repo where you set up your secrets. Download and copy the code from the GitHub repo into your empty repo. The folder structure of your repo should mimic the folder structure of source repo. See the following screenshot.

Folder structure of the Lambda API code

We can take a detailed look at the code base. First and foremost, we use Typescript to deploy our Lambda API, so we need an AWS CDK app and AWS CDK stack. The app is defined in app.ts under the repo root folder location. The stack definition is located under the stack-specific folder src/git-action-demo-api-stack. The Lambda code is located under the Lambda-specific folder src/git-action-demo-api-stack/lambda/ git-action-demo-lambda.

We also have a deployment script deploy.sh, which compiles the app and Lambda code, packages the Lambda code into a .zip file, bootstraps the app by copying the assets to an S3 bucket, and deploys the stack. To deploy the stack, AWS CDK has to pass CFN_EXECUTION_ROLE to AWS CloudFormation; this role is configured in src/params/cdk-stack-param.json. Replace <target-account-id> with your own designated AWS target account ID.

Update cdk-stack-param.json in git-actions-cross-account-cicd repo with TARGET account id

Finally, we define the Git Actions workflow under the .github/workflows/ folder per the specifications defined by GitHub Actions. GitHub Actions automatically identifies the workflow in this location and triggers it if conditions match. Our workflow .yml file is named in the format cicd-workflow-<region>.yml, where <region> in the file name identifies the deployment Region in the target account. In our use case, we use us-east-1 and us-west-2, which is also defined as an environment variable in the workflow.

The GitHub Actions workflow has a standard hierarchy. The workflow is a collection of jobs, which are collections of one or more steps. Each job runs on a virtual machine called a runner, which can either be GitHub-hosted or self-hosted. We use the GitHub-hosted runner ubuntu-latest because it works well for our use case. For more information about GitHub-hosted runners, see Virtual environments for GitHub-hosted runners. For more information about the software preinstalled on GitHub-hosted runners, see Software installed on GitHub-hosted runners.

The workflow also has a trigger condition specified at the top. You can schedule the trigger based on the cron settings or trigger it upon code pushed to a specific branch in the repo. See the following code:

name: Lambda API CICD Workflow
# This workflow is triggered on pushes to the repository branch master.
on:
  push:
    branches:
      - master

# Initializes environment variables for the workflow
env:
  REGION: us-east-1 # Deployment Region

jobs:
  deploy:
    name: Build And Deploy
    # This job runs on Linux
    runs-on: ubuntu-latest
    steps:
      # Checkout code from git repo branch configured above, under folder $GITHUB_WORKSPACE.
      - name: Checkout
        uses: actions/checkout@v2
      # Sets up AWS profile.
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.TOOLS_ACCOUNT_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.TOOLS_ACCOUNT_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.REGION }}
          role-to-assume: ${{ secrets.CROSS_ACCOUNT_ROLE }}
          role-duration-seconds: 1200
          role-session-name: GitActionDeploymentSession
      # Installs CDK and other prerequisites
      - name: Prerequisite Installation
        run: |
          sudo npm install -g [email protected]
          cdk --version
          aws s3 ls
      # Build and Deploy CDK application
      - name: Build & Deploy
        run: |
          cd $GITHUB_WORKSPACE
          ls -a
          chmod 700 deploy.sh
          ./deploy.sh

For more information about triggering workflows, see Triggering a workflow with events.

We have configured a single job workflow for our use case that runs on ubuntu-latest and is triggered upon a code push to the master branch. When you create an empty repo, master branch becomes the default branch. The workflow has four steps:

Check out the code from the repo, for which we use a standard Git action actions/checkout@v2. The code is checked out into a folder defined by the variable $GITHUB_WORKSPACE, so it becomes the root location of our code.
Configure AWS credentials using aws-actions/configure-aws-credentials@v1. This action is configured as explained in the previous section.
Install your prerequisites. In our use case, the only prerequisite we need is AWS CDK. Upon installing AWS CDK, we can do a quick test using the AWS Command Line Interface (AWS CLI) command aws s3 ls. If cross-account access was successfully established in the previous step of the workflow, this command should return a list of buckets in the target account.
Navigate to root location of the code $GITHUB_WORKSPACE and run the deploy.sh script.

You can check in the code into the master branch of your repo. This should trigger the workflow, which you can monitor on the Actions tab of your repo. The commit message you provide is displayed for the respective run of the workflow.

Workflow for region us-east-1 Workflow for region us-west-2

You can choose the workflow link and monitor the log for each individual step of the workflow.

Git action workflow steps

In the target account, you should now see the CloudFormation stack cf-GitActionDemoApiStack in us-east-1 and us-west-2.

Lambda API stack in us-east-1 Lambda API stack in us-west-2

The API resource URL DocUploadRestApiResourceUrl is located on the Outputs tab of the stack. You can invoke your API by choosing this URL on the browser.

API Invocation Output

Clean up

To remove all the resources from the target and tools accounts, complete the following steps in their given order:

Delete the CloudFormation stack cf-GitActionDemoApiStack from the target account. This step removes the Lambda and API Gateway resources and their associated IAM roles.
Delete the CloudFormation stack cf-CrossAccountRolesStack from the target account. This removes the cross-account role and CloudFormation execution role you created.
Go to the CDKToolkit stack in the target account and note the BucketName on the Output tab. Empty that bucket and then delete the stack.
Delete the CloudFormation stack cf-GitActionDeploymentUserStack from tools account. This removes cross-account-deploy-user IAM user.
Go to the CDKToolkit stack in the tools account and note the BucketName on the Output tab. Empty that bucket and then delete the stack.

Security considerations

Cross-account IAM roles are very powerful and need to be handled carefully. For this post, we strictly limited the cross-account IAM role to specific Amazon S3 and CloudFormation permissions. This makes sure that the cross-account role can only do those things. The actual creation of Lambda, API Gateway, and Amazon DynamoDB resources happens via the AWS CloudFormation IAM role, which AWS CloudFormation assumes in the target AWS account.

Make sure that you use secrets to store your sensitive workflow configurations, as specified in the section Configuring secrets.

Conclusion

In this post we showed how you can leverage GitHub’s popular software development platform to securely deploy to AWS accounts and Regions using GitHub actions and AWS CDK.

Build your own GitHub Actions CI/CD workflow as shown in this post.

About the author

Damodar Shenvi Wagle is a Cloud Application Architect at AWS Professional Services. His areas of expertise include architecting serverless solutions, ci/cd and automation.

Speed and Stability: Yahoo Mail’s Forward-Thinking Continuous Integration and Delivery Pipeline

2017-06-27 mikesefanov

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/162320459636

By Mohit Goenka, Senior Engineering Manager

Building the technology powering the best consumer email inbox in the world is no easy task. When you start on such a journey, it is important to consider how to deliver such an experience to the users. After all, any consumer feature we build can only make a difference after it is delivered to everyone via the tech pipeline.

As we began building out the new version of Yahoo Mail, we wanted to ensure that our internal developer productivity would not be hindered by how our pipelines work. Keeping this in mind, we identified the following principles as most important while designing the delivery pipeline for the new Yahoo Mail experience:

Product updates are pushed at regular intervals
Releases are stable
Builds are not blocked by irrational test failures
Developers are notified of code pushes
Hotfixes
Rollbacks
Heartbeat pushes

Product updates are pushed at regular intervals

We ensure that our engineers can push any code changes to all Mail users everyday, with the ability to push multiple times a day, if necessary or desired. This is possible because of the time we spent building a solid testing infrastructure, which continues to evolve as we scale to new users and add new features to the product. Every one of our builds runs 10,000+ unit tests and 5,000+ integration tests on various combinations of operating systems and browsers. It is important to push product updates regularly as it allows all our users to get the best Mail experience possible.

Releases are stable

Every code release starts with the company’s internal audience first, where all our employees get to try out the latest changes before they go out to production. This begins with our alpha and beta environments that our Mail engineers use by default. Our build then goes out to the canary environment, which is a small subset of production users, before making it to all users. This gives us the ability to analyze quality metrics on internal and canary servers before rolling the build out to 100% of users in production. Once we go through this process, the code pushed to all our users is thoroughly baked and tested.

Builds are not blocked by irrational test failures

Running tests using web drivers on multiple browsers, as is standard when testing frontend code, comes with the problem of tests irrationally failing. As part the Yahoo Mail continuous delivery pipeline, we employ various novel strategies to recover from such failures. One such strategy is recording the data related to failed tests in the first pass of a build, and then rerunning only the failed tests in the subsequent passes. This is achieved by creating a metadata file that stores all our build-related information. As part of this process, a new bundle is created with a new set of code changes. Once a bundle is created with build metadata information, the same build job can be rerun multiple times such that subsequent reruns would only run the failing tests. This significantly improves rerun times and eliminates the chances of build detentions introduced by irrational test failures. The recorded test information is analyzed independently to understand the pattern of failing tests. This helps us in improving the stability of those intermittently failing tests.

Developers are notified of code pushes

Our build and deployment pipelines collect data related to all the authors contributing to any release through code commits or by merging various pull requests. This enables the build pipeline to send out email notifications to all our Mail developers as their code flows through each environment in our build pipeline (alpha, beta, canary, and production). With this ability, developers are well aware of where their code is in the pipeline and can test their changes as needed.

Hotfixes

We have also created a pipeline to deploy major code fixes directly to production. This is needed even after the existence of tens of thousands of tests and multitudes of checks. Every now and then, a bug may make its way into production. For such instances, we have hotfixes that are very useful. These are code patches that we quickly deploy on top of production code to address critical issues impacting large sets of users.

Rollbacks

If we find any issues in production, we do our best to minimize the impact on users by swiftly utilizing rollbacks, ensuring there is zero to minimal impact time. In order to do rollbacks, we maintain lists of all the versions pushed to production along with their release bundles and change logs. If needed, we pick the stable version that was previously pushed to production and deploy it directly on all the machines running our production instance.

Heartbeat pushes

As part of our continuous delivery efforts, we have also developed a concept we call heartbeat pushes. Heartbeat pushes are notifications we send users to refresh their browsers when we issue important builds that they should immediately adopt. These can include bug fixes, product updates, or new features. Heartbeat allows us to dynamically update the latest version of Yahoo Mail when we see that a user’s current version needs to be updated.

Yahoo Mail Continuous Delivery Flow

In building the new Yahoo Mail experience, we knew that we needed to revamp from the ground up, starting with our continuous integration and delivery pipeline. The guiding principles of our new, forward-thinking infrastructure allow us to deliver new features and code fixes at a very high launch velocity and ensure that our users are always getting the latest and greatest Yahoo Mail experience.