Tag Archives: Docker

Reduce Docker image build time on AWS CodeBuild using Amazon ECR as a remote cache

Post Syndicated from Kirubakaran Sundaramoorthy original https://aws.amazon.com/blogs/devops/reduce-docker-image-build-time-on-aws-codebuild-using-amazon-ecr-as-a-remote-cache/

In modern software development, containerization with Docker has revolutionized how we build and deploy applications. While Docker enables packaging applications into portable containers, the continuous need to update these images can be resource intensive. AWS CodeBuild addresses this challenge by providing a managed build service that eliminates infrastructure maintenance overhead. In this blog post, we’ll explore how AWS CodeBuild integration with Amazon Elastic Container Registry (Amazon ECR) as a cache backend can significantly accelerate our Docker image build process, making development more efficient and streamlined.

AWS CodeBuild creates isolated environments for each build, which means build artifacts cannot be permanently stored on the host system. While CodeBuild does offer a native local caching feature, it provides only temporary storage and is most effective for builds that occur in quick succession.

This local caching mechanism, however, is not reliable when builds are triggered at varying intervals, as it operates on a best-effort basis. To address this limitation, we recommend using Amazon Elastic Container Registry as a persistent cache for Docker layers. This solution offers several advantages:

  • It provides a reliable, long-term storage solution for build caches
  • The cached layers can be reused across multiple builds regardless of timing
  • The cache remains valid and accessible at any point in time

This post shows how to implement a simple, effective, and durable Docker layer cache for CodeBuild, using an Amazon ECR repository as the cache backend to significantly reduce image build runtime.

Solution Overview

The following diagram illustrates the high-level architecture of this solution. We describe implementing each stage in more detail in the following paragraphs.

Figure 1: Solution Flow Diagram

To use an Amazon ECR repository as a cache backend, the builder must be able to export cache to a registry. The default Docker build driver can do this only when the containerd image store is enabled, and that feature is not enabled in the default configuration. Therefore, we use the docker buildx command to create a new builder that uses the docker-container driver, which supports the registry cache backend (the template names this builder “containerd”).

When CodeBuild runs for the first time, it will attempt to retrieve cache data from the Amazon ECR repository. Since this is the first run, no cache will be available. CodeBuild will then proceed to build the Docker image from scratch, generate cache data during this initial build and export both the newly built image and its associated cache to the Amazon ECR repository.

In each subsequent build, CodeBuild will import the previously stored cache from Amazon ECR. This cached data will be used to speed up the image building process, as only the changed layers will need to be rebuilt. Finally, the updated cache and image will be stored back in Amazon ECR.

Prerequisites

Before we begin the walk-through, we must have an AWS account. If you don’t have one, sign up at https://aws.amazon.com.

Walk-through

Launch the following AWS CloudFormation template to create an Amazon ECR repository and an AWS CodeBuild project, including the CodeBuild service role and an inline policy that grants the required permissions.

AWSTemplateFormatVersion: "2010-09-09"

Description: 'AWS CloudFormation template that creates the infrastructure to demonstrate using Amazon ECR as a remote cache for AWS CodeBuild'

Parameters:
  CodeBuildProjectName:
    Type: String
    Default: CBECRCacheDemoProject
    Description: "Enter name for your CodeBuild project"
  CodeBuildServiceRolePolicyName:
    Type: String
    Default: CodeBuildDockerCachePolicy
    Description: "Enter name for the IAM policy"
  ECRRepoName:
    Type: String
    Default: amazon_linux_codebuild_image
    Description: "Enter name for Amazon ECR repository"
  GitHubLocation:
    Type:  String
    Default: "https://github.com/aws/aws-codebuild-docker-images"
    Description: "Enter your source code GitHub URL"
  ImageTag:
    Type: String
    Default: demo
    Description: "Enter Tag name for your application docker image"
  CacheTag:
    Type: String
    Default: demo-cache
    Description: "Enter tag name for the cache image"

Resources:

  CodeBuildServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument: |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": "codebuild.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
            }
          ]
        }
      Path: /

  CodeBuildServiceRolePolicy:
    Type: AWS::IAM::RolePolicy
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - ecr:BatchGetImage
              - ecr:BatchCheckLayerAvailability
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload
              - ecr:PutImage
              - ecr:GetDownloadUrlForLayer
            Resource: !GetAtt ECRRepository.Arn
          - Effect: Allow
            Action: 
              - ecr:GetAuthorizationToken
            Resource: '*'
          - Effect: Allow
            Action:
              - codeconnections:UseConnection
              - codeconnections:GetConnectionToken
              - codeconnections:GetConnection
              - codestar-connections:GetConnectionToken
              - codestar-connections:GetConnection
            Resource: '*'
          - Effect: Allow
            Action: 
              - logs:CreateLogStream
              - logs:CreateLogGroup
              - logs:PutLogEvents
            Resource: 
              - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/${CodeBuildProjectName}'
              - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/${CodeBuildProjectName}:*'
          - Effect: Allow
            Action:
              - s3:PutObject
              - s3:GetObject
              - s3:GetObjectVersion
              - s3:GetBucketAcl
              - s3:GetBucketLocation
            Resource: 
              - !Sub 'arn:${AWS::Partition}:s3:::codepipeline-${AWS::Region}-*'
      PolicyName: !Ref CodeBuildServiceRolePolicyName
      RoleName: !Ref CodeBuildServiceRole

  ECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Ref ECRRepoName
      ImageScanningConfiguration:
        ScanOnPush: true

  CodeBuildProject:
    Type: AWS::CodeBuild::Project
    Properties:
      Name: !Ref CodeBuildProjectName
      Source:
        Type: GITHUB
        Location: !Ref GitHubLocation
        BuildSpec: !Sub |
          version: 0.2

          phases:
            install:
             commands:
               - docker buildx create --name containerd --driver=docker-container --driver-opt default-load=true

            pre_build:
             commands:
               - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${AWS::AccountId}.dkr.ecr.$AWS_REGION.amazonaws.com

            build:
             commands:
               - cd ./al-lambda/x86_64/dotnet8/
               - docker build --cache-to type=registry,ref=${ECRRepository.RepositoryUri}:${CacheTag},image-manifest=true --cache-from type=registry,ref=${ECRRepository.RepositoryUri}:${CacheTag} --tag ${ECRRepository.RepositoryUri}:${ImageTag} --builder=containerd .

            post_build:
             commands:
               - docker push ${ECRRepository.RepositoryUri}:${ImageTag}
      ServiceRole: !GetAtt CodeBuildServiceRole.Arn
      Artifacts:
        Type: NO_ARTIFACTS
      Environment:
        Type: LINUX_CONTAINER
        Image: aws/codebuild/amazonlinux-x86_64-standard:5.0
        ComputeType: BUILD_GENERAL1_SMALL
        PrivilegedMode: true
      Cache:
        Type: LOCAL
        Modes:
          - LOCAL_DOCKER_LAYER_CACHE

Specify the following parameters while creating the CloudFormation stack (see Figure 2):

  1. Set the cache image tag (CacheTag) to “demo-cache”.
  2. Name the CodeBuild project (CodeBuildProjectName) as “CBECRCacheDemoProject”.
  3. Specify the IAM policy name (CodeBuildServiceRolePolicyName) as “CodeBuildDockerCachePolicy”.
  4. Define the ECR repository name (ECRRepoName) as “amazon_al_lambda_codebuild_image”.
  5. Enter the GitHub repository URL (GitHubLocation) as “https://github.com/aws/aws-codebuild-docker-images”.
  6. Set the Docker image tag (ImageTag) to “demo”.

Figure 2: CloudFormation Stack parameter

The CloudFormation stack will set up a comprehensive development environment for our project. It will create a CodeBuild project equipped with all necessary IAM roles and permissions, ensuring smooth and secure build processes. Additionally, the stack will create an Amazon ECR repository that is configured to automatically scan Docker images for vulnerabilities when they are pushed. The repository will serve as a secure storage location for both our Docker images and cache images.

The CodeBuild project is created with a buildspec that instructs CodeBuild to do the following:

  • Create a new builder called “containerd” with buildx, because the default Docker driver supports the registry cache backend only when the containerd image store is enabled.
  • Authenticate with Amazon ECR, which is required to pull and push both Docker images and cache.
  • Build the image with the following options:
    • The --cache-from parameter makes Docker check for and use any existing cache in the repository.
    • The image-manifest option is set to true to enable cache storage in the Amazon ECR repository.
    • The --cache-to parameter pushes or updates the cache in the Amazon ECR repository.
  • Push the image to Amazon ECR in the post-build phase. The cache itself is uploaded automatically as part of the docker build command.
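
For readers who want to try the same steps outside CodeBuild, the buildspec phases above can be sketched as a small shell function. This is a minimal sketch rather than the post's own script: the account ID, Region, and the function name build_with_ecr_cache are placeholder assumptions, while the docker and aws invocations mirror the buildspec.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your own account, Region, and names.
ACCOUNT_ID="123456789012"
REGION="us-east-1"
REPO_NAME="amazon_al_lambda_codebuild_image"
IMAGE_TAG="demo"
CACHE_TAG="demo-cache"

REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"
REPO_URI="${REGISTRY}/${REPO_NAME}"

# Hypothetical helper that mirrors the install, pre_build, build, and post_build phases.
build_with_ecr_cache() {
  # install: create a builder backed by the docker-container driver.
  # (This step fails if a builder named "containerd" already exists.)
  docker buildx create --name containerd --driver=docker-container \
    --driver-opt default-load=true || return 1

  # pre_build: log Docker in to the ECR registry.
  aws ecr get-login-password --region "$REGION" |
    docker login --username AWS --password-stdin "$REGISTRY" || return 1

  # build: import the cache, build, and export the updated cache in one command.
  docker build \
    --builder=containerd \
    --cache-from "type=registry,ref=${REPO_URI}:${CACHE_TAG}" \
    --cache-to "type=registry,ref=${REPO_URI}:${CACHE_TAG},image-manifest=true" \
    --tag "${REPO_URI}:${IMAGE_TAG}" . || return 1

  # post_build: push the application image.
  docker push "${REPO_URI}:${IMAGE_TAG}"
}
```

Calling build_with_ecr_cache from the directory that contains the Dockerfile runs the four phases in order and stops at the first step that fails.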

Testing the solution

After having successfully created the CloudFormation stack, we can proceed to test and evaluate how it performs.

The initial build process took approximately 10 minutes to complete. Since this was the first execution, no cache was available in Amazon ECR, requiring the system to build the image entirely from scratch. Although the cache import operation failed as expected during this initial run, the system continued with the build process without any cached layers. This first build served a dual purpose: it not only created the application image but also generated a cache, which was then exported to Amazon ECR as a separate image. This cached data would become available for future builds, setting the foundation for more efficient subsequent builds.
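
To confirm that the first build really exported a cache image, one option is to query the repository for the cache tag. The sketch below assumes the repository name and cache tag used in this walk-through; show_cache_image is a hypothetical helper name.

```shell
#!/bin/sh
REPO_NAME="amazon_al_lambda_codebuild_image"
CACHE_TAG="demo-cache"

# Hypothetical helper: print when the cache image was pushed and its size in bytes.
# The command exits with an error if no image with that tag exists yet.
show_cache_image() {
  aws ecr describe-images \
    --repository-name "$REPO_NAME" \
    --image-ids "imageTag=${CACHE_TAG}" \
    --query 'imageDetails[0].[imagePushedAt,imageSizeInBytes]' \
    --output text
}
```

Running show_cache_image before and after a build should show the push timestamp change once the cache has been updated.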

To verify the effectiveness of the solution, we triggered a second build after introducing a minor modification – adding an echo command in the middle of the Dockerfile’s active commands (excluding commented lines). During this subsequent build, Docker reused the cached layers up to the point of modification and rebuilt only the layers after it. This resulted in a build time of approximately 6 minutes, clearly demonstrating how the cache shortens the build even when changes are introduced. Further validation across multiple large-scale projects confirmed the effectiveness of this approach, with build time reductions of up to 25%.

We also enabled CodeBuild’s built-in local Docker layer cache, which works on a best-effort basis. This approach is recommended because it uses locally cached layers when they are available instead of downloading them from the repository, which further improves the overall build speed.

Cleaning up

When we have finished testing, we should de-provision the following resources to avoid incurring further charges and to keep the account free of unused resources:

  • Delete the docker images from the Amazon ECR repository amazon_al_lambda_codebuild_image.
  • Delete the CloudFormation stack that was created in the “Walk-through” section.
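
The two clean-up steps above can be sketched with the AWS CLI as follows. The stack name is a hypothetical placeholder (use the name chosen when the stack was launched), and cleanup_demo is an illustrative helper rather than part of the original walk-through.

```shell
#!/bin/sh
REPO_NAME="amazon_al_lambda_codebuild_image"
STACK_NAME="cb-ecr-cache-demo"   # hypothetical: use the name you gave the stack

cleanup_demo() {
  # Remove the application image and the cache image from the repository.
  aws ecr batch-delete-image \
    --repository-name "$REPO_NAME" \
    --image-ids imageTag=demo imageTag=demo-cache || return 1

  # Delete the stack and wait until the deletion has finished.
  aws cloudformation delete-stack --stack-name "$STACK_NAME" || return 1
  aws cloudformation wait stack-delete-complete --stack-name "$STACK_NAME"
}
```

If earlier builds left untagged images behind (each rebuild moves the tag to a new image), those also need to be deleted before the repository can be removed with the stack.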

Conclusion

In this discussion, we explored an efficient and straightforward solution for implementing external Docker caching in CodeBuild using Amazon ECR as a backend storage system. This approach offers several key benefits:

The solution reduces Docker build times in CodeBuild by up to 25% and is versatile enough to handle most scenarios, including complex multi-stage builds. A particularly valuable advantage is that Amazon ECR stores the cache separately in its repository, making it reusable across different projects.

The business impact is substantial: shorter build times lead directly to reduced compute costs. More importantly, this optimization results in a more streamlined development lifecycle, enabling faster feature releases at lower operational costs.

In essence, this caching solution not only improves technical efficiency but also delivers tangible business value through reduced costs and accelerated development cycles.

About the author

Kirubakaran Sundaramoorthy

Kirubakaran Sundaramoorthy is a Cloud Support Engineer specializing in DevOps practices and AWS architecture, with expertise in AWS CloudFormation and CI/CD implementations. He builds efficient cloud infrastructure solutions using automation processes, cloud deployment strategies, infrastructure as code, and DevOps best practices to help businesses succeed.

Deploying Zabbix Components with Docker and Docker Compose

Post Syndicated from Janis Eidaks original https://blog.zabbix.com/deploying-zabbix-components-with-docker-and-docker-compose/30025/

Installing Zabbix from packages can feel overwhelming, due to the availability of different configuration options. The detailed and comprehensive documentation certainly helps to check the purpose of these multiple options, what values can be set in their fields, and if one is required for your planned deployment. There are quite a few official Zabbix blog posts about Zabbix in containers, and this post is aimed at showcasing how additional Zabbix components can be easily set up in a docker environment, along with docker run and docker compose examples.

For those who would prefer to use Zabbix in a containerized environment such as Docker, or who want to try out Zabbix quickly, this guide is for you (you can also check out the other Zabbix Docker blog posts). You can also mix and match Zabbix components installed from packages or built from source with those running in containers.

Please follow the official guide on how to set up Docker.

To better understand the Zabbix architecture for those who are trying out Zabbix for the first time, I will give you an overview that should make it much easier to follow and understand Zabbix.

Zabbix consists of 3 main components (the bare minimum to get started):

  • Zabbix Server – responsible for everything related to data collection, trigger evaluation, event generation, and alerting.
  • Zabbix Frontend – responsible for the configuration (modifying or changing the configuration of the monitoring targets) and visualization (dashboards, graphs, tables, and widgets).
  • Database – this is where the Zabbix configuration and monitoring history data are stored.

You can monitor your targets with the bare minimum setup; however, more comprehensive and complete monitoring can be achieved by using the C-based Zabbix agent or the Go-based Zabbix agent 2 in combination with templates, user parameters, and more. To set up the minimum necessary Zabbix components, you can use this example in the guide.

There are also official guides available on the Zabbix documentation page (for both docker run and docker compose) and on Docker Hub and GitHub.

As of this writing, these official Zabbix Docker components are available on Docker Hub:

  • Zabbix Server (with MySQL/PostgreSQL database)
  • Zabbix Proxy (with MySQL/SQLite3 database)
  • Zabbix Frontend (Apache/Nginx with MySQL/PostgreSQL DB)
  • Zabbix Agent (TLS encryption)
  • Zabbix Agent2 (TLS encryption)
  • Zabbix Java Gateway
  • Zabbix SNMP traps
  • Zabbix Web Service

Tags are used to select which OS a container image will be based on, as well as which Zabbix component version you wish to employ. If you only specify the tag value latest, you will get the latest Zabbix version based on Alpine Linux. The images based on Alpine Linux are more lightweight than those based on other distros.

When something does not work as expected or fails, check the container error logs! This will be useful for debugging purposes and will help to narrow down the cause of an issue. Additionally, when debugging you can also specify additional options, such as specific lines of log, timestamp since or until, or following the log file content.

# docker logs --tail 50 container_name_or_id
    --details        Show extra details provided to logs
-f, --follow         Follow log output
    --since string   Show logs since timestamp (e.g. "2013-01-02T13:23:37Z") or relative (e.g. "42m" for 42 minutes)
-n, --tail string    Number of lines to show from the end of the logs (default "all")
-t, --timestamps     Show timestamps
    --until string   Show logs before a timestamp (e.g. "2013-01-02T13:23:37Z") or relative (e.g. "42m" for 42 minutes)

In some rare cases, when there is a container issue (everything else is correct, worked before, etc.), restarting the docker service can sometimes solve the issue.

So, what is different if you have only used Zabbix installed from packages?  The examples below illustrate the differences in configuration options based on different Zabbix deployment methods: a) package-based/compiled installation, b) docker run command, and c) docker compose file example. First of all, you will have to specify environment variables in the docker run command or docker compose file. The list of available environment variables for each docker image is available in both docker hub and Github.

A). Package-based config

# vi /etc/zabbix/zabbix_server.conf
...
DBName=zabbix
DBUser=zabbix_usr
DBPassword=zabbix_pwd
...

B). Docker run config

docker run --name zbxsrv -t \
...
-e MYSQL_DATABASE=zabbix \
-e MYSQL_USER=zabbix_usr \
-e MYSQL_PASSWORD=zabbix_pwd \
...

C). Docker compose config

# vi /../...yaml
...
  environment:
   - MYSQL_DATABASE=zabbix
   - MYSQL_USER=zabbix_usr
   - MYSQL_PASSWORD=zabbix_pwd

The environment variables are represented as key-value pairs, e.g., VAR=VAL. The values can optionally be unquoted or double-quoted. If some environment variable value contains special characters, you will need to escape them. To properly escape them, check out the docker documentation page.
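
One way to sidestep shell escaping altogether is to keep the VAR=VAL pairs in a file and pass it with the --env-file option of docker run, so the shell never parses the values. The following is a minimal sketch: the file name and the helper run_with_env_file are assumptions, and the other options a real Zabbix server container needs (network, database host, and so on) are left out.

```shell
#!/bin/sh
# Write the variables from the examples above to an env file (one VAR=VAL per line).
ENV_FILE="./zabbix-db.env"   # hypothetical file name
cat > "$ENV_FILE" <<'EOF'
MYSQL_DATABASE=zabbix
MYSQL_USER=zabbix_usr
MYSQL_PASSWORD=zabbix_pwd
EOF
chmod 600 "$ENV_FILE"   # the file contains a password

# Hypothetical helper: start a container with every variable from the file.
# Other options the server needs are intentionally omitted in this sketch.
run_with_env_file() {
  docker run --name zbxsrv -t --env-file "$ENV_FILE" \
    -d zabbix/zabbix-server-mysql:alpine-7.2.4
}
```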

You can create custom, user-defined networks to connect multiple containers to the same network. On such networks, containers can resolve each other by name or alias. If needed, you can assign a specific IP address to a container (if the address is already used, you will get an error).

# docker network create --subnet 172.20.0.0/16 --ip-range 172.20.240.0/20 zabbix-net
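
As a rough illustration of the network created above, the sketch below prints the configured ranges and starts a container with a fixed address. The address 172.20.128.2 is an assumption: it lies inside the subnet but outside the --ip-range used for automatic assignment, so it should not collide with addresses Docker hands out itself. The helper names are hypothetical.

```shell
#!/bin/sh
NETWORK="zabbix-net"
STATIC_IP="172.20.128.2"   # assumption: inside the subnet, outside the dynamic --ip-range

# Print the subnet and IP range configured for the network.
show_network_ranges() {
  docker network inspect "$NETWORK" \
    --format '{{range .IPAM.Config}}{{.Subnet}} {{.IPRange}}{{end}}'
}

# Start a container with a fixed address on the user-defined network.
run_gateway_with_static_ip() {
  docker run --name zabbix-java-gateway-proxy \
    --network "$NETWORK" --ip "$STATIC_IP" \
    --restart unless-stopped \
    -d zabbix/zabbix-java-gateway:alpine-7.2.4
}
```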

Docker run

In this section, we have an example of docker run commands for two Zabbix components: Zabbix proxy and Java gateway. When using custom, user-defined networks, you can use container names for communication between containers instead of using IP addresses. Here, instead of defining the IP address for Zabbix Java gateway, the container name is used. You can set a static IP address for your container or let Docker assign one, but make sure that a changed IP address will not cause issues if the container is recreated and gets a different address. This can become a problem if you use an IP address in some configuration fields instead of a container name.

A lot of parameters are specified using environment variables with the option -e. Also, 3 different ports are exposed on your host machine. To keep the SQLite3 database file upon container deletion, the container directory containing database file is mounted to host directory (the proxy DB is usually used as a buffer storage before sending data to Zabbix server and usually is not used to store data beyond the moment when the data is sent).

docker run --name zabbix-proxy-active-01 \
-e ZBX_HOSTNAME="Zabbix-proxy-active-01" \
-e ZBX_SERVER_HOST=46.101.140.98 \
-e ZBX_PROXYMODE="0" \
-e ZBX_JAVAGATEWAY_ENABLE=true \
-e ZBX_JAVAGATEWAY=zabbix-java-gateway-proxy \
-e ZBX_JAVAGATEWAYPORT=10052 \
-e ZBX_STARTJAVAPOLLERS=5 \
--network=zabbix-net \
-e ZBX_LISTENPORT=10101  \
-p 10101:10101 \
-p 10050:10050 \
-p 10051:10051 \
-v /var/lib/zabbix/db_data:/var/lib/zabbix/db_data \
--restart unless-stopped \
--init -d zabbix/zabbix-proxy-sqlite3:alpine-7.2.4

docker run --name zabbix-java-gateway-proxy \
--network=zabbix-net \
--restart unless-stopped \
-d zabbix/zabbix-java-gateway:alpine-7.2.4

You can start each of these Zabbix components using the docker run command, however, any change to the container configuration will require you to stop the container, delete it, and execute the docker run command again. You also have another option – you could create a docker compose file and write the necessary configuration in yaml format. When you need to add some changes to the container configuration, run the docker compose down command to remove containers, edit the docker compose file, and run docker compose up command to start them up again with the new configuration:

  • docker compose -f ./docker_compose_v3_proxy.yaml down
  • docker compose -f ./docker_compose_v3_proxy.yaml up -d

If you have not mounted a volume or directory into the container for the data you want to keep, you can copy the data from the container to your host. Otherwise, that data will be gone if you delete the container or use the docker compose down command. So, it is important to set up persistent storage for the data that needs retaining, so you don’t lose it when the container configuration is changed. You also need to expose the ports of the necessary services for the appropriate components (if they are set up on separate hosts): zabbix-server, zabbix-proxy, zabbix-agent/zabbix-agent2. The default ports are 10050 for the Zabbix agent (passive checks), 10051 for the Zabbix server or proxy trapper (where active agents and active proxies connect), and 10052 for the Java gateway; a proxy can be given a different listen port, as in the examples here.
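
Copying data out of a container, as mentioned above, can be sketched with docker cp. The container name and source path come from the proxy example in this post; the destination directory and the helper name are hypothetical.

```shell
#!/bin/sh
CONTAINER="zabbix-proxy-active-01"
SRC_DIR="/var/lib/zabbix/db_data"      # path inside the container
DEST_DIR="./zabbix-proxy-db-backup"    # hypothetical host directory

# Copy the directory from the container to the host.
# docker cp also works on a stopped container, but not on a deleted one.
backup_container_dir() {
  docker cp "${CONTAINER}:${SRC_DIR}" "$DEST_DIR"
}
```

Running backup_container_dir before docker compose down (or before deleting the container) keeps a copy of the data on the host.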

Here we have the same docker run options written to docker compose file, including the environment variables, mounted directories and exposed ports. You can specify as many services as needed and start them just with docker compose command.

docker_compose_v3_proxy.yaml

services:
  zabbix-proxy-active-01:
    image: "${PROXY_SQLITE3_IMAGE}:${ALPINE_IMAGE_TAG}"
    environment:
      ZBX_HOSTNAME: Zabbix-proxy-active-01
      ZBX_SERVER_HOST: ${ZBX_SERVER_HOST}
      ZBX_PROXYMODE: 0
      ZBX_LISTENPORT: 10101
      ZBX_JAVAGATEWAY_ENABLE: "true"
      ZBX_JAVAGATEWAY: zabbix-java-gateway-proxy
      ZBX_JAVAGATEWAYPORT: 10052
      ZBX_STARTJAVAPOLLERS: 5
    volumes:
      - /var/lib/zabbix/db_data:/var/lib/zabbix/db_data:rw
    networks:
      - backend
    ports:
      - 10101:10101
      - 10050:10050
      - 10051:10051
    restart: unless-stopped

  zabbix-java-gateway-proxy:
    image: "${JAVA_GW_IMAGE}:${ALPINE_IMAGE_TAG}"
    networks:
      - backend
    restart: unless-stopped

networks:
  backend:
    name: zabbix-net
    external: true

.env

PROXY_SQLITE3_IMAGE=zabbix/zabbix-proxy-sqlite3
JAVA_GW_IMAGE=zabbix/zabbix-java-gateway
ALPINE_IMAGE_TAG=alpine-7.2.4
ZBX_SERVER_HOST=46.101.140.98

You can also use official Zabbix-supplied docker compose files, try them out, and modify them as needed.

You can read more about the official docker compose files here.

Containerized Zabbix components allow us to test different scenarios in Docker:

  • Creating HA Zabbix-server nodes
  • Creating multiple proxies
  • Creating multiple agents
  • Adding more Java gateways
  • Creating multiple frontends
  • Configuring browser monitoring
  • Configuring SNMP traps
  • Making scheduled reports

Deploying multiple redundant Zabbix servers

To enable HA Zabbix server mode, modify both the Zabbix-server container and Zabbix-frontend container configuration environment variables.

For the HA Zabbix server mode, add 2 environment variables:

  • ZBX_HANODENAME
  • ZBX_NODEADDRESS

All of the containers are attached to the user-defined network, so I will use the container name in the ZBX_NODEADDRESS option instead of a static address, as it will be resolved by Docker. If you need to use a different listen port for the trapper, define it using the environment variable ZBX_LISTENPORT. You can omit the port in ZBX_NODEADDRESS, as the ZBX_LISTENPORT value (default is 10051) will be applied automatically.

Here is the docker run example for the Zabbix-server HA mode.

docker run --name zabbix-server-mysql-ha1 -t \
-e DB_SERVER_HOST="mysql-server" \
-e MYSQL_DATABASE="zabbix" \
-e MYSQL_USER="zabbix" \
-e MYSQL_PASSWORD="zabbix_pwd" \
-e MYSQL_ROOT_PASSWORD="root_pwd" \
-e ZBX_HANODENAME="zabbix-server-HA1" \
-e ZBX_NODEADDRESS="zabbix-server-mysql-ha1" \
--network=zabbix-net \
-p 10151:10051 \
--restart unless-stopped \
-d zabbix/zabbix-server-mysql:alpine-7.2.4

docker run --name zabbix-server-mysql-ha2 -t \
-e DB_SERVER_HOST="mysql-server" \
-e MYSQL_DATABASE="zabbix" \
-e MYSQL_USER="zabbix" \
-e MYSQL_PASSWORD="zabbix_pwd" \
-e MYSQL_ROOT_PASSWORD="root_pwd" \
-e ZBX_HANODENAME="zabbix-server-HA2" \
-e ZBX_NODEADDRESS="zabbix-server-mysql-ha2" \
--network=zabbix-net \
-p 10251:10051 \
--restart unless-stopped \
-d zabbix/zabbix-server-mysql:alpine-7.2.4

From the frontend container, remove these two environment variables:

  • ZBX_SERVER_HOST
  • ZBX_SERVER_PORT

In the docker run command below, these are the first two -e lines, which need to be dropped:

docker run --name zabbix-web-nginx-mysql -t \
-e ZBX_SERVER_HOST="zabbix-server-mysql" \
-e ZBX_SERVER_PORT=10051 \
-e DB_SERVER_HOST="mysql-server" \
-e MYSQL_DATABASE="zabbix" \
-e MYSQL_USER="zabbix" \
-e MYSQL_PASSWORD="zabbix_pwd" \
-e MYSQL_ROOT_PASSWORD="root_pwd" \
--network=zabbix-net \
-p 80:8080 \
--restart unless-stopped \
-d zabbix/zabbix-web-nginx-mysql:alpine-7.2.4
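
With the two variables removed, the frontend command might look like the sketch below. It reuses the values from the example above; start_frontend_for_ha is a hypothetical helper name. As I understand the Zabbix HA design, the frontend then reads the address of the active node from the database instead of a fixed server address.

```shell
#!/bin/sh
# Hypothetical helper: start the frontend without ZBX_SERVER_HOST and ZBX_SERVER_PORT,
# so that it is not pinned to a single Zabbix server node.
start_frontend_for_ha() {
  docker run --name zabbix-web-nginx-mysql -t \
    -e DB_SERVER_HOST="mysql-server" \
    -e MYSQL_DATABASE="zabbix" \
    -e MYSQL_USER="zabbix" \
    -e MYSQL_PASSWORD="zabbix_pwd" \
    -e MYSQL_ROOT_PASSWORD="root_pwd" \
    --network=zabbix-net \
    -p 80:8080 \
    --restart unless-stopped \
    -d zabbix/zabbix-web-nginx-mysql:alpine-7.2.4
}
```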

Once both container configurations are modified, you should be able to see the currently added HA server nodes and their states without issues.

Fig. 1. HA Zabbix server containers

Fig. 2. Dashboard – system information

You can also execute commands on the container:

# docker exec -it container_name_or_id sh -c "zabbix_server -R ha_status"

Fig. 3. Executing command on container

It’s possible to allocate an interactive pseudo-TTY shell by adding the option -ti and specifying a shell after the container name or ID.

# docker exec -ti container_name_or_id /bin/bash

Fig. 4. Executing command from within container

You can also start multiple proxies at once in docker. This can help to offload preprocessing to the proxy, gather data from the targets behind the firewall, and send collected data back to the Zabbix server, only requiring one port.

Fig. 5. Overall block diagram of Zabbix monitoring opportunities

Deploying multiple Zabbix proxies

First, you must choose the proxy mode and set the environment variable ZBX_PROXYMODE.

For active mode proxy, please define the server host address for a single server or addresses separated by a semicolon in the case of HA Zabbix server configuration (example shown below).

docker run --name zabbix-proxy-active-01 \
-e ZBX_HOSTNAME="Zabbix-proxy-active-01" \
-e ZBX_SERVER_HOST="zabbix-server-mysql-ha1;zabbix-server-mysql-ha2;zabbix-server-mysql-ha3" \
-e ZBX_PROXYMODE="0" \
--network=zabbix-net \
-e ZBX_LISTENPORT=10101  \
-p 10101:10101 \
-v /var/lib/zabbix/db_data:/var/lib/zabbix/db_data \
--restart unless-stopped \
--init -d zabbix/zabbix-proxy-sqlite3:alpine-7.2.4

For passive mode proxy, define the server host address for a single server or addresses separated by a comma in the case of HA Zabbix server configuration (example shown below).

docker run --name zabbix-proxy-passive-01 \
-e ZBX_HOSTNAME="Zabbix-proxy-passive-01" \
-e ZBX_SERVER_HOST="zabbix-server-mysql-ha1,zabbix-server-mysql-ha2,zabbix-server-mysql-ha3" \
-e ZBX_PROXYMODE="1" \
--network=zabbix-net \
-e ZBX_LISTENPORT=10102 \
-p 10102:10102 \
-v /var/lib/zabbix/db_data:/var/lib/zabbix/db_data \
--restart unless-stopped \
--init -d zabbix/zabbix-proxy-sqlite3:alpine-7.2.4
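
Because the separator differs between the two proxy modes (semicolon for active, comma for passive), it can help to build the ZBX_SERVER_HOST value from a single list of node names. The sketch below is only an illustration; join_by and the variable names are hypothetical, and the node names are the ones used in the examples above.

```shell
#!/bin/sh
# join_by SEP ITEM...: print the items joined by SEP.
join_by() {
  sep="$1"
  shift
  joined=""
  for item in "$@"; do
    if [ -z "$joined" ]; then
      joined="$item"
    else
      joined="${joined}${sep}${item}"
    fi
  done
  printf '%s\n' "$joined"
}

HA_NODES="zabbix-server-mysql-ha1 zabbix-server-mysql-ha2 zabbix-server-mysql-ha3"

# Active proxy: HA node addresses separated by semicolons.
# Passive proxy: addresses separated by commas.
# $HA_NODES is intentionally unquoted so that it splits into one argument per node.
ACTIVE_SERVER_HOST=$(join_by ';' $HA_NODES)
PASSIVE_SERVER_HOST=$(join_by ',' $HA_NODES)

echo "$ACTIVE_SERVER_HOST"
echo "$PASSIVE_SERVER_HOST"
```

The resulting values can then be passed to docker run, for example as -e ZBX_SERVER_HOST="$ACTIVE_SERVER_HOST" for the active proxy.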

docker_compose_v3_proxies.yaml

services:
  zabbix-proxy-active-01:
    image: "${PROXY_SQLITE3_IMAGE}:${ALPINE_IMAGE_TAG}"
    environment:
      ZBX_HOSTNAME: zabbix-proxy-active-01
      ZBX_SERVER_HOST: zabbix-server-mysql-ha1;zabbix-server-mysql-ha2;zabbix-server-mysql-ha3
      ZBX_PROXYMODE: 0
      ZBX_LISTENPORT: 10101
    volumes:
      - /var/lib/zabbix/db_data:/var/lib/zabbix/db_data:rw
    networks:
      - backend
    ports:
      - 10101:10101
    restart: unless-stopped

  zabbix-proxy-passive-01:
    image: "${PROXY_SQLITE3_IMAGE}:${ALPINE_IMAGE_TAG}"
    environment:
      ZBX_HOSTNAME: zabbix-proxy-passive-01
      ZBX_SERVER_HOST: zabbix-server-mysql-ha1,zabbix-server-mysql-ha2,zabbix-server-mysql-ha3
      ZBX_PROXYMODE: 1
      ZBX_LISTENPORT: 10102
    volumes:
      - /var/lib/zabbix/db_data:/var/lib/zabbix/db_data:rw
    networks:
      - backend
    ports:
      - 10102:10102
    restart: unless-stopped
networks:
  backend:
    name: zabbix-net
    external: true

.env

PROXY_SQLITE3_IMAGE=zabbix/zabbix-proxy-sqlite3
JAVA_GW_IMAGE=zabbix/zabbix-java-gateway
ALPINE_IMAGE_TAG=alpine-7.2.4
ZBX_SERVER_HOST=46.101.140.98

The proxy name in the frontend must be the same as the value set in the proxy environment variable ZBX_HOSTNAME! Also, for active proxies you don’t need to add the proxy address in the frontend.

Next, you can set hosts to be monitored by Zabbix-proxies, but make sure to update the agent configuration, so agents accept connections from proxy.
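
As a sketch of the agent-side change mentioned above, the function below prints the three agent configuration lines that point an agent at a proxy. Server, ServerActive, and Hostname are standard agent parameters; the function name, the host name monitored-host-01, and the choice of proxy are illustrative assumptions.

```shell
#!/bin/sh
# agent_conf_for_proxy PROXY_ADDRESS PROXY_PORT AGENT_HOSTNAME
# Prints the agent settings needed to be monitored through a proxy.
agent_conf_for_proxy() {
  printf 'Server=%s\n' "$1"                 # passive checks: accept connections from the proxy
  printf 'ServerActive=%s:%s\n' "$1" "$2"   # active checks: connect to the proxy's listen port
  printf 'Hostname=%s\n' "$3"               # must match the host name configured in the frontend
}

agent_conf_for_proxy zabbix-proxy-active-01 10101 monitored-host-01
```

The printed lines belong in the agent configuration file (for example zabbix_agentd.conf or zabbix_agent2.conf). A container name such as zabbix-proxy-active-01 only resolves for agents on the same Docker network; agents elsewhere need the proxy host's address and the published port.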

Fig. 6. Hosts monitored by proxy

Fig. 7. List of proxies and hosts monitored by them

Configuring Proxy groups

You can create as many proxy containers as necessary in Docker, and you can also create proxy groups for load balancing (it is based on the number of hosts per proxy).

First, create a proxy group in the frontend:

  • Set the proxy group name
  • Select the failover period
  • Set the minimum number of proxies

Fig. 8. Creating a new proxy group

Next, add proxies to the proxy group, and specify the address for active agents and port for the active agents.

Fig. 9. Adding proxy to proxy group

Do not forget to change the Zabbix agent configuration for hosts now monitored through the proxy group (add the IPs/DNS names of the proxies in the group to the Server and ServerActive options).

Fig. 10. Creating a new host and monitoring it through proxy group

You can see additional information regarding the proxies in the frontend section Administration → Proxies.

Fig. 11. List of all configured proxies and those belonging to proxy group

Adding more Java gateways

A Zabbix server or proxy can communicate with only one Zabbix Java gateway. However, you are not limited in how many Zabbix proxies you create together with a Zabbix Java gateway: you can make an unlimited number of pairs, each consisting of a Zabbix proxy and a Zabbix Java gateway.

For the containerized Zabbix server, you will need to add these 4 environment variables:

ZBX_JAVAGATEWAY_ENABLE=true
ZBX_JAVAGATEWAY=zabbix-java-gateway-server
ZBX_JAVAGATEWAYPORT=10052
ZBX_STARTJAVAPOLLERS=5

And start the Java gateway for the zabbix-server in docker:

docker run --name zabbix-java-gateway-server -t \
--network=zabbix-net \
--restart unless-stopped \
-d zabbix/zabbix-java-gateway:alpine-7.2.4

Or if you want to add java gateway to the Zabbix proxy, then add these 4 environment variables to Zabbix proxy in docker:

ZBX_JAVAGATEWAY_ENABLE=true
ZBX_JAVAGATEWAY=zabbix-java-gateway-proxy
ZBX_JAVAGATEWAYPORT=10052
ZBX_STARTJAVAPOLLERS=5

And start the java gateway as a container:

docker run --name zabbix-java-gateway-proxy -t \
--network=zabbix-net \
--restart unless-stopped \
-d zabbix/zabbix-java-gateway:alpine-7.2.4

And here we have a host, monitored by zabbix-agent2 through zabbix-proxy-active-02

Fig. 12. Host monitored by proxy with configured Java gateway

Upgrading docker proxies with SQLite3 database

If you have older Zabbix components already running in docker and you have upgraded the server, you will also need to upgrade the proxies.

If you have a container created from the proxy zabbix-proxy-sqlite3 image and want to upgrade it, you will lose the existing data stored in the SQLite3 database. For most users, the database functions as a buffer to temporarily keep the data until it’s sent to Zabbix server and the loss of the proxy database file data is of no consequence.

Once you have updated the image for the container, the proxy will detect the old database version on startup. If the database file is stored in a mounted directory, the proxy will delete it and create a new one. This will affect those who keep data after sending it to Zabbix server and use the data from the proxy database for other purposes.

Fig. 13. Database upgrade for proxy container with SQLite3 database

Upgrading docker proxies with MySQL database

To upgrade the MySQL database for the proxy, log in to the MySQL database and set the log_bin_trust_function_creators flag to 1. Then change the proxy image version to a newer one and start the container.

mysql> set global log_bin_trust_function_creators = 1;

If you have not set the flag, the database upgrade will fail with an error.

Fig. 14. Failed database upgrade for proxy with MySQL database

Replace the previous version of the proxy image with the new one, check the log file, and check the docker logs to see when the database schema upgrade has finished. After the upgrade, set the flag back to 0.

mysql> set global log_bin_trust_function_creators = 0;

The upgrade has been successful, and the proxy service has started after that.

Fig. 15. Successful database upgrade for proxy with MySQL database

An official docker image for the proxy with PostgreSQL database support is not available due to the extensive number of existing images and different versions.

Deploying multiple frontends

You can launch as many frontends as you need if you are experiencing a sudden surge in Zabbix users. Just specify which port to assign for it and you are good to go (don’t forget to also open the port in the firewall).

docker run --name zabbix-web-nginx-mysql1 -t \
-e DB_SERVER_HOST="mysql-server" \
-e MYSQL_DATABASE="zabbix" \
-e MYSQL_USER="zabbix" \
-e MYSQL_PASSWORD="zabbix_pwd" \
-e MYSQL_ROOT_PASSWORD="root_pwd" \
--network=zabbix-net \
-p 80:8080 \
--restart unless-stopped \
-d zabbix/zabbix-web-nginx-mysql:alpine-7.2.4

Fig. 16. One started Zabbix frontend container in docker

docker run --name zabbix-web-nginx-mysql2 -t \
-e DB_SERVER_HOST="mysql-server" \
-e MYSQL_DATABASE="zabbix" \
-e MYSQL_USER="zabbix" \
-e MYSQL_PASSWORD="zabbix_pwd" \
-e MYSQL_ROOT_PASSWORD="root_pwd" \
--network=zabbix-net \
-p 81:8080 \
--restart unless-stopped \
-d zabbix/zabbix-web-nginx-mysql:alpine-7.2.4

Fig. 17. Two started Zabbix frontend containers in docker

docker run --name zabbix-web-nginx-mysql3 -t \
-e DB_SERVER_HOST="mysql-server" \
-e MYSQL_DATABASE="zabbix" \
-e MYSQL_USER="zabbix" \
-e MYSQL_PASSWORD="zabbix_pwd" \
-e MYSQL_ROOT_PASSWORD="root_pwd" \
--network=zabbix-net \
-p 82:8080 \
--restart unless-stopped \
-d zabbix/zabbix-web-nginx-mysql:alpine-7.2.4

Fig. 18. Three started Zabbix frontend containers in docker

Fig. 19. Multiple frontends accessed through different ports
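The three commands above differ only in the container name and the host port. As a rough sketch, a small Python helper could generate them instead of copying the block by hand. The function name frontend_command and the base_port parameter are my own illustrative choices; the image tag, environment variables, and network name are copied from the commands above.

```python
import shlex

# Values copied from the docker run commands above.
IMAGE = "zabbix/zabbix-web-nginx-mysql:alpine-7.2.4"
ENV = {
    "DB_SERVER_HOST": "mysql-server",
    "MYSQL_DATABASE": "zabbix",
    "MYSQL_USER": "zabbix",
    "MYSQL_PASSWORD": "zabbix_pwd",
    "MYSQL_ROOT_PASSWORD": "root_pwd",
}


def frontend_command(index, base_port=80):
    """Return the docker run argument list for frontend number `index` (1-based)."""
    args = ["docker", "run", "--name", f"zabbix-web-nginx-mysql{index}", "-t"]
    for key, value in ENV.items():
        args += ["-e", f"{key}={value}"]
    args += [
        "--network=zabbix-net",
        # Host port 80 for the first frontend, 81 for the second, and so on.
        "-p", f"{base_port + index - 1}:8080",
        "--restart", "unless-stopped",
        "-d", IMAGE,
    ]
    return args


if __name__ == "__main__":
    # Print the three commands so they can be reviewed before running them.
    for i in range(1, 4):
        print(shlex.join(frontend_command(i)))
```

Printing the commands rather than executing them keeps the helper harmless; you can paste the output into a shell once it looks right.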

Browser monitoring

Browser monitoring setup has never been easier! Just add two parameters to zabbix-server container config:

ZBX_WEBDRIVERURL=selenium:4444
ZBX_STARTBROWSERPOLLERS=2

And start the web driver in the docker (with a standalone chrome browser):

docker run --name selenium -t \
--network=zabbix-net \
--restart unless-stopped \
-p 4444:4444 \
--shm-size="1g" \
-d selenium/standalone-chrome:latest
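Before pointing Zabbix at the web driver, it can help to confirm that the container is actually ready. The sketch below queries the standard WebDriver status endpoint and reads its ready flag. It assumes the driver is reachable on localhost through the 4444:4444 port mapping above; the function names are illustrative.

```python
import json
import urllib.request


def is_ready(status_body):
    """Return True if a WebDriver /status response body reports ready."""
    payload = json.loads(status_body)
    return bool(payload.get("value", {}).get("ready", False))


def check_webdriver(base_url="http://localhost:4444", timeout=5):
    """Query the status endpoint; base_url is an assumption based on the port mapping above."""
    with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as response:
        return is_ready(response.read())


if __name__ == "__main__":
    print("web driver ready:", check_webdriver())
```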

Next step: create a new host, add the template, specify which page to monitor with Macro values, and it’s DONE!!!!

Fig. 20. Creating host for monitoring website

Fig. 21. Screenshot of the monitored website

SNMP traps

For the snmptraps to work, the same directory must be shared among the zabbix-server and zabbix-snmptrap container. On the Zabbix-server side, you need to explicitly set snmp environment variable ZBX_ENABLE_SNMP_TRAPS to true and mount directory /var/lib/zabbix/snmptraps.

You also need to add the same volume to the snmptrap container.

And run the snmptraps container (make sure there is no permission issue for the directory)

docker run --name zabbix-snmptraps -t \
-v /var/lib/zabbix/snmptraps:/var/lib/zabbix/snmptraps:rw \
--network=zabbix-net \
-p 162:1162/udp \
--restart unless-stopped \
-d zabbix/zabbix-snmptraps:alpine-7.2-latest

Fig. 22. Received SNMP trap message

Scheduled reports

You can also easily configure scheduled reports by adding 2 additional environment variables to the Zabbix-server. In my case, both of these containers are in the same custom user network, therefore I will use the container name zabbix-web-service in the ZBX_WEBSERVICEURL option.

ZBX_STARTREPORTWRITERS=5
ZBX_WEBSERVICEURL=http://zabbix-web-service:10053/report

Start the Zabbix web service and also specify these 2 parameters (you can skip them if the defaults are used). You can also allow any incoming connections by setting ZBX_ALLOWEDIP=0.0.0.0/0. We discourage this, however.

ZBX_ALLOWEDIP=zabbix-server-mysql
ZBX_LISTENPORT=10053

Before testing scheduled reports, make sure you have enabled and configured the email media type.

Fig. 23. Configured and enabled media type

It is also encouraged to test it and check that you have received the test email.

Fig. 24. Successful media type test response

Fig. 25. Received test response on the selected media type.

Next, configure the user media where the scheduled report will be sent.

Fig. 26. Media type defined for the user

Last, but not least, set the frontend URL in the Administration/General/Other section. In my case, I set the container name of the frontend and specified the port.

for Apache: http://<server_ip_or_name>/zabbix
for Nginx: http://<server_ip_or_name>

Fig. 27. Configured frontend address for the Frontend URL option

Next, create a scheduled report based on the dashboard of your choice.

Fig. 28. Configuring scheduled report

Check that you have received the test report in your mail.

Fig. 29. Successful scheduled report test.

Fig. 30. Received scheduled report test in the email

Now you know how to set up scheduled reports!

Docker container monitoring

You can also monitor Docker containers with a containerized Zabbix instance*

* Disclaimer: If docker service is not running, Zabbix monitoring will also not function and you will not receive notifications and alerts.

You can also monitor your docker instance with Zabbix agent 2. However, you will need to install Zabbix agent 2 on the host, either as a package or built from source.

You will also need to give user zabbix access to the docker.sock file. Just add user zabbix to group docker:

# usermod -aG docker zabbix

Otherwise, you will get an error message in items:

Cannot fetch data: Get "http://1.28/info": dial unix /var/run/docker.sock: connect: permission denied.
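If you want to check the permissions before waiting for the item to fail, a short script run as the zabbix user can tell you whether the socket is usable. This is only a sketch: the function name and return strings are illustrative, and it relies on the Linux rule that connecting to a Unix socket requires write permission on the socket file.

```python
import os
import stat

DOCKER_SOCKET = "/var/run/docker.sock"  # path taken from the error message above


def socket_access(path=DOCKER_SOCKET):
    """Describe whether the current user can talk to the Docker socket."""
    if not os.path.exists(path):
        return "missing"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return "not a socket"
    # On Linux, connecting to a Unix socket needs write permission on the file.
    if os.access(path, os.W_OK):
        return "ok"
    return "permission denied"


if __name__ == "__main__":
    print(DOCKER_SOCKET, "->", socket_access())
```

For example, running it with sudo -u zabbix python3 followed by the script name shows the result for the zabbix user rather than for your own account.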

Go back to the frontend and create a Host for monitoring the docker containers:

  • Link template: Docker by Zabbix agent 2
  • Add host to host group
  • Specify host address or dns name, set the correct connect to option, and specify the agent port (if a default port is used, then set 10050).

Fig. 31. Configuring the host for monitoring the docker container

Now, if an issue occurs in one of the other containers, Zabbix will detect it. But to be notified of an issue, don’t forget to enable and configure the media, user media, media templates, and trigger actions, so that you receive alerts.

Fig. 32. Latest data for the docker host

Thank you for reading – I hope you’ve found this article helpful and informative!

The post Deploying Zabbix Components with Docker and Docker Compose appeared first on Zabbix Blog.

Accelerate Serverless Streamlit App Deployment with Terraform

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/accelerate-serverless-streamlit-app-deployment-with-terraform/

Image depicting the HashiCorp Terraform and Amazon Web Services (AWS) logos. Underneath the AWS logo are AWS service logos for Amazon Elastic Container Service (ECS), AWS CodePipeline, AWS CodeBuild, and Amazon CloudFront

Graphic created by Kevon Mayers.

Introduction

As customers increasingly seek to harness the power of generative AI (GenAI) and machine learning to deliver cutting-edge applications, the need for a flexible, intuitive, and scalable development platform has never been greater. In this landscape, Streamlit has emerged as a standout tool, making it easy for developers to prototype, build, and deploy GenAI-powered apps with minimal friction. It is an open-source Python framework designed to simplify the development of custom web applications for data science, machine learning, and GenAI projects. With Streamlit, developers can quickly transform Python scripts into interactive dashboards, LLM-powered chatbots, and web apps, using just a few lines of code. Its unique combination of simplicity, interactivity, and speed is the perfect complement to the rapid advancements in AI.

When deploying Streamlit applications, customers often face the challenge of ensuring their applications are highly available and can scale to meet variable demand. To achieve these goals, customers are looking at serverless approaches to deploying their Streamlit apps. With a serverless application, you only pay for the resources required and do not have to worry about managing servers or capacity planning.

In this post, we will walk you through deploying containerized, serverless Streamlit applications automatically via HashiCorp Terraform, an Infrastructure as Code (IaC) tool that enables users to define and provision infrastructure across cloud platforms.

Solution Overview

For this solution, we have the Streamlit app running on an Amazon Elastic Container Service (ECS) cluster across multiple availability zones (AZs), using AWS Fargate to manage the compute. Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building apps without managing servers. Using Fargate helps reduce the undifferentiated heavy lifting that can come with building and maintaining web applications. It is also often desirable to use a Content Delivery Network (CDN) to ensure low latency for users globally by caching the content at edge locations closer to where the users are geographically located.

Let’s zoom in on the two architectures – the Streamlit App hosting architecture, and the Streamlit App deployment pipeline.

Streamlit app hosting

Image depicting the AWS data flow architecture for the solution. The architecture shows an Amazon Elastic Container Service (ECS) cluster that spans across two availability zones. Within each availability zone are a public and private subnet. A NAT gateway is within the public subnet, and an ECS Cluster with AWS Fargate deployment type is in the private subnet. An Internet Gateway (IGW) is used to allow traffic to flow through the NAT Gateway out to the internet. An Application Load Balancer (ALB) is used to distribute the load to the ECS cluster. Amazon CloudFront is used as the content delivery network (CDN).

In the above architecture, the following flow applies:

  1. Users access the Streamlit App using the public DNS endpoint for an Amazon CloudFront distribution.
  2. Using an Internet Gateway (IGW), user requests are routed to a public-facing Application Load Balancer (ALB).
  3. This ALB has target groups which map to ECS task nodes that are part of an ECS cluster running in two AZs (us-east-1a and us-east-1b in this example).
  4. Fargate will automatically scale the underlying compute nodes in the ECS cluster based on the demand.

Streamlit app deployment pipeline

Image depicting the Streamlit app deployment pipeline architecture. Within it, a developer uploads a .zip file called streamlit-app-assets.zip to an Amazon S3 Bucket. This upload event is processed by Amazon EventBridge, which in turn invokes an AWS CodePipeline to run. Related artifacts are stored in a connected CodePipeline S3 bucket. CodePipeline orchestrates an AWS CodeBuild project that creates a new Docker image using the .zip file that was uploaded, and stores it in an Amazon Elastic Container Registry (ECR) repository. This image upload triggers a new Amazon Elastic Container Service (ECS) deployment. Terraform then creates an Amazon CloudFront invalidation to serve the new version of the application to customers.

In the above architecture, the following flow applies:

  1. The user develops a Streamlit App locally and defines the path to its assets in the module configuration. Running terraform apply then generates a local .zip file of the Streamlit App directory and uploads it to an Amazon S3 bucket (Streamlit Assets) with versioning enabled, which is configured to trigger the Streamlit CI/CD pipeline.
  2. AWS CodePipeline (Streamlit CI/CD pipeline) begins running. The pipeline copies the .zip file from the Streamlit Assets S3 Bucket, stores the contents in a connected CodePipeline Artifacts S3 bucket, and passes the asset to the AWS CodeBuild project that is also part of the pipeline.
  3. CodeBuild (Streamlit CodeBuild Project) configures a compute/build environment and fetches a Python Docker Image from a public Amazon ECR repository. CodeBuild uses Docker to build a new Streamlit App image based on what is defined in the Dockerfile within the .zip file, and pushes the new image to a private ECR repository. It tags the image with latest, an app_version (user-defined in Terraform), as well as the S3 Version ID of the .zip file and pushes the image to ECR.
  4. ECS has a task definition that references the image in ECR based on the S3 Version ID tag which will always be a unique value, as it is generated whenever a new version of the file is created. This also serves as data lineage so versions of the Streamlit App .zip files in S3 can be linked to versions of the image stored in ECR. Once a new image is pushed to ECR (with a unique image tag), the task definition is updated and the ECS service begins a new deployment using the new version of the Streamlit App.
  5. When a new image is pushed to ECR, the Terraform Module is configured to use the local-exec provisioner to run an AWS CLI command that creates a CloudFront invalidation. This enables users of the Streamlit app to use the new version without waiting for the time-to-live (TTL) of the cached file to expire on the edge locations (default is 24 hours).

Both of these pipelines are built and packaged into a Terraform module that can be reused efficiently with only a few lines of code.
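To make steps 3 to 5 of the deployment flow more concrete, here is a minimal Python sketch of the two pieces of data they revolve around: the set of image tags and the CloudFront invalidation call. The function names, the sample version ID, and the distribution ID are hypothetical, and the post does not show the exact command the module runs, so treat the invalidation call as one plausible AWS CLI form rather than the module's implementation.

```python
def image_tags(app_version, s3_version_id):
    """Tags described in step 3: a moving tag, the app version, and the S3 Version ID.

    The version ID is passed through unchanged here; the real module may
    normalise it before using it as an image tag.
    """
    return ["latest", app_version, s3_version_id]


def invalidation_command(distribution_id, paths=("/*",)):
    """Build an AWS CLI call that invalidates cached paths on a CloudFront distribution."""
    return [
        "aws", "cloudfront", "create-invalidation",
        "--distribution-id", distribution_id,
        "--paths", *paths,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(image_tags("v1.1.0", "example-version-id"))
    print(" ".join(invalidation_command("EXAMPLEDISTID")))
```

Because the S3 Version ID changes with every upload, the third tag is the one that lets the ECS task definition pin an exact build, while latest simply moves along.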

Prerequisites

This solution requires the following prerequisites:

  • An AWS account. If you don’t have an account, you can sign up for one.
  • Terraform v1.0.0 or newer installed.
  • Python v3.8 or newer installed.
  • A Streamlit app. If you don’t have a Streamlit project already, you can download this app directory as a sample Streamlit app for this post and save it to a local folder.

Your folder structure will look something like this:

terraform_streamlit_folder
├── README.md
└── app                 # Streamlit app directory
    ├── home.py         # Streamlit app entry point
    ├── Dockerfile      # Dockerfile
    └── pages/          # Streamlit pages

Create and initialize a Terraform project

In the same folder where you have saved your Streamlit app (terraform_streamlit_folder in the above example), you will create and initialize a new Terraform project.

  1. In your preferred terminal, create a new file named main.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
    touch main.tf
  2. Open up the main.tf file and add the following code to it:
    module "serverless-streamlit-app" {
      source          = "aws-ia/serverless-streamlit-app/aws"
      app_name        = "streamlit-app"
      app_version     = "v1.1.0" 
      path_to_app_dir = "./app" # Replace with path to your app
    }

    This code utilizes a module block with a source pointing to the Terraform module, and the appropriate input variables passed in. When Terraform encounters a module block, it loads and processes that module’s configuration files using the source. The Serverless Streamlit App Terraform module has many optional input variables. If you have existing resources, such as an existing VPC, subnets, and security groups that you’d like to reuse instead of deploying new ones, you can use the module’s input variables to reference your existing resources. However, in this post, we’re deploying all of the resources in the above architecture from scratch. Here, we simply define the source that references the module hosted in the Terraform Registry, provide an app_name that will be used as a prefix for naming your resources, the app_version that is used for tracking changes to your app, and the path_to_app_dir which is the path to the local directory where the assets for your Streamlit app are stored.

  3. Save the file.
  4. To initialize the Terraform working directory, run the following command in your terminal:
    terraform init

    The output will contain a successful message like the following:

    "Terraform has been successfully initialized"

Output the CloudFront URL

To easily access the CloudFront URL of the deployed Streamlit application, you can add the URL as a Terraform output.

  1. In your terminal, create a new file named outputs.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
    touch outputs.tf
  2. Open up the outputs.tf file and add the following code to it:
    output "streamlit_cloudfront_distribution_url" {
      value = module.serverless-streamlit-app.streamlit_cloudfront_distribution_url
    }
  3. Save the file.
    Now, your folder structure will look like:

    terraform_streamlit_folder
    ├── README.md
    ├── app                 # Streamlit app directory
    │   ├── home.py         # Streamlit app entry point
    │   ├── Dockerfile      # Dockerfile
    │   └── pages/          # Streamlit pages
    │     
    ├── main.tf             # Terraform Code (where you call the module) 
    └── outputs.tf          # Outputs definition

Deploy the solution

Now you can use Terraform to deploy the resources defined in your main.tf file.

  1. In your terminal, run the following command to deploy the infrastructure. This includes the hosting for your Streamlit application using ECS and CloudFront, as well as the pipeline that is used to push updates.
    terraform apply

    When the apply command finishes running, you’ll see the Terraform outputs displayed in the terminal.

  2. Navigate to the streamlit_cloudfront_distribution_url to see your Streamlit application that is hosted on AWS.
  3. When you make changes to your Streamlit codebase, you can go ahead and re-run terraform apply to push your new changes to your cloud environment.
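If you want to pick up the URL from a script instead of reading it off the terminal, the output can be read back with terraform output -json. The following is a small sketch under the assumption that the apply has already completed in the working directory; the helper names are illustrative, and the output name matches the one defined in outputs.tf above.

```python
import json
import subprocess

OUTPUT_NAME = "streamlit_cloudfront_distribution_url"  # defined in outputs.tf


def parse_output(raw_json, name=OUTPUT_NAME):
    """Extract one output value from the JSON printed by `terraform output -json`."""
    outputs = json.loads(raw_json)
    return outputs[name]["value"]


def read_output(workdir="."):
    """Run terraform in `workdir` (already applied) and return the CloudFront URL."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir, check=True, capture_output=True, text=True,
    )
    return parse_output(result.stdout)


if __name__ == "__main__":
    print(read_output())
```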

When updating the Streamlit codebase, the CodePipeline and CodeBuild processes kick off to automatically update your new changes, which get reflected on your Streamlit application. CodePipeline automates the entire software release process, managing stages like source retrieval, building, testing, and deployment. It integrates with AWS services and third-party tools (such as GitHub and Jenkins) to enhance automation, speed, and security. CodeBuild focuses on automating code compilation, testing, and packaging, supporting multiple languages and custom Docker environments, while integrating with CodePipeline for scalable, secure builds. With this CI/CD pipeline, when you make changes to your code, all you need to run is terraform apply to update your cloud environment. For an example buildspec, see the example in the repo.

You can find full examples of deploying the infrastructure with and without existing resources in the GitHub repository.

Clean up

When you no longer need the resources deployed in this post, you can clean up the resources by using the Terraform destroy command. Simply run terraform destroy. This will remove all of the resources you have deployed in this post with Terraform.

Conclusion

Building serverless Streamlit applications with Terraform on AWS offers a powerful combination of scalability, efficiency, and automation. As you continue to build and refine your Streamlit applications, Terraform’s flexibility ensures that your infrastructure can evolve seamlessly, supporting rapid innovation and agile development. With Streamlit and Terraform, you have the tools to create dynamic, serverless applications that scale effortlessly and operate reliably in the cloud.

Authors

Image depicting Kevon Mayers, a Solutions Architect at AWS

Kevon Mayers

Kevon Mayers is a Solutions Architect at AWS. Kevon is a Terraform Contributor and has led multiple Terraform initiatives within AWS. Prior to joining AWS he was working as a DevOps Engineer and Developer, and before that was working with the GRAMMYs/The Recording Academy as a Studio Manager, Music Producer, and Audio Engineer. He also owns a professional production company, MM Productions.

Image depicting Alexa Perlov, a Prototyping Architect at AWS

Alexa Perlov

Alexa Perlov is a Prototyping Architect with the Prototyping Acceleration team at AWS. She helps customers build with emerging technologies by open sourcing repeatable projects. She is currently based out of Pittsburgh, PA.

Image depicting Shravani Malipeddi, a Solutions Architect at AWS

Shravani Malipeddi

Shravani Malipeddi is a Solutions Architect at AWS who came out of the TechU Program. She currently supports strategic accounts and is based out of San Francisco, CA.

Evolution of Catwalk: Model serving platform at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/catwalk-evolution

Introduction

As Southeast Asia’s leading super app, Grab serves millions of users across multiple countries every day. Our services range from ride-hailing and food delivery to digital payments and much more. The backbone of our operations? Machine Learning (ML) models. They power our real-time decision-making capabilities, enabling us to provide a seamless and personalised experience to our users. Whether it’s determining the most efficient route for a ride, suggesting a food outlet based on a user’s preference, or detecting fraudulent transactions, ML models are at the forefront.

However, serving these ML models at Grab’s scale is no small feat. It requires a robust, efficient, and scalable model serving platform, which is where our ML model serving platform, Catwalk, comes in.

Catwalk has evolved over time, adapting to the growing needs of our business and the ever-changing tech landscape. It has been a journey of continuous learning and improvement, with each step bringing new challenges and opportunities.

Evolution of the platform

Phase 0: The need for a model serving platform

Before Catwalk’s debut as our dedicated model serving platform, data scientists across the company employed various ad-hoc approaches to serve ML models. These included:

  • Shipping models online using custom solutions.
  • Relying on backend engineering teams to deploy and manage trained ML models.
  • Embedding ML logic within Go backend services.

These methods, however, led to several challenges, underscoring the need for a unified, company-wide platform for serving machine learning models:

  • Operational overhead: Data scientists often lacked the necessary expertise to handle the operational aspects of their models, leading to service outages.
  • Resource wastage: There was frequently low resource utilisation (e.g., 1%) for data science services, leading to inefficient use of resources.
  • Friction with engineering teams: Differences in release cycles and unclear ownership when code was embedded into backend systems resulted in tension between data scientists and engineers.
  • Reinventing the wheel: Multiple teams independently attempted to solve model serving problems, leading to a duplication of effort.

These challenges highlighted the need for a company-wide, centralised platform for serving machine learning models.

Phase 1: No-code, managed platform for TensorFlow Serving models

Our initial foray into model serving was centred around creating a managed platform for deploying TensorFlow Serving models. The process involved data scientists submitting their models to the platform’s engineering admin, who could then deploy the model with an endpoint. Infrastructure and networking were managed using Amazon Elastic Kubernetes Service (EKS) and Helm Charts as illustrated below.


This phase of our platform, which we also detailed in our previous article, was beneficial for some users. However, we quickly encountered scalability challenges:

  • Codebase maintenance: Applying changes to every TensorFlow Serving (TFS) version was cumbersome and difficult to maintain.
  • Limited scalability: The fully managed nature of the platform made it difficult to scale.
  • Admin bottleneck: The engineering admin’s limited bandwidth became a bottleneck for onboarding new models.
  • Limited serving types: The platform only supported TensorFlow, limiting its usefulness for data scientists using other frameworks like LightGBM, XGBoost, or PyTorch.

After a year of operation, only eight models were onboarded to the platform, highlighting the need for a more scalable and flexible solution.

Phase 2: From models to model serving applications

To address the limitations of Phase 1, we transitioned from deploying individual models to self-contained model serving applications. This “low-code, self-serving” strategy introduced several new components and changes as illustrated in the points and diagram below:

  • Support for multiple serving types: Users gained the ability to deploy models trained with a variety of frameworks like Open Neural Network Exchange (ONNX), PyTorch, and TensorFlow.
  • Self-served platform through CI/CD pipelines: Data scientists could self-serve and independently manage their model serving applications through CI/CD pipelines.
  • New components: We introduced these new components to support the self-serving approach:
    • Catwalk proxy, a managed reverse proxy to various serving types.
    • Catwalk transformer, a low-code component to transform input and output data.
    • Amphawa, a feature fetching component to augment model inputs.

API request flow

The Catwalk proxy acts as the orchestration layer. Clients send requests to the Catwalk proxy, which then orchestrates calls to different components such as transformers, the feature store, and so on. A typical end-to-end request flow is illustrated below.
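In code form, this kind of orchestration might look roughly like the following Python sketch. Everything here is hypothetical: the function and parameter names are ours for illustration, and the order of the calls (feature fetch, input transform, model call, output transform) is an assumption based on the component descriptions above, not Catwalk's actual implementation.

```python
def handle_request(request, fetch_features, transform_input, predict, transform_output):
    """Orchestrate one prediction request across pluggable components."""
    # Augment the request with extra features, e.g. from a feature store.
    features = fetch_features(request)
    # Convert the combined payload into the format the model expects.
    model_input = transform_input({**request, **features})
    # Call the model server for the chosen serving type.
    raw_prediction = predict(model_input)
    # Shape the raw model output into the response returned to the client.
    return transform_output(raw_prediction)


if __name__ == "__main__":
    response = handle_request(
        {"user_id": 7},
        fetch_features=lambda req: {"trips": 3},
        transform_input=lambda data: [data["user_id"], data["trips"]],
        predict=lambda values: sum(values),
        transform_output=lambda score: {"score": score},
    )
    print(response)
```

Passing the components in as callables mirrors the idea that the proxy itself stays generic while transformers and serving types vary per application.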


Within a year of implementing these changes, the number of models on the platform increased from 8 to 300, demonstrating the success of this approach. However, new challenges emerged:

  • Complexity of maintaining the Helm chart: As the platform continued to grow with new components and functionalities, maintaining the Helm chart became increasingly complex. Readability and flow control became more challenging, making the Helm chart update process prone to errors.
  • Process-level mistakes: The self-serving approach led to errors such as pushing empty or incompatible models to production, setting too few replicas, or allocating insufficient resources, which resulted in service crashes.

We knew that our work was nowhere near done. We had to keep iterating and explore ways to address the new challenges.

Phase 3: Replacing Helm charts with Kubernetes CRDs

To tackle the deployment challenges and gain more control, we made the significant decision to replace Helm charts with Kubernetes Custom Resource Definitions (CRDs). This required substantial engineering effort, but the outcomes have been rewarding. This transition gave us improved control over deployment pipelines, enabling customisations such as:

  • Smart defaults for AutoML
  • Blue-green deployments
  • Capacity management
  • Advanced scaling
  • Application set groupings

Below is an example of a simple model serving CRD manifest:

apiVersion: ml.catwalk.kubebuilder.io/v1
kind: ModelServing
spec:
  hpa:
    desired: 1
    max: 1
    min: 1
  modelMeta:
    modelName: "my-model"
    modelOwner: john.doe
  proxyLayer:
    enableLogging: true
    logHTTPBody: true
  servingLayer:
    servingType: "tensorflow-serving"
    version: "20"
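Having the deployment described as structured data also makes it straightforward to sanity-check before it is applied. As an illustration only, the Python sketch below checks the hpa section of a spec that has already been parsed into a dictionary. The rule that min, desired, and max must be positive and ordered is our assumed example of such a check, not a documented Catwalk constraint.

```python
def validate_hpa(spec):
    """Return a list of problems found in the `hpa` section of a ModelServing spec."""
    hpa = spec.get("hpa", {})
    problems = []
    for key in ("min", "desired", "max"):
        # Each replica count must be present and a positive integer.
        if not isinstance(hpa.get(key), int) or hpa[key] < 1:
            problems.append(f"hpa.{key} must be a positive integer")
    # Only compare the values once all three are known to be valid numbers.
    if not problems and not hpa["min"] <= hpa["desired"] <= hpa["max"]:
        problems.append("expected hpa.min <= hpa.desired <= hpa.max")
    return problems


if __name__ == "__main__":
    # The hpa values from the manifest above pass the check.
    print(validate_hpa({"hpa": {"desired": 1, "max": 1, "min": 1}}))
```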

Model serving CRD deployment state machine

Every model serving CRD submission follows a sequence of steps. If there are failures at any step, the controller keeps retrying after small intervals. The major steps on the deployment cycle are described below:

  1. Validate whether the new CRD specs are acceptable. Along with sanity checks, we also enforce a lot of platform constraints through this step.
  2. Clean up previous non-ready deployment resources. Sometimes a deployment submission might keep crashing and hence it doesn’t proceed to a ready state. On every submission, it’s important to check and clean up such previous deployments.
  3. Create resources for the new deployment and ensure that the new deployment is ready.
  4. Switch traffic from old deployment to the new deployment.
  5. Clean up resources for old deployment. At this point, traffic is already being served by the new deployment resources. So, we can clean up the old deployment.
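The deployment cycle above can be pictured as a list of steps that are run in order, with a failing step retried after a short pause. The Python sketch below captures that shape. It is a simplified illustration, not the controller's real code: the step names paraphrase the list above, and the max_attempts cap is an assumption added so that the example terminates.

```python
import time

# Step names paraphrased from the deployment cycle described above.
STEP_NAMES = [
    "validate_spec",
    "cleanup_non_ready_deployments",
    "create_new_deployment",
    "switch_traffic",
    "cleanup_old_deployment",
]


def reconcile(steps, max_attempts=5, retry_delay=0.0):
    """Run (name, callable) steps in order, retrying a failing step after a delay.

    Returns the names of the completed steps; re-raises the last error if a
    step still fails after `max_attempts` tries.
    """
    completed = []
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step()
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(retry_delay)
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder steps that always succeed, just to show the ordering.
    print(reconcile([(name, lambda: None) for name in STEP_NAMES]))
```

Because a later step only runs after the earlier one has succeeded, traffic is never switched to a deployment that has not become ready.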

Phase 4: Transition to a high-code, self-served, process-managed platform

As the number of model serving applications and use cases multiplied, clients sought greater control over orchestrations between different models, experiment executions, traffic shadowing, and responses archiving. To cater to these needs, we introduced several changes and components with the Catwalk Orchestrator, a high code orchestration solution, leading the pack.

Catwalk orchestrator

The Catwalk Orchestrator is a highly abstracted framework for building ML applications that replaced the catwalk-proxy from previous phases. The key difference is that users can now write their own business/orchestration logic. The orchestrator offers a range of utilities, reducing the need for users to write extensive boilerplate code. Key components of the Catwalk Orchestrator include HTTP server, gRPC server, clients for different model serving flavours (TensorFlow, ONNX, PyTorch, etc), client for fetching features from the feature bank, and utilities for logging, metrics, and data lake ingestion.

The Catwalk Orchestrator is designed to streamline the deployment of machine learning models. Here’s a typical user journey:

  1. Scaffold a model serving application: Users begin by scaffolding a model serving application using a command-line tool.
  2. Write business logic: Users then write the business logic for the application.
  3. Deploy to staging: The application is then deployed to a staging environment for testing.
  4. Complete load testing: Users test the application in the staging environment and complete load testing to ensure it can handle the expected traffic.
  5. Deploy to production: Once testing is completed, the application is deployed to the production environment.

Bundled deployments

To support multiple ML models as part of a single model serving application, we introduced the concept of bundled deployments. Multiple Kubernetes deployments are bundled together as a single model serving application deployment, allowing each component (e.g., models, catwalk-orchestrator, etc) to have its own Kubernetes deployment and to scale independently.


In addition to the major developments, we implemented other changes to enhance our platform’s efficiency. We made load testing mandatory for all ML application updates to ensure robust performance. This testing process was streamlined with a single command that runs the load test in the staging environment, with the results directly shared with the user.

Furthermore, we boosted deployment transparency by sharing deployment details through Slack and Datadog. This empowered users to diagnose issues independently, reducing the dependency on on-call support. This transparency not only improved our issue resolution times but also enhanced user confidence in our platform.

The results of these changes speak for themselves. The Catwalk Orchestrator has evolved into our flagship product. In just two years, we have deployed 200 Catwalk Orchestrators serving approximately 1,400 ML models.

What’s next?

As we continue to innovate and enhance our model serving platform, we are venturing into new territories:

  • Catwalk serverless: We aim to further abstract the model serving experience, making it even more user-friendly and efficient.
  • Catwalk data serving: We are looking to extend Catwalk’s capabilities to serve data online, providing a more comprehensive service.
  • LLM serving: In line with the trend towards generative AI and large language models (LLMs), we’re pivoting Catwalk to support these developments, ensuring we stay at the forefront of the AI and machine learning field.

Stay tuned as we continue to advance our technology and bring these exciting developments to life.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 700 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Debugging a FUSE deadlock in the Linux kernel

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/debugging-a-fuse-deadlock-in-the-linux-kernel-c75cd7989b6d

Tycho Andersen

The Compute team at Netflix is charged with managing all AWS and containerized workloads at Netflix, including autoscaling, deployment of containers, issue remediation, etc. As part of this team, I work on fixing strange things that users report.

This particular issue involved a custom internal FUSE filesystem: ndrive. It had been festering for some time, but needed someone to sit down and look at it in anger. This blog post describes how I poked at /proc to get a sense of what was going on, before posting the issue to the kernel mailing list and getting schooled on how the kernel’s wait code actually works!

Symptom: Stuck Docker Kill & A Zombie Process

We had a stuck docker API call:

goroutine 146 [select, 8817 minutes]:
net/http.(*persistConn).roundTrip(0xc000658fc0, 0xc0003fc080, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/transport.go:2610 +0x765
net/http.(*Transport).roundTrip(0xc000420140, 0xc000966200, 0x30, 0x1366f20, 0x162)
/usr/local/go/src/net/http/transport.go:592 +0xacb
net/http.(*Transport).RoundTrip(0xc000420140, 0xc000966200, 0xc000420140, 0x0, 0x0)
/usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc000966200, 0x161eba0, 0xc000420140, 0x0, 0x0, 0x0, 0xc00000e050, 0x3, 0x1, 0x0)
/usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0, 0xc00000e050, 0x0, 0x1, 0x10000168e)
/usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
/usr/local/go/src/net/http/client.go:585
golang.org/x/net/context/ctxhttp.Do(0x163bd48, 0xc000044090, 0xc000438480, 0xc000966100, 0x0, 0x0, 0x0)
/go/pkg/mod/golang.org/x/[email protected]/context/ctxhttp/ctxhttp.go:27 +0x10f
github.com/docker/docker/client.(*Client).doRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc000966100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/github.com/moby/[email protected]/client/request.go:132 +0xbe
github.com/docker/docker/client.(*Client).sendRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0x13d8643, 0x3, 0xc00079a720, 0x51, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/github.com/moby/[email protected]/client/request.go:122 +0x156
github.com/docker/docker/client.(*Client).get(...)
/go/pkg/mod/github.com/moby/[email protected]/client/request.go:37
github.com/docker/docker/client.(*Client).ContainerInspect(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc0006a01c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/github.com/moby/[email protected]/client/container_inspect.go:18 +0x128
github.com/Netflix/titus-executor/executor/runtime/docker.(*DockerRuntime).Kill(0xc000215180, 0x163bdb8, 0xc000938600, 0x1, 0x0, 0x0)
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runtime/docker/docker.go:2835 +0x310
github.com/Netflix/titus-executor/executor/runner.(*Runner).doShutdown(0xc000432dc0, 0x163bd10, 0xc000938390, 0x1, 0xc000b821e0, 0x1d, 0xc0005e4710)
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:326 +0x4f4
github.com/Netflix/titus-executor/executor/runner.(*Runner).startRunner(0xc000432dc0, 0x163bdb8, 0xc00071e0c0, 0xc0a502e28c08b488, 0x24572b8, 0x1df5980)
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:122 +0x391
created by github.com/Netflix/titus-executor/executor/runner.StartTaskWithRuntime
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:81 +0x411

Here, our management engine has made an HTTP call to the Docker API’s unix socket asking it to kill a container. Our containers are configured to be killed via SIGKILL. But this is strange. kill(SIGKILL) should be relatively fatal, so what is the container doing?

$ docker exec -it 6643cd073492 bash
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown

Hmm. Seems like it’s alive, but setns(2) fails. Why would that be? If we look at the process tree via ps awwfux, we see:

\_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/6643cd073492ba9166100ed30dbe389ff1caef0dc3d35
| \_ [docker-init]
| \_ [ndrive] <defunct>

Ok, so the container’s init process is still alive, but it has one zombie child. What could the container’s init process possibly be doing?

# cat /proc/1528591/stack
[<0>] do_wait+0x156/0x2f0
[<0>] kernel_wait4+0x8d/0x140
[<0>] zap_pid_ns_processes+0x104/0x180
[<0>] do_exit+0xa41/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] do_syscall_64+0x37/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

It is in the process of exiting, but it seems stuck. The only child is the ndrive process in Z (i.e. “zombie”) state, though. Zombies are processes that have successfully exited, and are waiting to be reaped by a corresponding wait() syscall from their parents. So how could the kernel be stuck waiting on a zombie?

# ls /proc/1544450/task
1544450 1544574

Ah ha, there are two threads in the thread group. One of them is a zombie, maybe the other one isn’t:

# cat /proc/1544574/stack
[<0>] request_wait_answer+0x12f/0x210
[<0>] fuse_simple_request+0x109/0x2c0
[<0>] fuse_flush+0x16f/0x1b0
[<0>] filp_close+0x27/0x70
[<0>] put_files_struct+0x6b/0xc0
[<0>] do_exit+0x360/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] get_signal+0x140/0x870
[<0>] arch_do_signal_or_restart+0xae/0x7c0
[<0>] exit_to_user_mode_prepare+0x10f/0x1c0
[<0>] syscall_exit_to_user_mode+0x26/0x40
[<0>] do_syscall_64+0x46/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Indeed it is not a zombie. It is trying to become one as hard as it can, but it’s blocking inside FUSE for some reason. To find out why, let’s look at some kernel code. If we look at zap_pid_ns_processes(), it does:

/*
 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
 * kernel_wait4() will also block until our children traced from the
 * parent namespace are detached and become EXIT_DEAD.
 */
do {
    clear_thread_flag(TIF_SIGPENDING);
    rc = kernel_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);

which is where we are stuck, but before that, it has done:

/* Don't allow any more processes into the pid namespace */
disable_pid_allocation(pid_ns);

which is why docker can’t setns() — the namespace is a zombie. Ok, so we can’t setns(2), but why are we stuck in kernel_wait4()? To understand why, let’s look at what the other thread was doing in FUSE’s request_wait_answer():

/*
* Either request is already in userspace, or it was forced.
* Wait it out.
*/
wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));

Ok, so we’re waiting for an event (in this case, that userspace has replied to the FUSE flush request). But zap_pid_ns_processes() sent a SIGKILL! SIGKILL should be very fatal to a process. If we look at the process, we can indeed see that there’s a pending SIGKILL:

# grep Pnd /proc/1544574/status
SigPnd: 0000000000000000
ShdPnd: 0000000000000100

Viewing process status this way, you can see 0x100 (i.e. the 9th bit is set) under ShdPnd, which corresponds to signal number 9, SIGKILL. Pending signals are signals that have been generated by the kernel, but have not yet been delivered to userspace. Signals are only delivered at certain times, for example when entering or leaving a syscall, or when waiting on events. If the kernel is currently doing something on behalf of the task, the signal may be pending. Signals can also be blocked by a task, so that they are never delivered. Blocked signals will show up in their respective pending sets as well. However, man 7 signal says: “The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.” But here the kernel is telling us that we have a pending SIGKILL, i.e. that it is being ignored even while the task is waiting!
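
For readers who want to decode these masks without counting bits by hand, here is a minimal sketch. The helper name pending_signals is made up for this example; the only assumption is the usual layout of the SigPnd/ShdPnd fields, where bit 0 stands for signal 1.

```python
def pending_signals(mask_hex):
    """Return the signal numbers set in a SigPnd/ShdPnd-style hex mask."""
    mask = int(mask_hex, 16)
    # Bit 0 corresponds to signal 1, bit 1 to signal 2, and so on.
    return [bit + 1 for bit in range(mask.bit_length()) if mask & (1 << bit)]


print(pending_signals("0000000000000100"))  # the ShdPnd value above: [9]
```

Signal 9 is SIGKILL on Linux, which matches the reading of the mask given above.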

Red Herring: How do Signals Work?

Well that is weird. The wait code (i.e. include/linux/wait.h) is used everywhere in the kernel: semaphores, wait queues, completions, etc. Surely it knows to look for SIGKILLs. So what does wait_event() actually do? Digging through the macro expansions and wrappers, the meat of it is:

#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd) \
({ \
    __label__ __out; \
    struct wait_queue_entry __wq_entry; \
    long __ret = ret; /* explicit shadow */ \
    \
    init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0); \
    for (;;) { \
        long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state); \
        \
        if (condition) \
            break; \
        \
        if (___wait_is_interruptible(state) && __int) { \
            __ret = __int; \
            goto __out; \
        } \
        \
        cmd; \
    } \
    finish_wait(&wq_head, &__wq_entry); \
__out: __ret; \
})

So it loops forever, doing prepare_to_wait_event(), checking the condition, then checking to see if we need to interrupt. Then it does cmd, which in this case is schedule(), i.e. “do something else for a while”. prepare_to_wait_event() looks like:

long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
    unsigned long flags;
    long ret = 0;

    spin_lock_irqsave(&wq_head->lock, flags);
    if (signal_pending_state(state, current)) {
        /*
         * Exclusive waiter must not fail if it was selected by wakeup,
         * it should "consume" the condition we were waiting for.
         *
         * The caller will recheck the condition and return success if
         * we were already woken up, we can not miss the event because
         * wakeup locks/unlocks the same wq_head->lock.
         *
         * But we need to ensure that set-condition + wakeup after that
         * can't see us, it should wake up another exclusive waiter if
         * we fail.
         */
        list_del_init(&wq_entry->entry);
        ret = -ERESTARTSYS;
    } else {
        if (list_empty(&wq_entry->entry)) {
            if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
                __add_wait_queue_entry_tail(wq_head, wq_entry);
            else
                __add_wait_queue(wq_head, wq_entry);
        }
        set_current_state(state);
    }
    spin_unlock_irqrestore(&wq_head->lock, flags);

    return ret;
}
EXPORT_SYMBOL(prepare_to_wait_event);

It looks like the only way we can break out of this with a non-zero exit code is if signal_pending_state() is true. Since our call site was just wait_event(), we know that state here is TASK_UNINTERRUPTIBLE; the definition of signal_pending_state() looks like:

static inline int signal_pending_state(unsigned int state, struct task_struct *p)
{
    if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
        return 0;
    if (!signal_pending(p))
        return 0;

    return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}

Our task is not interruptible, so the first if fails. Our task should have a signal pending, though, right?

static inline int signal_pending(struct task_struct *p)
{
    /*
     * TIF_NOTIFY_SIGNAL isn't really a signal, but it requires the same
     * behavior in terms of ensuring that we break out of wait loops
     * so that notify signal callbacks can be processed.
     */
    if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
        return 1;
    return task_sigpending(p);
}

As the comment notes, TIF_NOTIFY_SIGNAL isn’t relevant here, in spite of its name, but let’s look at task_sigpending():

static inline int task_sigpending(struct task_struct *p)
{
    return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

Hmm. Seems like we should have that flag set, right? To figure that out, let’s look at how signal delivery works. When we’re shutting down the pid namespace in zap_pid_ns_processes(), it does:

group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);

which eventually gets to __send_signal_locked(), which has:

pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
...
sigaddset(&pending->signal, sig);
...
complete_signal(sig, t, type);

Using PIDTYPE_MAX here as the type is a little weird, but it roughly indicates “this is very privileged kernel stuff sending this signal, you should definitely deliver it”. There is a bit of unintended consequence here, though, in that __send_signal_locked() ends up sending the SIGKILL to the shared set, instead of the individual task’s set. If we look at the __fatal_signal_pending() code, we see:

static inline int __fatal_signal_pending(struct task_struct *p)
{
    return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

But it turns out this is a bit of a red herring (although it took a while for me to understand that).
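
The checks walked through in this section can be condensed into a small Python model, which makes it easier to try different wait states. It is a sketch, not kernel code: the Task class is invented for the example, TIF_NOTIFY_SIGNAL is left out, and the flag values are what I believe include/linux/sched.h uses but should be treated as illustrative.

```python
# Illustrative task-state flags (assumed values, see the note above).
TASK_INTERRUPTIBLE = 0x0001
TASK_UNINTERRUPTIBLE = 0x0002
TASK_WAKEKILL = 0x0100
TASK_KILLABLE = TASK_WAKEKILL | TASK_UNINTERRUPTIBLE
SIGKILL = 9


class Task:
    """Hypothetical stand-in for the parts of task_struct used here."""

    def __init__(self, sigpending=False, pending=(), shared_pending=()):
        self.sigpending = sigpending               # models TIF_SIGPENDING
        self.pending = set(pending)                # per-task pending set
        self.shared_pending = set(shared_pending)  # thread-group pending set


def fatal_signal_pending(task):
    # Mirrors __fatal_signal_pending(): only the per-task set is consulted.
    return SIGKILL in task.pending


def signal_pending_state(state, task):
    # A wait that is neither interruptible nor killable never reports a signal.
    if not state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL):
        return False
    if not task.sigpending:
        return False
    return bool(state & TASK_INTERRUPTIBLE) or fatal_signal_pending(task)
```

In this model, a wait in TASK_UNINTERRUPTIBLE returns False no matter which signals are pending, while a TASK_KILLABLE wait returns True only when SIGKILL is in the per-task set.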

How Signals Actually Get Delivered To a Process

To understand what’s really going on here, we need to look at complete_signal(), since it unconditionally adds a SIGKILL to the task’s pending set:

sigaddset(&t->pending.signal, SIGKILL);

but why doesn’t it work? At the top of the function we have:

/*
 * Now find a thread we can wake up to take the signal off the queue.
 *
 * If the main thread wants the signal, it gets first crack.
 * Probably the least surprising to the average bear.
 */
if (wants_signal(sig, p))
    t = p;
else if ((type == PIDTYPE_PID) || thread_group_empty(p))
    /*
     * There is just one thread and it does not need to be woken.
     * It will dequeue unblocked signals before it runs again.
     */
    return;

but as Eric Biederman described, basically every thread can handle a SIGKILL at any time. Here’s wants_signal():

static inline bool wants_signal(int sig, struct task_struct *p)
{
    if (sigismember(&p->blocked, sig))
        return false;

    if (p->flags & PF_EXITING)
        return false;

    if (sig == SIGKILL)
        return true;

    if (task_is_stopped_or_traced(p))
        return false;

    return task_curr(p) || !task_sigpending(p);
}

So… if a thread is already exiting (i.e. it has PF_EXITING), it doesn’t want a signal. Consider the following sequence of events:

1. a task opens a FUSE file, and doesn’t close it, then exits. During that exit, the kernel dutifully calls do_exit(), which does the following:

exit_signals(tsk); /* sets PF_EXITING */

2. do_exit() continues on to exit_files(tsk);, which flushes all files that are still open, resulting in the stack trace above.

3. the pid namespace exits, and enters zap_pid_ns_processes(), sends a SIGKILL to everyone (that it expects to be fatal), and then waits for everyone to exit.

4. this kills the FUSE daemon in the pid ns so it can never respond.

5. complete_signal() for the FUSE task that was already exiting ignores the signal, since it has PF_EXITING.

6. Deadlock. Without manually aborting the FUSE connection, things will hang forever.
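
The role PF_EXITING plays in step 5 can be seen in a direct Python transcription of wants_signal(). This is a model for experimentation only: the Task class and its fields are invented for the example, and the PF_EXITING value is an assumption.

```python
PF_EXITING = 0x00000004  # assumed value; only used as a marker bit here
SIGKILL = 9
SIGTERM = 15


class Task:
    """Hypothetical stand-in for the task_struct fields wants_signal() reads."""

    def __init__(self, flags=0, blocked=(), stopped_or_traced=False,
                 running=False, sigpending=False):
        self.flags = flags
        self.blocked = set(blocked)
        self.stopped_or_traced = stopped_or_traced
        self.running = running        # models task_curr(p)
        self.sigpending = sigpending  # models task_sigpending(p)


def wants_signal(sig, task):
    if sig in task.blocked:
        return False
    # The exiting check comes before the SIGKILL special case, so a task
    # that has already started do_exit() declines even SIGKILL.
    if task.flags & PF_EXITING:
        return False
    if sig == SIGKILL:
        return True
    if task.stopped_or_traced:
        return False
    return task.running or not task.sigpending


print(wants_signal(SIGKILL, Task()))                  # True
print(wants_signal(SIGKILL, Task(flags=PF_EXITING)))  # False
```

The second call corresponds to the thread stuck in fuse_flush(): it set PF_EXITING before it blocked, so the later SIGKILL is not handed to it.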

Solution: don’t wait!

It doesn’t really make sense to wait for flushes in this case: the task is dying, so there’s nobody to tell the return code of flush() to. It also turns out that this bug can happen with several filesystems (anything that calls the kernel’s wait code in flush(), i.e. basically anything that talks to something outside the local kernel).

Individual filesystems will need to be patched in the meantime, for example the fix for FUSE is here, which was released on April 23 in Linux 6.3.

While this blog post addresses FUSE deadlocks, there are definitely issues in the nfs code and elsewhere, which we have not hit in production yet, but almost certainly will. You can also see it as a symptom of other filesystem bugs. Something to look out for if you have a pid namespace that won’t exit.

This is just a small taste of the variety of strange issues we encounter running containers at scale at Netflix. Our team is hiring, so please reach out if you also love red herrings and kernel deadlocks!


Debugging a FUSE deadlock in the Linux kernel was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to run AWS CloudHSM workloads in container environments

Post Syndicated from Derek Tumulak original https://aws.amazon.com/blogs/security/how-to-run-aws-cloudhsm-workloads-on-docker-containers/

January 25, 2023: We updated this post to reflect the fact that CloudHSM SDK3 does not support serverless environments and we strongly recommend deploying SDK5.


AWS CloudHSM provides hardware security modules (HSMs) in the AWS Cloud. With CloudHSM, you can generate and use your own encryption keys in the AWS Cloud, and manage your keys by using FIPS 140-2 Level 3 validated HSMs. Your HSMs are part of a CloudHSM cluster. CloudHSM automatically manages synchronization, high availability, and failover within a cluster.

CloudHSM is part of the AWS Cryptography suite of services, which also includes AWS Key Management Service (AWS KMS), AWS Secrets Manager, and AWS Private Certificate Authority (AWS Private CA). AWS KMS, Secrets Manager, and AWS Private CA are fully managed services that are convenient to use and integrate. You’ll generally use CloudHSM only if your workload requires single-tenant HSMs under your own control, or if you need cryptographic algorithms or interfaces that aren’t available in the fully managed alternatives.

CloudHSM offers several options for you to connect your application to your HSMs, including PKCS#11, Java Cryptography Extensions (JCE), OpenSSL Dynamic Engine, or Microsoft Cryptography API: Next Generation (CNG). Regardless of which library you choose, you’ll use the CloudHSM client to connect to HSMs in your cluster.

In this blog post, I’ll show you how to use Docker to develop, deploy, and run applications by using the CloudHSM SDK, and how to manage and orchestrate workloads by using tools and services like Amazon Elastic Container Service (Amazon ECS), Kubernetes, Amazon Elastic Kubernetes Service (Amazon EKS), and Jenkins.

Solution overview

This solution demonstrates how to create a Docker container that uses the CloudHSM JCE SDK to generate a key and use it to encrypt and decrypt data.

Note: In this example, you must manually enter the crypto user (CU) credentials as environment variables when you run the container. For production workloads, you’ll need to consider how to secure and automate the handling and distribution of these credentials. You should work with your security or compliance officer to ensure that you’re using an appropriate method of securing HSM login credentials. For more information on securing credentials, see AWS Secrets Manager.
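
As a small illustration of the point about credential handling, the sketch below builds a docker run command that names the HSM_USER and HSM_PASSWORD variables without embedding their values. It relies on Docker's documented behavior that --env with a bare variable name copies the value from the calling environment. The helper name docker_run_argv is hypothetical, and jce_sample is the image name used later in this post.

```python
import os


def docker_run_argv(image, env_names, environ=None):
    """Build a `docker run` argv that forwards variables by name only."""
    environ = os.environ if environ is None else environ
    argv = ["docker", "run"]
    for name in env_names:
        if name not in environ:
            raise KeyError(f"{name} is not set in the calling environment")
        # A bare name makes Docker copy the value from our environment,
        # so the secret never appears in the argument list.
        argv += ["--env", name]
    argv.append(image)
    return argv
```

The resulting list can be passed to subprocess.run(argv, check=True). This keeps the password out of the command line and shell history, but the value is still visible to anyone who can inspect the container, so it is not a replacement for a proper secrets solution.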

Figure 1 shows the solution architecture. The Java application, running in a Docker container, integrates with JCE and communicates with CloudHSM instances in a CloudHSM cluster through HSM elastic network interfaces (ENIs). The Docker container runs in an EC2 instance, and access to the HSM ENIs is controlled with a security group.

Figure 1: Architecture diagram

Prerequisites

To implement this solution, you need to have working knowledge of the following items:

  • CloudHSM
  • Docker 20.10.17 – used at the time of this post
  • Java 8 or Java 11 – supported at the time of this post
  • Maven 3.05 – used at the time of this post

Here’s what you’ll need to follow along with my example:

  1. An active CloudHSM cluster with at least one active HSM instance. You can follow the CloudHSM getting started guide to create, initialize, and activate a CloudHSM cluster.

    Note: For a production cluster, you should have at least two active HSM instances spread across Availability Zones in the Region.

  2. An Amazon Linux 2 EC2 instance in the same virtual private cloud (VPC) in which you created your CloudHSM cluster. The Amazon Elastic Compute Cloud (Amazon EC2) instance must have the CloudHSM cluster security group attached—this security group is automatically created during the cluster initialization and is used to control network access to the HSMs. To learn about attaching security groups to allow EC2 instances to connect to your HSMs, see Create a cluster in the AWS CloudHSM User Guide.
  3. A CloudHSM crypto user (CU) account. You can create a CU by following the steps in the topic Managing HSM users in AWS CloudHSM in the AWS CloudHSM User Guide.

Solution details

In this section, I’ll walk you through how to download, configure, compile, and run a solution in Docker.

To set up Docker and run the application that encrypts and decrypts data with a key in AWS CloudHSM

  1. On your Amazon Linux EC2 instance, install Docker by running the following command.

    # sudo yum -y install docker

  2. Start the docker service.

    # sudo service docker start

  3. Create a new directory and move to it. In my example, I use a directory named cloudhsm_container. You’ll use the new directory to configure the Docker image.

    # mkdir cloudhsm_container
    # cd cloudhsm_container

  4. Copy the CloudHSM cluster’s trust anchor certificate (customerCA.crt) to the directory that you just created. You can find the trust anchor certificate on a working CloudHSM client instance under the path /opt/cloudhsm/etc/customerCA.crt. The certificate is created during initialization of the CloudHSM cluster and is required to connect to the CloudHSM cluster. This enables our application to validate that the certificate presented by the CloudHSM cluster was signed by our trust anchor certificate.
  5. In your new directory (cloudhsm_container), create a new file with the name run_sample.sh that includes the following contents. The script runs the Java class that is used to generate an Advanced Encryption Standard (AES) key to encrypt and decrypt your data.
    #! /bin/bash
    
    # start application
    echo -e "\n* Entering AES GCM encrypt/decrypt sample in Docker ... \n"
    
    java -ea -jar target/assembly/aesgcm-runner.jar -method environment
    
    echo -e "\n* Exiting AES GCM encrypt/decrypt sample in Docker ... \n"

  6. In the new directory, create another new file and name it Dockerfile (with no extension). This file will specify that the Docker image is built with the following components:
    • The CloudHSM client package.
    • The CloudHSM Java JCE package.
    • OpenJDK 1.8 (Java 8). This is needed to compile and run the Java classes and JAR files.
    • Maven, a build automation tool that is needed to assist with building the Java classes and JAR files.
    • The AWS CloudHSM Java JCE samples that will be downloaded and built as part of the solution.
  7. Cut and paste the following contents into Dockerfile.

    Note: You will need to customize your Dockerfile, as follows:

    • Make sure to specify the SDK version to replace the one specified in the pom.xml file in the sample code. As of the writing of this post, the most current version is 5.7.0. To find the SDK version, follow the steps in the topic Check your client SDK version. For more information, see the Building section in the README file for the Cloud HSM JCE examples.
    • Make sure to update the HSM_IP line with the IP of an HSM in your CloudHSM cluster. You can get your HSM IPs from the CloudHSM console, or by running the describe-clusters AWS CLI command.
      # Use the amazon linux image
      FROM amazonlinux:2

      # Pass HSM IP address as a build argument
      ARG HSM_IP

      # Install CloudHSM client
      RUN yum install -y https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-jce-latest.el7.x86_64.rpm

      # Install Java, Maven, wget, unzip and ncurses-compat-libs
      RUN yum install -y java maven wget unzip ncurses-compat-libs

      # Create a work dir
      WORKDIR /app

      # Download sample code
      RUN wget https://github.com/aws-samples/aws-cloudhsm-jce-examples/archive/refs/heads/sdk5.zip

      # unzip sample code
      RUN unzip sdk5.zip

      # Change to the created directory
      WORKDIR aws-cloudhsm-jce-examples-sdk5

      # Build JAR files using the installed CloudHSM JCE Provider version
      RUN export CLOUDHSM_CLIENT_VERSION=`rpm -qi cloudhsm-jce | awk -F': ' '/Version/ {print $2}'` \
          && mvn validate -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION \
          && mvn clean package -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION

      # Configure cloudhsm-client
      COPY customerCA.crt /opt/cloudhsm/etc/
      RUN /opt/cloudhsm/bin/configure-jce -a $HSM_IP

      # Copy the run_sample.sh script
      COPY run_sample.sh .

      # Run the script
      CMD ["bash","run_sample.sh"]

  8. Now you’re ready to build the Docker image. Run the following command, which builds the image from the Dockerfile that you created in step 6 and tags it with the name jce_sample.

    # sudo docker build --build-arg HSM_IP="<your HSM IP address>" -t jce_sample .

  9. To run a Docker container from the Docker image that you just created, run the following command. Make sure to replace the user and password with your actual CU username and password. (If you need help setting up your CU credentials, see prerequisite 3. For more information on how to provide CU credentials to the AWS CloudHSM Java JCE Library, see Providing credentials to the JCE provider in the CloudHSM User Guide).

    # sudo docker run --env HSM_USER=<user> --env HSM_PASSWORD=<password> jce_sample

    If successful, the output should look like this:

    	* Entering AES GCM encrypt/decrypt sample in Docker ... 
    
    	737F92D1B7346267D329C16E
    	Successful decryption
    
    	* Exiting AES GCM encrypt/decrypt sample in Docker ...
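
Step 7 asks for the IP address of an HSM, which can be read from the output of the describe-clusters AWS CLI command. The following sketch extracts the addresses of active HSMs from that JSON output. It assumes the response has the Clusters, ClusterId, Hsms, State, and EniIp fields of the CloudHSM DescribeClusters API; the function name, cluster ID, and addresses are made up for illustration.

```python
import json


def active_hsm_ips(describe_clusters_json, cluster_id):
    """Return the ENI IPs of ACTIVE HSMs in one cluster."""
    data = json.loads(describe_clusters_json)
    ips = []
    for cluster in data.get("Clusters", []):
        if cluster.get("ClusterId") != cluster_id:
            continue
        for hsm in cluster.get("Hsms", []):
            # Skip HSMs that are still being created or are being deleted.
            if hsm.get("State") == "ACTIVE" and "EniIp" in hsm:
                ips.append(hsm["EniIp"])
    return ips


# Made-up sample shaped like `aws cloudhsmv2 describe-clusters` output.
sample = json.dumps({
    "Clusters": [{
        "ClusterId": "cluster-example123",
        "Hsms": [
            {"HsmId": "hsm-a", "State": "ACTIVE", "EniIp": "10.0.1.10"},
            {"HsmId": "hsm-b", "State": "CREATE_IN_PROGRESS"},
        ],
    }]
})
print(active_hsm_ips(sample, "cluster-example123"))  # ['10.0.1.10']
```

Any one of the returned addresses can be supplied as the HSM_IP build argument in step 8.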

Conclusion

This solution provides an example of how to run CloudHSM client workloads in Docker containers. You can use the solution as a reference to implement your cryptographic application in a way that benefits from the high availability and load balancing built in to CloudHSM without compromising the flexibility that Docker provides for developing, deploying, and running applications.


Derek Tumulak

Derek joined AWS in May 2021 as a Principal Product Manager. He is a data protection and cybersecurity expert who is enthusiastic about assisting customers with a wide range of sophisticated use cases.

Monitoring Kubernetes with Zabbix

Post Syndicated from Michaela DeForest original https://blog.zabbix.com/monitoring-kubernetes-with-zabbix/25055/

There are many options available for monitoring Kubernetes and cloud-native applications. In this multi-part blog series, we’ll explore how to use Zabbix to monitor a Kubernetes cluster and understand the metrics generated within Zabbix. We’ll also learn how to exploit Prometheus endpoints exposed by applications to monitor application-specific metrics.

Want to see Kubernetes monitoring in action? Watch the step-by-step Zabbix Kubernetes monitoring configuration and deployment guide.

Why Choose Zabbix to Monitor Kubernetes?

Before choosing Zabbix as a Kubernetes monitoring tool, we asked ourselves, “Why would we choose to use Zabbix rather than Prometheus, Grafana, and Alertmanager?” After all, they have become the standard monitoring tools in the cloud ecosystem. We decided that our minimum criterion for Zabbix would be that it was just as effective as Prometheus for monitoring both Kubernetes and cloud-native applications.

Through our discovery process, we concluded that Zabbix meets (and exceeds) this minimum requirement. Zabbix provides metrics and triggers for Kubernetes similar to those of Prometheus, Alertmanager, and Grafana, because both approaches use the same backend tools to collect them. However, Zabbix does this in one product while still maintaining flexibility and allowing you to monitor pretty much anything you can write code to collect. Regarding application monitoring, Zabbix can transform Prometheus metrics fed to it by Prometheus exporters and endpoints. In addition, because Zabbix can make calls to any HTTP endpoint, it can also monitor applications that do not expose a dedicated Prometheus endpoint, which Prometheus itself cannot do.

The Zabbix Helm Chart

Zabbix monitors Kubernetes by collecting metrics exposed via the Kubernetes API and kube-state-metrics. The components necessary to monitor a cluster are installed within the cluster using this helm chart provided by Zabbix. The helm chart includes the Zabbix agent installed as a daemon set and is used to monitor local resources and applications on each node. A Zabbix proxy is also installed to collect monitoring data and transfer it to the external Zabbix server.

Only the Zabbix proxy needs access to the Zabbix server, while the agents can send data to the proxy installed in the same namespace as each agent. A cluster role allows Zabbix to access resources in the cluster via the Kubernetes API. While the cluster role could be modified to restrict privileges given to Zabbix, this will result in some items becoming unsupported. We recommend keeping this the same if you want to get the most out of Kubernetes monitoring with Zabbix.

The Zabbix helm chart installs the kube-state-metrics project as a dependency. You may already be familiar with this project under the Kubernetes organization, which generates Prometheus format metrics based on the current state of the Kubernetes resources. In addition, if you have experience using Prometheus to monitor a cluster, you may already have this installed. If that is the case, you can point to this deployment rather than installing another one.
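
To give a feel for what "Prometheus format metrics" look like, here is a minimal sketch that parses a single sample line of the kind kube-state-metrics exposes. It is a simplified reader written for this example, not the parser Zabbix uses: it ignores timestamps and does not unescape label values, and the pod and namespace names in the usage line are made up.

```python
import re

# metric_name{label="value",...} value
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank lines and HELP/TYPE comments carry no sample
    match = _SAMPLE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


print(parse_sample(
    'kube_pod_status_phase{namespace="default",pod="web-0",phase="Running"} 1'
))
```

Each sample is a metric name, an optional set of labels, and a numeric value, which is the structure a monitoring system has to pick apart before it can turn the data into items and triggers.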

In this tutorial, we will install kube-state-metrics via the Zabbix helm chart.

For more information on skipping this step, refer to the values file in the Zabbix Kubernetes helm chart.

Installing the Zabbix Helm Chart

Now that we’ve explained how the Zabbix helm chart works, let’s go ahead and install it. In this example, we will assume that you have a running Zabbix 6.0 (or higher) instance that is reachable from the cluster you wish to monitor. I am running a 6.0 instance in a different cluster than the one we want to monitor. The server is reachable via the DNS name mdeforest.zabbix.atsgroup.io with a non-standard port of 31103.

We will start by installing the latest Zabbix helm chart. I recommend visiting zabbix.com/integrations/kubernetes to get any sources that may be referred to in this tutorial. There you will find a link to the Zabbix helm chart and templates. For the most part, we will follow the steps outlined in the readme.

 

Using a terminal window, I am going to make sure the active cluster is set to the cluster that I want to monitor:

kubectl config use-context <cluster context name>

I’m then going to add the Zabbix chart repo to my local helm repository:

helm repo add zabbix-chart-6.0 https://cdn.zabbix.com/zabbix/integrations/kubernetes-helm/6.0/

If you’re running Zabbix 6.2 or newer, change the references to 6.0 in this command to 6.2.

Depending on your circumstances, you will need to set a few values for the installation. In most cases, you only need to set a few environment variables for the Zabbix agent and the proxy. The complete list of values and environment variables is available in the helm chart repo, alongside the agent and proxy images on Docker Hub.

In this case, I’m setting the passive server environment variable for the agent to allow any IP to connect. For the proxy, I am setting the server host accessible from the proxy alongside the non-standard port. I’ve also set here some variables related to cache size. These variables may depend on your cluster size, so you may need to play around with them to find the correct values.
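As a rough illustration of the settings described above, the following sketch writes a values file from the shell. The file name zabbix_values.yaml, the key layout (zabbixProxy.env and zabbixAgent.env as lists of name/value pairs), and the cache size are assumptions based on the environment variables of the Zabbix proxy and agent images, so compare them with the values.yaml shipped in your chart version before using them.

```shell
# Write an example values file for the Zabbix helm chart.
# ASSUMPTIONS: key layout, file name, and the 128M cache size are illustrative;
# verify them against the values.yaml of the chart version you install.
cat > zabbix_values.yaml <<'EOF'
zabbixProxy:
  env:
    - name: ZBX_SERVER_HOST        # Zabbix server as reachable from the proxy
      value: "mdeforest.zabbix.atsgroup.io"
    - name: ZBX_SERVER_PORT        # non-standard server port from this tutorial
      value: "31103"
    - name: ZBX_CACHESIZE          # tune to the size of your cluster
      value: "128M"
zabbixAgent:
  env:
    - name: ZBX_PASSIVESERVERS     # allow passive checks from any IP
      value: "0.0.0.0/0"
EOF

echo "wrote zabbix_values.yaml"
```

Helm replaces list values wholesale rather than merging them, so in practice it is safer to copy the complete env lists from the chart defaults and edit the entries you need.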

Now that I have the values file ready, I’m ready to install the chart. So, we’ll use the following command. Of course, the chart path might vary depending on what version of the chart you’re using.

helm install -f </path/to/values/file> [-n <namespace>] zabbix zabbix-chart-6.0/zabbix-helm-chart

You can also optionally add a namespace. You must wait until everything is running, so I’ll check just that with the following:

watch kubectl get pods

Now that everything is installed, we’re ready to set up hosts in Zabbix that will be associated with the cluster. The last step before we have all the information we need is to obtain the token created for the service account installed with the helm chart. We’ll get this by running the next command, where zabbix-service-account is the name of the service account that was created:

kubectl get secret zabbix-service-account [-n <namespace>] -o jsonpath={.data.token} | base64 -d

This will get the secret created for the service account and grab just the token from that, which is passed to the base64 utility to decode it. Be sure to copy that value somewhere because you’ll need it for later.

You’ll also need the Kubernetes API endpoint. In most cases, you’ll use the proxy installed rather than the server directly or a proxy outside the cluster. If this is the case, you can use the service DNS for the API. We should be able to reach it by pointing to https://kubernetes.default.svc.cluster.local:443/api.

If this is not the case, you can use the output from the command:

kubectl cluster-info

Now, let’s head over to the Zabbix UI. All the templates we need are shipped in Zabbix 6. If for some reason, you can’t find them, they are available for download and import by visiting the integrations page that I pointed out earlier on the Zabbix site.

Adding the Proxy

We will add our proxy by heading to Administration -> Proxies:

  1. Click Create Proxy. Because this is an active proxy by default, we only need to specify the proxy name. If you didn’t make any changes to the helm chart, this should default to zabbix-proxy. If you’d like to name this differently, you can change the environment variable ZBX_HOSTNAME for the proxy in the helm chart. We’re going to leave it as the default for now. Enter this name and then click “Add.” After a few minutes, the proxy list will show that the proxy has been seen.
  2. Create a Host Group to put hosts related to Kubernetes. For this example, let’s create one, which we’ll call Kubernetes.
  3. Head to the host page under configuration and click Create Host. The first host will collect metrics related to monitoring Kubernetes nodes, and we’ll discover nodes and create new hosts using Zabbix low-level discovery.
  4. Give this host the name Kubernetes Nodes. We’ll also assign this host to the Kubernetes host group we created and attach the template Kubernetes nodes by HTTP.
  5. Change the line “Monitored by proxy” to the proxy created earlier, called zabbix-proxy.
  6. Click the Macros tab and select “Inherited and host macros.” You should be able to see all the macros that may be set to influence what is monitored in your cluster. In this case, we need to change the first two macros. The first, {$KUBE.API.ENDPOINT.URL}, should be set to the Kubernetes API endpoint. In our case, we can set it to what I mentioned earlier: https://kubernetes.default.svc.cluster.local:443/api. Next, the token macro should be set to the value previously retrieved from the command line.
  7. Click Add. After a few minutes, you should start seeing data on the latest data page and new hosts on the host page representing each node.

Creating an Additional Host

Now let’s create another host that will represent the metrics available via the Kubernetes API and the kube-state-metrics endpoint.

  1. Click Create Host again, name this host Kubernetes Cluster State, and add it to the Kubernetes group again.
  2. Let’s also attach the Kubernetes Cluster State template by HTTP. Again, we’re going to choose the proxy that we created earlier.
  3. In the Macros section, change the kube.api.url macro to the same thing we used before, but this time leave off the /api at the end. Simply: https://kubernetes.default.svc.cluster.local:443. Be sure to set the token as we did before.
  4. Assuming nothing else was changed in the installation of the helm chart, we can now add that host.

After a few minutes, you should receive metrics related to the cluster state, including hosts representing the kubelet on each node.
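The two hosts use the same in-cluster API address and differ only in the /api suffix. A minimal sketch of that relationship, using hypothetical variable names, is:

```shell
# Base URL of the in-cluster Kubernetes API service used in this tutorial.
api_base="https://kubernetes.default.svc.cluster.local:443"

# "Kubernetes Nodes" host: the endpoint macro includes the /api suffix.
nodes_endpoint="${api_base}/api"

# "Kubernetes Cluster State" host: the same URL without /api.
cluster_state_url="${api_base}"

printf 'Nodes host macro value:         %s\n' "$nodes_endpoint"
printf 'Cluster state host macro value: %s\n' "$cluster_state_url"
```

If your proxy runs outside the cluster, replace api_base with the address reported by kubectl cluster-info.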

What’s Next?

Now you’re all set to start monitoring your Kubernetes cluster in Zabbix! Give it a try, and let us know your thoughts in the comments.

In the next blog post, we’ll look at what you can do with your newly monitored cluster and how to get the most out of it.

If you’d like help with any of this, ATS has advanced monitoring, orchestration, and automation skills to make this process a snap. Set up a 15-minute call with our team to go through any questions you have.

About the Author

Michaela DeForest is a Platform Engineer for The ATS Group.  She is a Zabbix Certified Specialist on Zabbix 6.0 with additional areas of expertise, including Terraform, Amazon Web Services (AWS), Ansible, and Kubernetes, to name a few.  As ATS’s resident authority in DevOps, Michaela is critical in delivering cutting-edge solutions that help businesses improve efficiency, reduce errors, and achieve a faster ROI.

About ATS Group: The ATS Group provides a fully inclusive set of technology services and tools designed to innovate and transform IT.  Their systems integration, business resiliency, cloud enablement, infrastructure intelligence, and managed services help businesses of all sizes “get IT done.” With over 20 years in business, ATS has become the trusted advisor to nearly 500 customers across multiple industries.  They have built their reputation around honesty, integrity, and technical expertise unrivaled by the competition.

Docker Container Monitoring With Zabbix

Post Syndicated from Dmitry Lambert original https://blog.zabbix.com/docker-container-monitoring-with-zabbix/20175/

In this blog post, I will cover Docker container monitoring with Zabbix. We will use the official Docker by Zabbix agent 2 template to make things as simple as possible. The template download link and configuration steps can be found on the Zabbix Integrations page. If you require a visual guide, I invite you to check out my video covering this topic.

Importing the official Docker template

Importing the Docker by Zabbix agent 2 template

Since we will be using the official Docker by Zabbix agent 2 template, first, we need to make sure that the template is actually available in our Zabbix instance. The template is available for Zabbix versions 5.0, 5.4, and 6.0. If you cannot find this template under Configuration – Templates, chances are that you haven’t imported it into your environment after upgrading Zabbix to one of the aforementioned versions. Remember that Zabbix does not modify or import any templates during the upgrade process, so we will have to import the template manually. If that is so, simply download the template from the official Zabbix git page (or use the link in the introduction) and import it into your Zabbix instance by using the Import button in the Configuration – Templates section.

Installing and configuring Zabbix agent 2

Before we get started with configuring our host, we first have to install Zabbix agent 2 and configure it according to the template guidelines. Follow the steps in the download section of the Zabbix website and install the zabbix-agent2 package. Feel free to use any other agent deployment method if you want to (such as compiling the agent from the source files).

Installing Zabbix agent2 from packages takes just a few simple steps:

Install the Zabbix repository package:

rpm -Uvh https://repo.zabbix.com/zabbix/6.0/rhel/8/x86_64/zabbix-release-6.0-1.el8.noarch.rpm

Install the Zabbix agent 2 package:

dnf install zabbix-agent2

Configure the Server parameter by populating it with your Zabbix server/proxy address

vi /etc/zabbix/zabbix_agent2.conf
### Option: Server
# List of comma delimited IP addresses, optionally in CIDR notation, or DNS names of Zabbix servers and Zabbix proxies.
# Incoming connections will be accepted only from the hosts listed here.
# If IPv6 support is enabled then '127.0.0.1', '::127.0.0.1', '::ffff:127.0.0.1' are treated equally
# and '::/0' will allow any IPv4 or IPv6 address.
# '0.0.0.0/0' can be used to allow any IPv4 address.
# Example: Server=127.0.0.1,192.168.1.0/24,::1,2001:db8::/32,zabbix.example.com
#
# Mandatory: yes, if StartAgents is not explicitly set to 0
# Default:
# Server=

Server=192.168.50.49
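If you prefer to script this change rather than edit the file by hand, the following sketch shows one way to do it. The function name set_agent_server is hypothetical; it replaces an active Server= line or appends one if none exists, and leaves the commented template lines alone.

```shell
# set_agent_server FILE ADDRESS
# Replace an active "Server=" line in FILE, or append one if none exists.
# Commented template lines such as "# Server=" are left untouched.
set_agent_server() {
  file="$1"
  addr="$2"
  if grep -q '^Server=' "$file"; then
    tmp=$(mktemp)
    # Write through a temp file, then copy back to keep the file's permissions.
    sed "s|^Server=.*|Server=${addr}|" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    printf 'Server=%s\n' "$addr" >> "$file"
  fi
}

# Example (run as root on the agent host):
# set_agent_server /etc/zabbix/zabbix_agent2.conf 192.168.50.49
```

The same approach works for other single-line parameters, provided the address does not contain the | character used as the sed delimiter.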

Plugin specific Zabbix agent 2 configuration

Zabbix agent 2 provides plugin-specific configuration parameters. Mostly these are optional parameters related to a specific plugin. You can find the full list of plugin-specific configuration parameters in the Zabbix documentation. In the newer versions of Zabbix agent 2, the plugin-specific parameters are defined in separate plugin configuration files, located in /etc/zabbix/zabbix_agent2.d/plugins.d/, while in older versions, they are defined directly in the zabbix_agent2.conf file.

For the Zabbix agent 2 Docker plugin, we have to provide the Docker daemon unix-socket location. This can be done by specifying the following plugin parameter:

### Option: Plugins.Docker.Endpoint
# Docker API endpoint.
#
# Mandatory: no
# Default: unix:///var/run/docker.sock
# Plugins.Docker.Endpoint=unix:///var/run/docker.sock

The default socket location will most likely be correct for your Docker environment. In that case, you can leave the configuration file as-is.

Once we have made the necessary changes in the Zabbix agent 2 configuration files, start and enable the agent:

systemctl enable zabbix-agent2 --now

Check if the Zabbix agent2 is running:

tail -f /var/log/zabbix/zabbix_agent2.log

Before we move on to the Zabbix frontend, I would like to draw your attention to the Docker socket file permissions: the zabbix user needs access to the Docker socket file. If it does not have access, you will see the following error messages, which can be resolved by adding the zabbix user to the docker group.

[Docker] cannot fetch data: Get http://1.28/info: dial unix /var/run/docker.sock: connect: permission denied
ZBX_NOTSUPPORTED: Cannot fetch data.

You can add the zabbix user to the Docker group by executing the following command:

usermod -aG docker zabbix
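To confirm that the group change took effect, a small check like the one below can help. The helper name in_group is hypothetical; it only relies on the standard id command.

```shell
# in_group USER GROUP: succeed if USER is a member of GROUP.
in_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx "$2"
}

# Example: confirm the change made by usermod.
if in_group zabbix docker; then
  echo "zabbix is in the docker group"
else
  echo "zabbix is NOT in the docker group"
fi
```

A running agent keeps the group list it started with, so restart it after changing the membership (systemctl restart zabbix-agent2) before checking the log again.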

Configuring the docker host

Configuring the host representing our Docker environment

After importing the template, we have to create a host which will represent our Docker instance. Give the host a name and assign it to a Host group – I will assign it to the Linux servers host group. Assign the Docker by Zabbix agent 2 template to the host. Since the template uses Zabbix agent 2 to collect the metrics, we also have to add an agent interface on this host. The address of the interface should point to the machine running your Docker containers. Finish up the host configuration by clicking the Add button.

Docker by Zabbix agent 2 template

Regular docker template items

The template contains a set of regular items for the general Docker instance metrics, such as the number of available images, Docker architecture information, the total number of containers, and more.

Docker template Low-level discovery rules

On top of that, the template also gathers container and image-specific information by using low-level discovery rules.

Once Zabbix discovers your containers and images, these low-level discovery rules will then be used to create items, triggers, and graphs from prototypes for each of your containers and images. This way, we can monitor container or image-specific metrics, such as container memory, network information, container status, and more.

Docker templates Low-level discovery item prototypes

Verifying the host and template configuration

To verify that the agent and the host are configured correctly, we can use Zabbix get command-line tool and try to poll our agent. If you haven’t installed Zabbix get, do so on your Zabbix server or Zabbix proxy host:

dnf install zabbix-get

Now we can use zabbix_get to verify that our agent can obtain the Docker-related metrics. Execute the following command:

zabbix_get -s docker-host -k docker.info

Use the -s parameter to specify your agent host’s host name or IP address. The -k parameter specifies the item key for which we wish to obtain the metrics by polling the agent with Zabbix get.

zabbix_get -s 192.168.50.141 -k docker.info

{"Id":"SJYT:SATE:7XZE:7GEC:XFUD:KZO5:NYFI:L7M5:4RGO:P2KX:QJFD:TAVY","Containers":2,"ContainersRunning":2,"ContainersPaused":0,"ContainersStopped":0,"Images":2,"Driver":"overlay2","MemoryLimit":true,"SwapLimit":true,"KernelMemory":true,"KernelMemoryTCP":true,"CpuCfsPeriod":true,"CpuCfsQuota":true,"CPUShares":true,"CPUSet":true,"PidsLimit":true,"IPv4Forwarding":true,"BridgeNfIptables":true,"BridgeNfIP6tables":true,"Debug":false,"NFd":39,"OomKillDisable":true,"NGoroutines":43,"LoggingDriver":"json-file","CgroupDriver":"cgroupfs","NEventsListener":0,"KernelVersion":"5.4.17-2136.300.7.el8uek.x86_64","OperatingSystem":"Oracle Linux Server 8.5","OSVersion":"8.5","OSType":"linux","Architecture":"x86_64","IndexServerAddress":"https://index.docker.io/v1/","NCPU":1,"MemTotal":1776848896,"DockerRootDir":"/var/lib/docker","Name":"localhost.localdomain","ExperimentalBuild":false,"ServerVersion":"20.10.14","ClusterStore":"","ClusterAdvertise":"","DefaultRuntime":"runc","LiveRestoreEnabled":false,"InitBinary":"docker-init","SecurityOptions":["name=seccomp,profile=default"],"Warnings":null}

In addition, we can also use the low-level discovery key – docker.containers.discovery[false] to check the result of the low-level discovery.

zabbix_get -s 192.168.50.141 -k docker.containers.discovery[false]

[{"{#ID}":"a1ad32f5ee680937806bba62a1aa37909a8a6663d8d3268db01edb1ac66a49e2","{#NAME}":"/apache-server"},{"{#ID}":"120d59f3c8b416aaeeba50378dee7ae1eb89cb7ffc6cc75afdfedb9bc8cae12e","{#NAME}":"/mysql-server"}]

We can see that Zabbix will discover and start monitoring two containers – apache-server and mysql-server. Any agent low-level discovery rule or item can be checked with Zabbix get.
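To pull just the container names out of the discovery output, a small filter is enough. The sketch below uses the JSON returned above as sample input; the function name container_names is hypothetical, and it deliberately avoids jq so that it runs with plain grep and sed.

```shell
# Sample discovery output, copied from the zabbix_get call above.
discovery='[{"{#ID}":"a1ad32f5ee680937806bba62a1aa37909a8a6663d8d3268db01edb1ac66a49e2","{#NAME}":"/apache-server"},{"{#ID}":"120d59f3c8b416aaeeba50378dee7ae1eb89cb7ffc6cc75afdfedb9bc8cae12e","{#NAME}":"/mysql-server"}]'

# container_names: read discovery JSON on stdin, print one container name per line.
container_names() {
  # Keep only the "{#NAME}":"..." pairs, then strip the key, the leading
  # slash Docker adds to container names, and the closing quote.
  grep -o '"{#NAME}":"[^"]*"' \
    | sed -e 's|^"{#NAME}":"/\{0,1\}||' -e 's|"$||'
}

printf '%s\n' "$discovery" | container_names
```

Against a live agent, the same filter can be fed directly from zabbix_get, for example: zabbix_get -s 192.168.50.141 -k docker.containers.discovery[false] | container_names.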

Docker template in action

Discovered items on our Docker host

Now that we have configured our agent and host, applied the Docker template, and verified that everything is working, we should be able to see the discovered entities in the frontend.

Collected Docker container metrics

In addition, our metrics should have also started coming in. We can check the Latest data section and verify that they are indeed getting collected.

Macros inherited from the Docker template

Lastly, we have a few additional options for further modifying the template and the results of our low-level discovery. If you open the Macros section of your host and select Inherited and host macros, you will notice that there are 4 macros inherited from the Docker template. These macros are responsible for filtering in/out the discovered containers and images. Feel free to modify these values if you wish to filter in/out the discovery of these entities as per your requirements.

Notice that the container discovery item also has one parameter, which is defined as false on the template:

  • docker.containers.discovery[false] – Discover only running containers
  • docker.containers.discovery[true] – Discover all containers, no matter their state.

And that’s it! We successfully imported the template, installed and configured Zabbix agent 2, created a host, and applied the Docker template. Finally – our Zabbix instance is now monitoring our Docker environment! If you have any other questions or comments, feel free to leave a response in the comments section of this post.

 

The post Docker Container Monitoring With Zabbix appeared first on Zabbix Blog.

Deploy and Manage Gitlab Runners on Amazon EC2

Post Syndicated from Sylvia Qi original https://aws.amazon.com/blogs/devops/deploy-and-manage-gitlab-runners-on-amazon-ec2/

Gitlab CI is a tool utilized by many enterprises to automate their continuous integration, continuous delivery, and deployment (CI/CD) processes. A Gitlab CI/CD pipeline consists of two major components: a .gitlab-ci.yml file describing a pipeline’s jobs, and a Gitlab Runner, an application that executes the pipeline jobs.

Setting up the Gitlab Runner is a time-consuming process. It involves provisioning the necessary infrastructure, installing the necessary software to run pipeline workloads, and configuring the runner. For enterprises running hundreds of pipelines across multiple environments, it is essential to automate the Gitlab Runner deployment process so that runners can be deployed quickly in a repeatable, consistent manner.

This post will guide you through utilizing Infrastructure-as-Code (IaC) to automate Gitlab Runner deployment and administrative tasks on Amazon EC2. With IaC, you can quickly and consistently deploy the entire Gitlab Runner architecture by running a script. You can track and manage changes efficiently. And, you can enforce guardrails and best practices via code. The solution presented here also offers autoscaling so that you save costs by terminating resources when not in use. You will learn:

  • How to deploy Gitlab Runner quickly and consistently across multiple AWS accounts.
  • How to enforce guardrails and best practices on the Gitlab Runner through IaC.
  • How to autoscale Gitlab Runner based on workloads to ensure best performance and save costs.

This post comes from a DevOps engineer perspective, and assumes that the engineer is familiar with the practices and tools of IaC and CI/CD.

Overview of the solution

The following diagram displays the solution architecture. We use AWS CloudFormation to describe the infrastructure that is hosting the Gitlab Runner. The main steps are as follows:

  1. The user runs a deploy script in order to deploy the CloudFormation template. The template is parameterized, and the parameters are defined in a properties file. The properties file specifies the infrastructure configuration, as well as the environment in which to deploy the template.
  2. The deploy script calls CloudFormation CreateStack API to create a Gitlab Runner stack in the specified environment.
  3. During stack creation, an EC2 autoscaling group is created with the desired number of EC2 instances. Each instance is launched via a launch template, which is created with values from the properties file. An IAM role is created and attached to the EC2 instance. The role contains permissions required for the Gitlab Runner to execute pipeline jobs. A lifecycle hook is attached to the autoscaling group on instance termination events. This ensures graceful instance termination.
  4. During instance launch, CloudFormation uses a cfn-init helper script to install and configure the Gitlab Runner:
    1. cfn-init installs the Gitlab Runner software on the EC2 instance.
    2. cfn-init configures the Gitlab Runner as a docker executor using a pre-defined docker image in the Gitlab Container Registry. The docker executor implementation lets the Gitlab Runner run each build in a separate and isolated container. The docker image contains the software required to run the pipeline workloads, thereby eliminating the need to install these packages during each build.
    3. cfn-init registers the Gitlab Runner to Gitlab projects specified in the properties file, so that these projects can utilize the Gitlab Runner to run pipelines.
  5. The user may repeat the same steps to deploy Gitlab Runner into another environment.

Architecture diagram previously explained in post.

Walkthrough

This walkthrough will demonstrate how to deploy the Gitlab Runner, and how easy it is to conduct Gitlab Runner administrative tasks via this architecture. We will walk through the following tasks:

  • Build a docker executor image for the Gitlab Runner.
  • Deploy the Gitlab Runner stack.
  • Update the Gitlab Runner.
  • Terminate the Gitlab Runner.
  • Add/Remove Gitlab projects from the Gitlab Runner.
  • Autoscale the Gitlab Runner based on workloads.

The code in this post is available at https://github.com/aws-samples/amazon-ec2-gitlab-runner.git

Prerequisites

For this walkthrough, you need the following:

  • A Gitlab account (all tiers including Gitlab Free self-managed, Gitlab Free SaaS, and higher tiers). This demo uses the gitlab.com free tier.
  • A Gitlab Container Registry.
  • Git client to clone the source code provided.
  • An AWS account with local credentials properly configured (typically under ~/.aws/credentials).
  • The latest version of the AWS CLI. For more information, see Installing, updating, and uninstalling the AWS CLI.
  • Docker is installed and running on the localhost/laptop.
  • Nodejs and npm installed on the localhost/laptop.
  • A VPC with two private subnets that is connected to the internet via a NAT gateway allowing outbound traffic.
  • The following IAM service-linked role created in the AWS account: AWSServiceRoleForAutoScaling
  • An Amazon S3 bucket for storing Lambda deployment packages.
  • Familiarity with Git, Gitlab CI/CD, Docker, EC2, CloudFormation and Amazon CloudWatch.

Build a docker executor image for the Gitlab Runner

The Gitlab Runner in this solution is implemented as a docker executor. The docker executor connects to Docker Engine and runs each build in a separate and isolated container via a predefined docker image. The first step in deploying the Gitlab Runner is building a docker executor image. We provide a simple Dockerfile for building this image. You may customize the Dockerfile to install your own requirements.

To build a docker image using the sample Dockerfile:

  1. Create a directory where we will store our demo code. From your terminal run:
mkdir demo-repos && cd demo-repos
  2. Clone the source code repository found in the following location:
git clone https://github.com/aws-samples/amazon-ec2-gitlab-runner.git
  3. Create a new project on your Gitlab server. Name the project any name you like.
  4. Clone your newly created repo to your laptop. Ignore the warning about cloning an empty repository.
git clone <your-repo-url>
  5. Copy the demo repo files into your newly created repo on your laptop, and push them to your Gitlab repository. You may customize the Dockerfile before pushing it to Gitlab.
cp -r amazon-ec2-gitlab-runner/* <your-repo-dir>
cd <your-repo-dir>
git add .
git commit -m "Initial commit"
git push
  6. On the Gitlab console, go to your repository’s Packages & Registries -> Container Registry. Follow the instructions provided on the Container Registry page in order to build and push a docker image to your repository’s container registry.

Deploy the Gitlab Runner stack

Once the docker executor image has been pushed to the Gitlab Container Registry, we can deploy the Gitlab Runner. The Gitlab Runner infrastructure is described in the Cloudformation template gitlab-runner.yaml. Its configuration is stored in a properties file called sample-runner.properties. A launch template is created with the values in the properties file. Then it is used to launch instances. This architecture lets you deploy Gitlab Runner to as many environments as you like by utilizing the configurations provided in the appropriate properties files.

During the provisioning process, we utilize a cfn-init helper script to run a series of commands that install and configure the Gitlab Runner.

          commands:
            01InstallDocker:
              command: sudo yum -y install docker
            02StartDocker:
              command: sudo service docker start
            03DownloadGitlabRunner:
              command: sudo wget -O /usr/bin/gitlab-runner https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64
            04ChmodGitlabRunner:
              command: sudo chmod a+x /usr/bin/gitlab-runner
            05AddUser:
              command: sudo useradd --comment 'GitLab Runner' --create-home gitlab-runner --shell /bin/bash
            06InstallGitlabRunner:
              command: sudo gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner
            07SetRegion:
              command: !Sub 'aws configure set default.region ${AWS::Region}'
            08ConfigureDockerExecutor:
              command: !Sub 
                - |
                  for GitlabGroupToken in `aws ssm get-parameters --names /${AWS::StackName}/ci-tokens --query 'Parameters[0].Value' | sed -e "s/\"//g" | sed "s/,/ /g"`;do
                      sudo gitlab-runner register \
                      --non-interactive \
                      --url "${GitlabServerURL}" \
                      --registration-token $GitlabGroupToken \
                      --executor "docker" \
                      --docker-image "${DockerImagePath}" \
                      --description "Gitlab Runner with Docker Executor" \
                      --locked="${isLOCKED}" --access-level "${ACCESS}" \
                      --docker-volumes "/var/run/docker.sock:/var/run/docker.sock" \
                      --tag-list "${RunnerEnvironment}-${RunnerVersion}-docker"
                  done
                - isLOCKED: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, isLOCKED]
                  ACCESS: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, ACCESS]                              
            09StartGitlabRunner:
              command: sudo gitlab-runner start

The helper script ensures that the Gitlab Runner setup is consistent and repeatable for each deployment. If a configuration change is required, users simply update the configuration steps and redeploy the stack. Furthermore, all changes are tracked in Git, which allows for versioning of the Gitlab Runner.

To deploy the Gitlab Runner stack:

  1. Obtain the runner registration tokens of the Gitlab projects that you want registered to the Gitlab Runner. Obtain the token by selecting the project’s Settings > CI/CD and expand the Runners section.
  2. Update the sample-runner.properties file parameters according to your own environment. Refer to the gitlab-runner.yaml file for a description of these parameters. Rename the file if you like. You may also create an additional properties file for deploying into other environments.
  3. Run the deploy script to deploy the runner:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

<properties-file> is the name of the properties file.

<region> is the region where you want to deploy the stack.

<aws-profile> is the name of the CLI profile you set up in the prerequisites section.

<stack-name> is the name you chose for the CloudFormation stack.

For example:

./deploy-runner.sh sample-runner.properties us-east-1 dev amazon-ec2-gitlab-runner-demo
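The contents of deploy-runner.sh are not shown in this post. As a sketch of the general idea only, the following turns a Key=Value properties file into CloudFormation parameter overrides and prints (without running) a matching deploy command. The function name props_to_overrides is hypothetical, and the use of aws cloudformation deploy is an assumption; the actual script may call CloudFormation differently.

```shell
# props_to_overrides FILE
# Print the Key=Value pairs from FILE on one line, skipping blank lines
# and lines that start with "#".
props_to_overrides() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | tr '\n' ' ' | sed 's/ $//'
}

props="sample-runner.properties"
if [ -f "$props" ]; then
  # Printed only, not executed. Stack name, region, and profile are the
  # example values used in this walkthrough.
  echo aws cloudformation deploy \
    --template-file gitlab-runner.yaml \
    --stack-name amazon-ec2-gitlab-runner-demo \
    --region us-east-1 --profile dev \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameter-overrides $(props_to_overrides "$props")
fi
```

This simple parser assumes that property values contain no spaces; values with spaces would need quoting before being passed on the command line.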

After the stack is deployed successfully, you will see the Gitlab Runner autoscaling group created in the EC2 console:

Under your Gitlab project Settings > CI/CD > Runners > Available specific runners, you will see the fully configured Gitlab Runner. The green circle indicates that the Gitlab Runner is ready for use.

Updating the Gitlab Runner

There are times when you would want to update the Gitlab Runner. For example, updating the instance VolumeSize in order to resolve a disk space issue, or updating the AMI ID when a new AMI becomes available.

Utilizing the properties file and launch template makes it easy to update the Gitlab Runner. Simply update the Gitlab Runner configuration parameters in the properties file. Then, run the deploy script to update the Gitlab Runner stack. To ensure that the changes take effect immediately (that is, existing instances are replaced by new instances with the new configuration), we utilize an AutoScalingRollingUpdate update policy to automatically update the instances in the autoscaling group.

    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: !Ref MinInstancesInService
        MaxBatchSize: !Ref MaxBatchSize
        PauseTime: "PT5M"
        WaitOnResourceSignals: true
        SuspendProcesses:
          - HealthCheck
          - ReplaceUnhealthy
          - AZRebalance
          - AlarmNotification
          - ScheduledActions

The policy tells CloudFormation that when changes are detected in the launch template, it should update the instances in batches of MaxBatchSize, while keeping the number of instances specified in MinInstancesInService in service during the update.

Below is an example of updating the Gitlab Runner instance type.

To update the instance type of the runner instance:

  1. Update the “InstanceType” parameter in the properties file.

InstanceType=t2.medium

  2. Run the deploy-runner.sh script to update the CloudFormation stack:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

In the CloudFormation console, you will see that the launch template is updated first, then a rolling update is initiated. The instance type update requires a replacement of the original instance, so a temporary instance was launched and put in service. Then, the temporary instance was terminated when the new instance was launched successfully.

After the update is complete, you will see that on the Gitlab project’s console, the old Gitlab Runner, ez_5x8Rv, is replaced by the new Gitlab Runner, N1_UQ7yc.

Terminate the Gitlab Runner

There are times when an autoscaling group instance must be terminated, for example during an autoscaling scale-in event, or when the instance is being replaced by a new instance during a stack update, as seen previously. When terminating an instance, you must ensure that the Gitlab Runner finishes executing any running jobs before the instance is terminated; otherwise, your environment could be left in an inconsistent state. Also, we want to ensure that the terminated Gitlab Runner is removed from the Gitlab project. We utilize an autoscaling lifecycle hook to achieve these goals.

The lifecycle hook works like this: A CloudWatch event rule listens for EC2 Instance-terminate events. When one is detected, the event rule triggers a Lambda function. The Lambda function calls SSM Run Command to run a series of commands on the EC2 instance via an SSM document. The commands stop the Gitlab Runner gracefully once all running jobs are finished, de-register the runner from the Gitlab projects, and signal the autoscaling group to terminate the instance.
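
The Lambda step of this flow could look roughly like the sketch below. It is an illustration under stated assumptions, not the function shipped with this solution: the SSM document name, the document parameter names, and the helper name are hypothetical. Only the event fields (taken from the Auto Scaling "EC2 Instance-terminate Lifecycle Action" event) and the general shape of an SSM send_command request are standard.

```python
def build_send_command_request(event, document_name="GitlabRunnerShutdown"):
    """Translate a lifecycle event into keyword arguments for SSM send_command.

    `document_name` and the keys under "Parameters" are hypothetical; they
    must match whatever the real SSM document declares.
    """
    detail = event["detail"]
    return {
        "InstanceIds": [detail["EC2InstanceId"]],
        "DocumentName": document_name,
        # SSM passes document parameters as lists of strings.
        "Parameters": {
            "AutoScalingGroupName": [detail["AutoScalingGroupName"]],
            "LifecycleHookName": [detail["LifecycleHookName"]],
            "LifecycleActionToken": [detail["LifecycleActionToken"]],
        },
    }


# Trimmed example event with made-up identifiers.
sample_event = {
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "EC2InstanceId": "i-0123456789abcdef0",
        "AutoScalingGroupName": "gitlab-runner-asg",
        "LifecycleHookName": "runner-terminate-hook",
        "LifecycleActionToken": "11111111-2222-3333-4444-555555555555",
    },
}

request = build_send_command_request(sample_event)
print(request["InstanceIds"], request["DocumentName"])
```

Inside a real handler, the returned dictionary would be passed to the SSM client, for example `boto3.client("ssm").send_command(**request)`. Keeping the request construction in a plain function makes it easy to test without AWS access.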

There are also times when you want to terminate an instance manually, for example when an instance is suspected to not be functioning properly. To terminate an instance from the Gitlab Runner autoscaling group, use the following command:

aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id="${InstanceId}" \
    --no-should-decrement-desired-capacity \
    --region="${region}" \
    --profile="${profile}"

The above command terminates the instance. The lifecycle hook ensures that the cleanup steps are conducted properly, and the autoscaling group launches another new instance to replace the old one.

Note that if you terminate the instance by using the “aws ec2 terminate-instances” command, then the autoscaling lifecycle hook actions will not be triggered.

Add/Remove Gitlab projects from the Gitlab Runner

As new projects are added to your enterprise, you may want to register them to the Gitlab Runner, so that those projects can utilize the Gitlab Runner to run pipelines. On the other hand, you would want to remove the Gitlab Runner from a project if it no longer wants to utilize the Gitlab Runner, or if it no longer qualifies to utilize the Gitlab Runner, for example if a project is no longer allowed to deploy to an environment configured by the Gitlab Runner. Our architecture offers a simple way to add and remove projects from the Gitlab Runner. To add new projects to the Gitlab Runner, update the RunnerRegistrationTokens parameter in the properties file, and then rerun the deploy script to update the Gitlab Runner stack.

To add new projects to the Gitlab Runner:

  1. Update the RunnerRegistrationTokens parameter in the properties file. For example:
RunnerRegistrationTokens=ps8RjBSruy1sdRdP2nZX,XbtZNv4yxysbYhqvjEkC
  2. Update the Gitlab Runner stack. This updates the SSM parameter that stores the tokens.
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 
  3. Relaunch the instances in the Gitlab Runner autoscaling group. The new instances will use the new RunnerRegistrationTokens value. Run the following command to relaunch the instances:
./cycle-runner.sh <runner-autoscaling-group-name> <region> <optional-aws-profile>

To remove projects from the Gitlab Runner, follow the steps described above, with just one difference. Instead of adding new tokens to the RunnerRegistrationTokens parameter, remove the token(s) of the project that you want to dissociate from the runner.

Autoscale the runner based on custom performance metrics

Each Gitlab Runner can be configured to handle a fixed number of concurrent jobs. Once this capacity is reached for every runner, any new jobs will be in a Queued/Waiting status until the current jobs complete, which would be a poor experience for our team. Setting the number of concurrent jobs too high on our runners would also result in a poor experience, because all jobs leverage the same CPU, memory, and storage in order to conduct the builds.

In this solution, we utilize a scheduled Lambda function that runs every minute to inspect the number of jobs running on every runner, leveraging the Prometheus Metrics endpoint that the runners expose. If we approach the concurrent build limit of the group, then we increase the Autoscaling Group size so that it can take on more work. As the number of concurrent jobs decreases, the scheduled Lambda function scales the Autoscaling Group back in to minimize cost. The scale-up operation ignores the Autoscaling Group’s cooldown period, which helps ensure that our team is not waiting on a new instance, whereas the scale-down operation obeys the group’s cooldown period.

Here is the logical sequence diagram for the work:

Sequence diagram
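
The scaling decision described above can be sketched as a small pure function. This is a minimal illustration, not the Lambda function deployed by the stack: the function name, the 80% target utilization, and the way capacity is computed are assumptions. The returned keys mirror the DesiredCapacity and HonorCooldown arguments of the Auto Scaling SetDesiredCapacity API, so that scale-out skips the cooldown and scale-in honors it.

```python
import math


def plan_scaling(active_jobs, jobs_per_runner, current_capacity,
                 min_size, max_size, target_utilization=0.8):
    """Return SetDesiredCapacity arguments, or None when no change is needed.

    target_utilization is an assumed threshold: the group grows before
    every runner is completely full.
    """
    if jobs_per_runner <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("jobs_per_runner must be > 0 and 0 < target_utilization <= 1")

    # Runners needed to keep utilization at or below the target.
    needed = math.ceil(active_jobs / (jobs_per_runner * target_utilization))
    desired = max(min_size, min(max_size, needed))

    if desired == current_capacity:
        return None
    return {
        "DesiredCapacity": desired,
        # Scale-in (smaller group) obeys the cooldown; scale-out ignores it.
        "HonorCooldown": desired < current_capacity,
    }


print(plan_scaling(active_jobs=7, jobs_per_runner=4, current_capacity=2,
                   min_size=1, max_size=5))
```

In a scheduled Lambda function, a non-None result would be merged with the group name and passed to the Auto Scaling client's `set_desired_capacity` call.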

For operational monitoring, the Lambda function also publishes custom CloudWatch Metrics for the count of active jobs, along with the target and actual capacities of the Autoscaling group. We can utilize this information to validate that the system is working properly and determine if we need to modify any of our autoscaling parameters.

Congratulations! You have completed the walkthrough. Take some time to review the resources you have deployed, and practice the various runner administrative tasks that we have covered in this post.

Troubleshooting

Problem: I deployed the CloudFormation template, but no runner is listed in my repository.

Possible Cause: Errors have been encountered during cfn-init, causing runner registration to fail. Connect to your runner EC2 instance, and check /var/log/cfn-*.log files.

Cleaning up

To avoid incurring future charges, delete every resource provisioned in this demo by deleting the CloudFormation stack created in the “Deploy the Gitlab Runner stack” section.

Conclusion

This article demonstrated how to utilize IaC to efficiently conduct various administrative tasks associated with a Gitlab Runner. We deployed Gitlab Runner consistently and quickly across multiple accounts. We utilized IaC to enforce guardrails and best practices, such as tracking Gitlab Runner configuration changes, terminating the Gitlab Runner gracefully, and autoscaling the Gitlab Runner to ensure best performance and minimum cost. We walked through the deploying, updating, autoscaling, and terminating of the Gitlab Runner. We also saw how easy it was to clean up the entire Gitlab Runner architecture by simply deleting a CloudFormation stack.

About the authors

Sylvia Qi

Sylvia is a Senior DevOps Architect focusing on architecting and automating DevOps processes, helping customers through their DevOps transformation journey. In her spare time, she enjoys biking, swimming, yoga, and photography.

Sebastian Carreras

Sebastian is a Senior Cloud Application Architect with AWS Professional Services. He leverages his breadth of experience to deliver bespoke solutions to satisfy the visions of his customer. In his free time, he really enjoys doing laundry. Really.

Handy Tips #21: Deploying Zabbix Server with Docker containers

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/handy-tips-21-deploying-zabbix-server-with-docker-containers/18972/

Deploy Zabbix components in docker containers for advanced automation, scalability, and maintenance.

In the past few years, containers have gained prevalence and are being used for many different tasks – from application development to improving automation and management of existing software.

Deploy Zabbix components in Docker containers:

  • Official Docker images are available for individual components
  • Automate the deployment of your Zabbix containers

  • Use containers to quickly scale your environment
  • Upgrade to a newer Zabbix version by deploying containers from the latest container images

Check out the video to learn how to deploy the Zabbix server with Docker containers.

How to deploy Zabbix server with Docker containers:
 
  1. Connect to your Docker container host
  2. Create a new docker network. Specify the subnet and the IP range for containers.
  3. Deploy your Zabbix server container
    1. Give the container a name and assign it to the newly created network
    2. Pass the Database host, user, and password in environment variables
    3. Map the port 10051 on the host to the port 10051 on the container
    4. Select the required Docker image and tag
  4. Deploy your Zabbix frontend container
    1. Give the container a name and assign it to the newly created network
    2. Pass the Database host, user, and password in environment variables
    3. Pass the Zabbix server address in the environment variable
    4. Map port 80 on your host to port 8080 on the container
  5. Use docker ps and docker logs to check if the containers are running
  6. Connect to your Zabbix frontend and confirm that there are no issues with the environment
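
As a rough sketch, the steps above map to the Docker commands assembled below. The script only builds and prints the commands; it does not run them. Several details are assumptions that the steps do not specify: a MySQL backend (hence the zabbix-server-mysql and zabbix-web-nginx-mysql images and the MYSQL_* variables), the network name, the subnet and IP range, the container names, and the placeholder tag.

```python
import shlex


def docker_commands(db_host, db_user, db_password, tag,
                    network="zabbix-net",
                    subnet="172.20.0.0/16",
                    ip_range="172.20.240.0/20"):
    """Build the docker CLI invocations for the deployment steps (illustrative values)."""
    db_env = [
        "-e", f"DB_SERVER_HOST={db_host}",
        "-e", f"MYSQL_USER={db_user}",
        "-e", f"MYSQL_PASSWORD={db_password}",
    ]
    return [
        # Step 2: dedicated network with an explicit subnet and IP range.
        ["docker", "network", "create", "--subnet", subnet,
         "--ip-range", ip_range, network],
        # Step 3: Zabbix server, port 10051 published on the host.
        ["docker", "run", "--name", "zabbix-server", "--network", network,
         *db_env, "-p", "10051:10051", "-d",
         f"zabbix/zabbix-server-mysql:{tag}"],
        # Step 4: frontend, host port 80 mapped to container port 8080.
        ["docker", "run", "--name", "zabbix-web", "--network", network,
         *db_env, "-e", "ZBX_SERVER_HOST=zabbix-server",
         "-p", "80:8080", "-d",
         f"zabbix/zabbix-web-nginx-mysql:{tag}"],
        # Step 5: check that the containers are running.
        ["docker", "ps"],
        ["docker", "logs", "zabbix-server"],
    ]


for command in docker_commands("mysql-server", "zabbix", "change-me", tag="latest"):
    print(shlex.join(command))
```

Pin the tag to the Zabbix version you actually intend to run rather than relying on latest, and pass real credentials through a safer channel than the command line in production.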

Tips and best practices:
  • Container logs can be accessed by using the docker logs command
  • Zabbix server checks for an existing Zabbix database. If one does not exist, it is created.
  • Use the docker exec command to run commands inside a container
  • All of the supported container environment variables are available in https://hub.docker.com/u/zabbix

The post Handy Tips #21: Deploying Zabbix Server with Docker containers appeared first on Zabbix Blog.

Using AWS CodePipeline for deploying container images to AWS Lambda Functions

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/using-aws-codepipeline-for-deploying-container-images-to-aws-lambda-functions/

AWS Lambda launched support for packaging and deploying functions as container images at re:Invent 2020. In the post working with Lambda layers and extensions in container images, we demonstrated packaging Lambda Functions with layers while using container images. This post will teach you to use AWS CodePipeline to deploy docker images for a microservices architecture involving multiple Lambda Functions and a common layer utilized as a separate container image. Lambda functions packaged as container images do not support adding Lambda layers to the function configuration. Instead, we can use a container image as a common layer while building the other container images for the Lambda Functions, as shown in this post. Packaging Lambda functions as container images enables familiar tooling and larger deployment limits.

Here are some advantages of using container images for Lambda:

  • Easier dependency management and application building with container
    • Install native operating system packages
    • Install language-compatible dependencies
  • Consistent tool set for containers and Lambda-based applications
    • Utilize the same container registry to store application artifacts (Amazon ECR, Docker Hub)
    • Utilize the same build and pipeline tools to deploy
    • Tools that can inspect Dockerfile work the same
  • Deploy large applications with AWS-provided or third-party images up to 10 GB
    • Include larger application dependencies that previously were impossible

When using container images with Lambda, CodePipeline automatically detects code changes in the source repository in AWS CodeCommit, then passes the artifact to a build service such as AWS CodeBuild, which builds the container images and pushes them to Amazon ECR. The images are then deployed to the Lambda functions.

Architecture diagram

 

DevOps Architecture

Lambda-docker-images-DevOps-Architecture

Application Architecture

lambda-docker-image-microservices-app

In the above architecture diagram, two architectures are combined: (1) the DevOps architecture and (2) the microservices application architecture. The DevOps architecture demonstrates the use of AWS Developer services such as AWS CodeCommit, AWS CodePipeline, and AWS CodeBuild, along with Amazon Elastic Container Registry (Amazon ECR) and AWS CloudFormation. These are used to support Continuous Integration and Continuous Deployment/Delivery (CI/CD) for both infrastructure and application code. The microservices application architecture demonstrates how the various AWS Lambda Functions that are part of the microservices utilize container images for application code. This post will focus on performing CI/CD for Lambda functions utilizing container images. The application code used here is a simpler version taken from Serverless DataLake Framework (SDLF). For more information, refer to the AWS Samples GitHub repository for SDLF here.

DevOps workflow

There are two CodePipelines: one for building and pushing the common layer docker image to Amazon ECR, and another for building and pushing the docker images for all the Lambda Functions within the microservices architecture to Amazon ECR, as well as deploying the microservices architecture involving Lambda Functions via CloudFormation. The common layer container image functions as a common layer in all other Lambda Function container images, so its code is maintained in a separate CodeCommit repository that serves as the source stage for its own CodePipeline. The common layer CodePipeline takes the code from the CodeCommit repository and passes the artifact to a CodeBuild project that builds the container image and pushes it to an Amazon ECR repository. This common layer ECR repository is one source for the microservices architecture CodePipeline; the other is the CodeCommit repository holding the code for all other Lambda Functions and resources involved.

Because all or most of the Lambda Functions in the microservices architecture require the common layer container image as a layer, any change made to it should invoke the microservices architecture CodePipeline, which builds the container images for all Lambda Functions. The CodeCommit repository holding the code for every resource in the microservices architecture is the second source that invokes that CodePipeline. The pipeline has two sources because the container images in the microservices architecture should be rebuilt both for changes in the common layer container image and for code changes pushed to the CodeCommit repository.

Below is the sample Dockerfile that uses the common layer container image as a layer:

ARG ECR_COMMON_DATALAKE_REPO_URL
FROM ${ECR_COMMON_DATALAKE_REPO_URL}:latest AS layer
FROM public.ecr.aws/lambda/python:3.8
# Layer Code
WORKDIR /opt
COPY --from=layer /opt/ .
# Function Code
WORKDIR /var/task
COPY src/lambda_function.py .
CMD ["lambda_function.lambda_handler"]

where the argument ECR_COMMON_DATALAKE_REPO_URL should resolve to the ECR URL of the common layer container image, which is passed via the --build-arg option of the docker build command. For example:

export ECR_COMMON_DATALAKE_REPO_URL="0123456789.dkr.ecr.us-east-2.amazonaws.com/dev-routing-lambda"
docker build --build-arg ECR_COMMON_DATALAKE_REPO_URL=$ECR_COMMON_DATALAKE_REPO_URL .

Deploying a Sample

  • Step 1: Clone the repository Codepipeline-lambda-docker-images to your workstation. If using the zip file, then unzip the file to a local directory.
    • git clone https://github.com/aws-samples/codepipeline-lambda-docker-images.git
  • Step 2: Change the directory to the cloned directory or extracted directory. The local code repository structure should appear as follows:
    • cd codepipeline-lambda-docker-images

code-repository-structure

  • Step 3: Deploy the CloudFormation stack using the template file CodePipelineTemplate/codepipeline.yaml to your AWS account. This deploys the resources required for the DevOps architecture involving AWS CodePipelines for the common layer code and the microservices architecture code. Deploy CloudFormation stacks using the AWS console by following the documentation here, providing the name for the stack (for example datalake-infra-resources) and passing the parameters while navigating the console. Alternatively, use the AWS CLI to deploy a CloudFormation stack by following the documentation here.
  • Step 4: When the CloudFormation Stack deployment completes, navigate to the AWS CloudFormation console and to the Outputs section of the deployed stack, then note the CodeCommit repository urls. Three CodeCommit repo urls are available in the CloudFormation stack outputs section for each CodeCommit repository. Choose one of them based on the way you want to access it. Refer to the following documentation Setting up for AWS CodeCommit. I will be using the git-remote-codecommit (grc) method throughout this post for CodeCommit access.
  • Step 5: Clone the CodeCommit repositories and add code:
      • Common Layer CodeCommit repository: Take the value of the Output for the key oCommonLayerCodeCommitHttpsGrcRepoUrl from datalake-infra-resources CloudFormation Stack Outputs section which looks like below:

    commonlayercodeoutput

      • Clone the repository:
        • git clone codecommit::us-east-2://dev-CommonLayerCode
      • Change the directory to dev-CommonLayerCode
        • cd dev-CommonLayerCode
      •  Add contents to the cloned repository from the source code downloaded in Step 1. Copy the code from the CommonLayerCode directory and the repo contents should appear as follows:

    common-layer-repository

      • Create the main branch and push to the remote repository
        git checkout -b main
        git add ./
        git commit -m "Initial Commit"
        git push -u origin main
      • Application CodeCommit repository: Take the value of the Output for the key oAppCodeCommitHttpsGrcRepoUrl from datalake-infra-resources CloudFormation Stack Outputs section which looks like below:

    appcodeoutput

      • Clone the repository:
        • git clone codecommit::us-east-2://dev-AppCode
      • Change the directory to dev-AppCode
        • cd dev-AppCode
      • Add contents to the cloned repository from the source code downloaded in Step 1. Copy the code from the ApplicationCode directory and the repo contents should appear as follows from the root:

    app-layer-repository

    • Create the main branch and push to the remote repository
      git checkout -b main
      git add ./
      git commit -m "Initial Commit"
      git push -u origin main

What happens now?

  • Now the Common Layer CodePipeline goes to the InProgress state and invokes the Common Layer CodeBuild project, which builds the docker image and pushes it to the Common Layer Amazon ECR repository. The image tag for the container image is derived from the CodeBuild environment variable CODEBUILD_RESOLVED_SOURCE_VERSION, which in this case holds the CodeCommit commit ID; the buildspec keeps its first seven characters.
    For example, if the commit ID in CodeCommit begins with f1769c8, then the pushed docker image is tagged f1769c8 along with latest.
  • The buildspec.yaml file appears as follows:
    version: 0.2
    phases:
      install:
        runtime-versions:
          docker: 19
      pre_build:
        commands:
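          - echo Logging in to Amazon ECR...
          - aws --version
          - REPOSITORY_URI=$ECR_COMMON_DATALAKE_REPO_URL
          # "aws ecr get-login" was removed in AWS CLI v2, so log in with get-login-password instead
          - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin ${REPOSITORY_URI%%/*}
          - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
          - IMAGE_TAG=${COMMIT_HASH:=latest}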
      build:
        commands:
          - echo Build started on `date`
          - echo Building the Docker image...          
          - docker build -t $REPOSITORY_URI:latest .
          - docker tag $REPOSITORY_URI:latest $REPOSITORY_URI:$IMAGE_TAG
      post_build:
        commands:
          - echo Build completed on `date`
          - echo Pushing the Docker images...
          - docker push $REPOSITORY_URI:latest
          - docker push $REPOSITORY_URI:$IMAGE_TAG
  • Now the microservices architecture CodePipeline goes to the InProgress state and invokes the application image builder CodeBuild project, which builds the docker images and pushes them to the Amazon ECR repositories.
    • To improve performance, every docker image is built in parallel within the CodeBuild project. The buildspec.yaml executes the build.sh script, which contains the logic to build the docker images required for each Lambda Function in the microservices architecture. Building the docker images for this sample architecture took approximately 4 to 5 minutes when done serially; after switching to parallel builds, it took approximately 40 to 50 seconds.
    • The buildspec.yaml file appears as follows:
      version: 0.2
      phases:
        install:
          runtime-versions:
            docker: 19
          commands:
            - uname -a
            - set -e
            - chmod +x ./build.sh
            - ./build.sh
      artifacts:
        files:
          - cfn/**/*
        name: builds/$CODEBUILD_BUILD_NUMBER/cfn-artifacts
    • build.sh file appears as follows:
      #!/bin/bash
      set -eu
      set -o pipefail
      
      RESOURCE_PREFIX="${RESOURCE_PREFIX:=stg}"
      AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:=us-east-1}"
      ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text 2>&1)
      # No backslash before "/": inside double quotes it would end up literally in the URL
      ECR_COMMON_DATALAKE_REPO_URL="${ECR_COMMON_DATALAKE_REPO_URL:=$ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$RESOURCE_PREFIX-common-datalake-library}"
      pids=()
      pids1=()
      
      PROFILE='new-profile'
      aws configure --profile $PROFILE set credential_source EcsContainer
      
      aws --version
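      # "aws ecr get-login" was removed in AWS CLI v2, so log in with get-login-password instead
      aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com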
      COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      BUILD_TAG=build-$(echo $CODEBUILD_BUILD_ID | awk -F":" '{print $2}')
      # Prefer the build tag, then the commit hash, then "latest"
      IMAGE_TAG=${BUILD_TAG:-${COMMIT_HASH:-latest}}
      
      cd dockerfiles;
      mkdir ../logs
      function pwait() {
          while [ $(jobs -p | wc -l) -ge $1 ]; do
              sleep 1
          done
      }
      
      function build_dockerfiles() {
          if [ -d $1 ]; then
              directory=$1
              cd $directory
              echo $directory
              echo "---------------------------------------------------------------------------------"
              echo "Start creating docker image for $directory..."
              echo "---------------------------------------------------------------------------------"
                  REPOSITORY_URI=$ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$RESOURCE_PREFIX-$directory
                  docker build --build-arg ECR_COMMON_DATALAKE_REPO_URL=$ECR_COMMON_DATALAKE_REPO_URL . -t $REPOSITORY_URI:latest -t $REPOSITORY_URI:$IMAGE_TAG -t $REPOSITORY_URI:$COMMIT_HASH
                  echo Build completed on `date`
                  echo Pushing the Docker images...
                  docker push $REPOSITORY_URI
              cd ../
              echo "---------------------------------------------------------------------------------"
              echo "End creating docker image for $directory..."
              echo "---------------------------------------------------------------------------------"
          fi
      }
      
      for directory in *; do 
         echo "------Started processing code in $directory directory-----"
         build_dockerfiles $directory 2>&1 1>../logs/$directory-logs.log | tee -a ../logs/$directory-logs.log &
         pids+=($!)
         pwait 20
      done
      
      for pid in "${pids[@]}"; do
        wait "$pid"
      done
      
      cd ../cfn/
      function build_cfnpackages() {
          directory=$1
          if [ -d "$directory" ]; then
              cd $directory
              echo $directory
              echo "---------------------------------------------------------------------------------"
              echo "Start packaging cloudformation package for $directory..."
              echo "---------------------------------------------------------------------------------"
              aws cloudformation package --profile $PROFILE --template-file template.yaml --s3-bucket $S3_BUCKET --output-template-file packaged-template.yaml
              echo "Replace the parameter 'pEcrImageTag' value with the latest built tag"
              echo $(jq --arg Image_Tag "$IMAGE_TAG" '.Parameters |= . + {"pEcrImageTag":$Image_Tag}' parameters.json) > parameters.json
              cat parameters.json
              ls -al
              cd ../
              echo "---------------------------------------------------------------------------------"
              echo "End packaging cloudformation package for $directory..."
              echo "---------------------------------------------------------------------------------"
          fi
      }
      
      for directory in *; do
          echo "------Started processing code in $directory directory-----"
          build_cfnpackages $directory 2>&1 1>../logs/$directory-logs.log | tee -a ../logs/$directory-logs.log &
          pids1+=($!)
          pwait 20
      done
      
      for pid in "${pids1[@]}"; do
        wait "$pid"
      done
      
      cd ../logs/
      ls -al
      for f in *; do
        printf '%s\n' "$f"
        paste /dev/null - < "$f"
      done
      
      cd ../
      

The build_dockerfiles() function is called once for each directory within the dockerfiles directory and runs the docker build command to build that directory’s docker image. The directory that holds the Dockerfile determines the name of the docker image and of the ECR repository. For example, if the Dockerfile directory is routing-lambda and the environment variables take the values below,

ACCOUNT_ID=0123456789
AWS_DEFAULT_REGION=us-east-2
RESOURCE_PREFIX=dev
directory=routing-lambda
REPOSITORY_URI=$ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$RESOURCE_PREFIX-$directory

Then REPOSITORY_URI becomes 0123456789.dkr.ecr.us-east-2.amazonaws.com/dev-routing-lambda
And the docker image is pushed to this resolved REPOSITORY_URI. Similarly, docker images for all other directories are built and pushed to Amazon ECR.

Important Note: The ECR repository names match the names of the directories where the Dockerfiles exist, and the repositories were already created as part of the CloudFormation template codepipeline.yaml that was deployed in Step 3. To add more Lambda Functions to the microservices architecture, make sure that the name of each new ECR repository added to the codepipeline.yaml template matches the corresponding directory name within the dockerfiles directory of the AppCode repository.

Every docker image is built in parallel in order to save time. Each build runs as a separate operating system process and pushes its image to the Amazon ECR repository. The script also limits the number of processes that run in parallel through the argument passed to the pwait function within the loop. For example, with pwait 20, at most 20 processes run in parallel at a given time. The image tag for all docker images used for Lambda Functions is constructed via the CodeBuild BuildId, which is available via environment variable $CODEBUILD_BUILD_ID, in order to ensure that a new image gets a new tag. This is required for CloudFormation to detect changes and update Lambda Functions with the new container image tag.
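
The bounded parallelism that build.sh achieves with background jobs and pwait can be illustrated in Python with a thread pool. This is only a sketch of the idea, not a replacement for the script: run_bounded and fake_build are hypothetical names, and the sleep stands in for a real docker build and push.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(items, task, max_parallel=20):
    """Run task(item) for every item with at most max_parallel running at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the order of the inputs.
        return list(pool.map(task, items))


# Demo: record how many "builds" overlap to show that the cap holds.
_lock = threading.Lock()
_running = 0
peak = 0


def fake_build(directory):
    global _running, peak
    with _lock:
        _running += 1
        peak = max(peak, _running)
    time.sleep(0.05)  # stand-in for docker build + docker push
    with _lock:
        _running -= 1
    return f"{directory}: built"


directories = [f"lambda-{i}" for i in range(8)]
results = run_bounded(directories, fake_build, max_parallel=3)
print(results)
print("peak parallel builds:", peak)
```

The pool size plays the role of the argument passed to pwait: raising it shortens the wall-clock time until CPU, memory, or registry throughput on the build host becomes the bottleneck.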

Once every docker image is built and pushed to Amazon ECR, the CodeBuild project builds every CloudFormation package by uploading all local artifacts to Amazon S3 via the aws cloudformation package CLI command, for each template in its own directory within the cfn directory. Moreover, it updates the parameters.json file in each directory, setting the parameter pEcrImageTag to the ECR image tag. This is required for CloudFormation to detect changes and update the Lambda Function with the new image tag.
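
To make the tag handling concrete, here is a small Python rendering of what the awk and jq steps in build.sh do. It is a sketch for illustration only: the function names and the sample build ID are made up, and the script itself performs these steps in shell.

```python
import json


def image_tag_from_build_id(build_id):
    """Mirror build.sh: 'build-' plus the second colon-separated field of CODEBUILD_BUILD_ID."""
    fields = build_id.split(":")
    return "build-" + (fields[1] if len(fields) > 1 else "")


def set_image_tag(parameters_text, image_tag):
    """Merge pEcrImageTag into the Parameters object, like the jq filter in build.sh."""
    document = json.loads(parameters_text)
    parameters = document.get("Parameters") or {}
    document["Parameters"] = {**parameters, "pEcrImageTag": image_tag}
    return json.dumps(document, indent=2)


# Hypothetical build ID in the "<project-name>:<uuid>" form.
tag = image_tag_from_build_id("dev-app-image-builder:6f2d9c1e-0000-4000-8000-123456789abc")
print(set_image_tag('{"Parameters": {"pEnvironment": "dev"}}', tag))
```

Because every build gets a distinct build ID, the value of pEcrImageTag changes on every run, which is what lets CloudFormation see a difference and update the functions.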

After this, the CodeBuild project will output the packaged CloudFormation templates and parameters files as an artifact to AWS CodePipeline so that it can be deployed via AWS CloudFormation in further stages. This is done by first creating a ChangeSet and then deploying it at the next stage.

Testing the microservices architecture

As stated above, the sample application utilized for the microservices architecture involving multiple Lambda Functions is a modified version of the Serverless Data Lake Framework. The microservices architecture CodePipeline deployed every AWS resource required to run the SDLF application via AWS CloudFormation stages. As part of SDLF, it also deployed a set of DynamoDB tables required for the applications to run. I utilized the meteorites sample for this, so the DynamoDB tables must be populated with the data that the application needs to run this sample.

Utilize the AWS console to write data to the Amazon DynamoDB tables. For more information, refer to this documentation. The sample json files are in the utils/DynamoDbConfig/ directory.

1. Add the record below to the octagon-Pipelines-dev DynamoDB table:

{
"description": "Main Pipeline to Ingest Data",
"ingestion_frequency": "WEEKLY",
"last_execution_date": "2020-03-11",
"last_execution_duration_in_seconds": 4.761,
"last_execution_id": "5445249c-a097-447a-a957-f54f446adfd2",
"last_execution_status": "COMPLETED",
"last_execution_timestamp": "2020-03-11T02:34:23.683Z",
"last_updated_timestamp": "2020-03-11T02:34:23.683Z",
"modules": [
{
"name": "pandas",
"version": "0.24.2"
},
{
"name": "Python",
"version": "3.7"
}
],
"name": "engineering-main-pre-stage",
"owner": "Yuri Gagarin",
"owner_contact": "y.gagarin@",
"status": "ACTIVE",
"tags": [
{
"key": "org",
"value": "VOSTOK"
}
],
"type": "INGESTION",
"version": 127
}

2. Add the record below to the octagon-Pipelines-dev DynamoDB table:

{
"description": "Main Pipeline to Merge Data",
"ingestion_frequency": "WEEKLY",
"last_execution_date": "2020-03-11",
"last_execution_duration_in_seconds": 570.559,
"last_execution_id": "0bb30d20-ace8-4cb2-a9aa-694ad018694f",
"last_execution_status": "COMPLETED",
"last_execution_timestamp": "2020-03-11T02:44:36.069Z",
"last_updated_timestamp": "2020-03-11T02:44:36.069Z",
"modules": [
{
"name": "PySpark",
"version": "1.0"
}
],
"name": "engineering-main-post-stage",
"owner": "Neil Armstrong",
"owner_contact": "n.armstrong@",
"status": "ACTIVE",
"tags": [
{
"key": "org",
"value": "NASA"
}
],
"type": "TRANSFORM",
"version": 4
}

3. Add the record below to the octagon-Datsets-dev DynamoDB table:

{
"classification": "Orange",
"description": "Meteorites Name, Location and Classification",
"frequency": "DAILY",
"max_items_process": 250,
"min_items_process": 1,
"name": "engineering-meteorites",
"owner": "NASA",
"owner_contact": "[email protected]",
"pipeline": "main",
"tags": [
{
"key": "cost",
"value": "meteorites division"
}
],
"transforms": {
"stage_a_transform": "light_transform_blueprint",
"stage_b_transform": "heavy_transform_blueprint"
},
"type": "TRANSACTIONAL",
"version": 1
}

 

If you want to create these samples using AWS CLI, please refer to this documentation.

Record 1:

aws dynamodb put-item --table-name octagon-Pipelines-dev --item '{"description":{"S":"Main Pipeline to Merge Data"},"ingestion_frequency":{"S":"WEEKLY"},"last_execution_date":{"S":"2021-03-16"},"last_execution_duration_in_seconds":{"N":"930.097"},"last_execution_id":{"S":"e23b7dae-8e83-4982-9f97-5784a9831a14"},"last_execution_status":{"S":"COMPLETED"},"last_execution_timestamp":{"S":"2021-03-16T04:31:16.968Z"},"last_updated_timestamp":{"S":"2021-03-16T04:31:16.968Z"},"modules":{"L":[{"M":{"name":{"S":"PySpark"},"version":{"S":"1.0"}}}]},"name":{"S":"engineering-main-post-stage"},"owner":{"S":"Neil Armstrong"},"owner_contact":{"S":"n.armstrong@"},"status":{"S":"ACTIVE"},"tags":{"L":[{"M":{"key":{"S":"org"},"value":{"S":"NASA"}}}]},"type":{"S":"TRANSFORM"},"version":{"N":"8"}}'

Record 2:

aws dynamodb put-item --table-name octagon-Pipelines-dev --item '{"description":{"S":"Main Pipeline to Ingest Data"},"ingestion_frequency":{"S":"WEEKLY"},"last_execution_date":{"S":"2021-03-28"},"last_execution_duration_in_seconds":{"N":"1.75"},"last_execution_id":{"S":"7e0e04e7-b05e-41a6-8ced-829d47866a6a"},"last_execution_status":{"S":"COMPLETED"},"last_execution_timestamp":{"S":"2021-03-28T20:23:06.031Z"},"last_updated_timestamp":{"S":"2021-03-28T20:23:06.031Z"},"modules":{"L":[{"M":{"name":{"S":"pandas"},"version":{"S":"0.24.2"}}},{"M":{"name":{"S":"Python"},"version":{"S":"3.7"}}}]},"name":{"S":"engineering-main-pre-stage"},"owner":{"S":"Yuri Gagarin"},"owner_contact":{"S":"y.gagarin@"},"status":{"S":"ACTIVE"},"tags":{"L":[{"M":{"key":{"S":"org"},"value":{"S":"VOSTOK"}}}]},"type":{"S":"INGESTION"},"version":{"N":"238"}}'

Record 3:

aws dynamodb put-item --table-name octagon-Datsets-dev --item '{"classification":{"S":"Orange"},"description":{"S":"Meteorites Name, Location and Classification"},"frequency":{"S":"DAILY"},"max_items_process":{"N":"250"},"min_items_process":{"N":"1"},"name":{"S":"engineering-meteorites"},"owner":{"S":"NASA"},"owner_contact":{"S":"[email protected]"},"pipeline":{"S":"main"},"tags":{"L":[{"M":{"key":{"S":"cost"},"value":{"S":"meteorites division"}}}]},"transforms":{"M":{"stage_a_transform":{"S":"light_transform_blueprint"},"stage_b_transform":{"S":"heavy_transform_blueprint"}}},"type":{"S":"TRANSACTIONAL"},"version":{"N":"1"}}'
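To check that the records were written, one option is to count the items in each table. The following is a minimal sketch rather than part of the original walkthrough: the helper name count_items is hypothetical, and it assumes the AWS CLI is already configured for the account and Region that hold the tables.

```shell
# Hypothetical helper: print the number of items in a DynamoDB table ($1).
# It runs a full table scan, which is acceptable for a handful of sample records.
count_items() {
  aws dynamodb scan \
    --table-name "$1" \
    --select COUNT \
    --query Count \
    --output text
}
```

For example, count_items octagon-Pipelines-dev prints the item count for the pipelines table, which you can compare against the number of records you added.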

Now upload the sample JSON files to the raw S3 bucket. The raw S3 bucket name is available in the outputs of the common-cloudformation stack deployed as part of the microservices architecture CodePipeline. Navigate to the CloudFormation console in the Region where the CodePipeline was deployed, locate the stack named common-cloudformation, open the Outputs section, and note the bucket name with the key oCentralBucket.

Navigate to the Amazon S3 console and locate the bucket for oCentralBucket, create the nested folder path engineering/meteorites, and upload every sample JSON file to it. The meteorites sample JSON files are available in the utils/meteorites-test-json-files directory of the previously cloned repository.

Wait a few minutes and then navigate to the stage bucket, whose name is the common-cloudformation stack output oStageBucket. The JSON files appear converted to CSV in the pre-stage/engineering/meteorites folder. Wait a few more minutes and then navigate to the post-stage/engineering/meteorites folder in the same bucket to see the CSV files converted to Parquet format.
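The console steps above can also be approximated from the AWS CLI. This is a sketch under stated assumptions, not the method from the original post: the helper names stack_output and upload_meteorites are hypothetical, and the sketch assumes the CLI is configured for the Region where the stacks were deployed.

```shell
# Hypothetical helper: read one output value ($2) from a CloudFormation stack ($1).
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}

# Hypothetical helper: copy every .json file in a local directory ($1)
# to the engineering/meteorites prefix of a bucket ($2).
upload_meteorites() {
  aws s3 cp "$1" "s3://$2/engineering/meteorites/" \
    --recursive --exclude "*" --include "*.json"
}
```

A possible invocation from the root of the cloned repository is upload_meteorites utils/meteorites-test-json-files "$(stack_output common-cloudformation oCentralBucket)". Amazon S3 has no real directories, so copying to the prefix creates the engineering/meteorites path implicitly.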

 

Cleanup

Navigate to the AWS CloudFormation console, note the S3 bucket names from the common-cloudformation stack outputs, and empty the S3 buckets. Refer to Emptying the Bucket for more information.

Delete the CloudFormation stacks in the following order:
1. common-cloudformation
2. stagea
3. stageb
4. sdlf-engineering-meteorites

Then delete the infrastructure CloudFormation stack datalake-infra-resources deployed using the codepipeline.yaml template. Refer to the following documentation to delete CloudFormation stacks: Deleting a stack on the AWS CloudFormation console or Deleting a stack using AWS CLI.
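Because the deletion order matters, a script should wait for each stack to finish deleting before starting the next. The sketch below illustrates one way to do that with the AWS CLI; the helper name delete_stacks_in_order is hypothetical, and the buckets are assumed to have been emptied already.

```shell
# Hypothetical helper: delete stacks one at a time, in the order given,
# waiting for each deletion to finish before starting the next.
# Stops at the first failure so later stacks are left untouched.
delete_stacks_in_order() {
  for stack in "$@"; do
    echo "Deleting $stack"
    aws cloudformation delete-stack --stack-name "$stack" || return 1
    aws cloudformation wait stack-delete-complete --stack-name "$stack" || return 1
  done
}
```

Pass the four stack names in the order listed above, spelled exactly as they appear in your account, and then run the helper again for datalake-infra-resources.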

 

Conclusion

This method lets us use CI/CD via CodePipeline, CodeCommit, and CodeBuild, along with other AWS services, to automatically deploy container images to the Lambda functions that are part of the microservices architecture. Furthermore, we can build a common layer, equivalent to a Lambda layer, independently via its own CodePipeline, and then build its container image and push it to Amazon ECR. The Amazon ECR repository for the common layer image then serves as one source for the microservices architecture CodePipeline, alongside the CodeCommit repository that holds the code for the microservices. Having two sources lets the pipeline rebuild every Docker image in either case: when the common layer image that the other Docker images refer to changes, or when the code for the other microservices, including the Lambda functions, changes.

 

About the Author

Kirankumar Chandrashekar is a Sr. DevOps consultant at AWS Professional Services. He focuses on leading customers in architecting DevOps technologies. Kirankumar is passionate about DevOps, Infrastructure as Code, and solving complex customer issues. He enjoys music, as well as cooking and traveling.

 

Build and deploy .NET web applications to ARM-powered AWS Graviton 2 Amazon ECS Clusters using AWS CDK

Post Syndicated from Matt Laver original https://aws.amazon.com/blogs/devops/build-and-deploy-net-web-applications-to-arm-powered-aws-graviton-2-amazon-ecs-clusters-using-aws-cdk/

With .NET providing first-class support for ARM architecture, running .NET applications on an AWS Graviton processor provides you with more choices to help optimize performance and cost. We have already written about .NET 5 with Graviton benchmarks; in this post, we explore how C#/.NET developers can take advantage of Graviton processors and obtain this performance at scale with Amazon Elastic Container Service (Amazon ECS).

In addition, we take advantage of infrastructure as code (IaC) by using the AWS Cloud Development Kit (AWS CDK) to define the infrastructure.

The AWS CDK is an open-source development framework to define cloud applications in code. It includes constructs for Amazon ECS resources, which allows you to deploy fully containerized applications to AWS.

Architecture overview

Our target architecture for our .NET application running in AWS is a load balanced ECS cluster, as shown in the following diagram.

Figure: Show load balanced Amazon ECS Cluster running .NET application

We need to provision many components in this architecture, but this is where the AWS CDK comes in. AWS CDK is an open-source software development framework to define cloud resources using familiar programming languages. We use it for the following:

  • Build a multi-stage .NET application container
  • Create an Amazon Elastic Container Registry (Amazon ECR) repository and push the Docker image to it
  • Use IaC written in .NET to provision the preceding architecture

The following diagram illustrates how we use these services.

Figure: Show Application and Infrastructure code written in .NET

Set up the development environment

To deploy this solution on AWS, we use the AWS Cloud9 development environment.

  1. On the AWS Cloud9 console, choose Create environment.
  2. For Name, enter a name for the environment.
  3. Choose Next step.
  4. On the Environment settings page, keep the default settings:
    1. Environment type – Create a new EC2 instance for the environment (direct access)
    2. Instance type – t2.micro (1 GiB RAM + 1 vCPU)
    3. Platform – Amazon Linux 2 (recommended)

    Figure: Show Cloud9 Environment settings

  5. Choose Next step.
  6. Choose Create environment.

When the Cloud9 environment is ready, proceed to the next section.

Install the .NET SDK

The AWS development tools we require are already set up in the Cloud9 environment; however, the .NET SDK is not.

Install the .NET SDK with the following code from the Cloud9 terminal:

curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin -c 5.0
export PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/.dotnet

Verify the expected version has been installed:

dotnet --version

Figure: Show installed .NET SDK version
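If you script the setup, the version check can be automated rather than read by eye. The following sketch is an illustration only; the helper name has_major_version is hypothetical and is not part of the .NET tooling.

```shell
# Hypothetical helper: succeed if a version string ($1) starts with
# the given major version ($2) followed by a dot, for example 5.0.x for 5.
has_major_version() {
  case "$1" in
    "$2".*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, has_major_version "$(dotnet --version)" 5 || echo "unexpected .NET SDK version" prints a warning when the installed SDK is not a 5.x release.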

Clone and explore the example code

Clone the example repository:

git clone https://github.com/aws-samples/aws-cdk-dotnet-graviton-ecs-example.git

This repository contains two .NET projects: the web application and the IaC application that uses the AWS CDK.

The unit of deployment in the AWS CDK is called a stack. All AWS resources defined within the scope of a stack, either directly or indirectly, are provisioned as a single unit.

The stack for this project is located within /cdk/src/Cdk/CdkStack.cs. When we read the C# code, we can see how it aligns with the architecture diagram at the beginning of this post.

First, we create a virtual private cloud (VPC) and assign a maximum of two Availability Zones:

var vpc = new Vpc(this, "DotNetGravitonVpc", new VpcProps { MaxAzs = 2 });

Next, we define the cluster and assign it to the VPC:

var cluster = new Cluster(this, "DotNetGravitonCluster", new ClusterProps { Vpc = vpc });

The Graviton instance type (c6g.4xlarge) is defined in the cluster capacity options:

cluster.AddCapacity("DefaultAutoScalingGroupCapacity",
    new AddCapacityOptions
    {
        InstanceType = new InstanceType("c6g.4xlarge"),
        MachineImage = EcsOptimizedImage.AmazonLinux2(AmiHardwareType.ARM)
    });

Finally, ApplicationLoadBalancedEc2Service is defined, along with a reference to the application source code:

new ApplicationLoadBalancedEc2Service(this, "Service",
    new ApplicationLoadBalancedEc2ServiceProps
    {
        Cluster = cluster,
        MemoryLimitMiB = 8192,
        DesiredCount = 2,
        TaskImageOptions = new ApplicationLoadBalancedTaskImageOptions
        {
            Image = ContainerImage.FromAsset(Path.Combine(Directory.GetCurrentDirectory(), @"../app")),                        
        }                             
    });

With about 30 lines of AWS CDK code written in C#, we achieve the following:

  • Build and package a .NET application within a Docker image
  • Push the Docker image to Amazon Elastic Container Registry (Amazon ECR)
  • Create a VPC with two Availability Zones
  • Create a cluster with a Graviton c6g.4xlarge instance type that pulls the Docker image from Amazon ECR

The AWS CDK has several useful helpers, such as the FromAsset function:

Image = ContainerImage.FromAsset(Path.Combine(Directory.GetCurrentDirectory(), @"../app")),

The ContainerImage.FromAsset function instructs the AWS CDK to build the Docker image from a Dockerfile, automatically create an Amazon ECR repository, and upload the image to the repository.

For more information about the ContainerImage class, see ContainerImage.

Build and deploy the project with the AWS CDK Toolkit

The AWS CDK Toolkit, the CLI command cdk, is the primary tool for interaction with AWS CDK apps. It runs the app, interrogates the application model you defined, and produces and deploys the AWS CloudFormation templates generated by the AWS CDK.

If an AWS CDK stack being deployed uses assets such as Docker images, the environment needs to be bootstrapped. Use the cdk bootstrap command from the /cdk directory:

cdk bootstrap

Now you can deploy the stack into the AWS account with the deploy command:

cdk deploy

The AWS CDK Toolkit synthesizes fresh CloudFormation templates locally before deploying anything. The first time this runs, it has a changeset that reflects all the infrastructure defined within the stack and prompts you for confirmation before running.

When the deployment is complete, the load balancer DNS is in the Outputs section.

Figure: Show stack outputs

You can navigate to the load balancer address via a browser.

Figure: Show browser navigating to .NET application

Tracking the drift

Typically, drift is a change that happens outside of the infrastructure as code, for example, a code update to the .NET application.

To support changes, the AWS CDK Toolkit queries the AWS account for the last deployed CloudFormation template for the stack and compares it with the locally generated template. Preview the changes with the following code:

cdk diff

If a simple text change is made to the application’s home page HTML (app/webapp/Pages/Index.cshtml), a difference is detected only within the assets, rather than across all the infrastructure as in the first deployment.

Figure: Show cdk diff output

Running cdk deploy again now rebuilds the Docker image, uploads it to Amazon ECR, and refreshes the containers within the ECS cluster.

cdk deploy

Figure: Show browser navigating to updated .NET application

Clean up

Remove the resources created in this post with the following code:

cdk destroy

Conclusion

Using the AWS CDK to provision infrastructure in .NET provides rigor, clarity, and reliability in a language familiar to .NET developers. For more information, see Infrastructure as Code.

This post demonstrates the low barrier to entry for .NET developers wanting to apply modern application development practices while taking advantage of the price performance of ARM-based processors such as Graviton.

To learn more about building and deploying .NET applications on AWS, visit our .NET Developer Center.

About the author

 

Matt Laver is a Solutions Architect at AWS working with SMB customers in the UK. He is passionate about DevOps and loves helping customers find simple solutions to difficult problems.

 

Build and Deploy Docker Images to AWS using EC2 Image Builder

Post Syndicated from Joseph Keating original https://aws.amazon.com/blogs/devops/build-and-deploy-docker-images-to-aws-using-ec2-image-builder/

The NFL, an AWS Professional Services partner, is collaborating with NFL’s Player Health and Safety team to build the Digital Athlete Program. The Digital Athlete Program is working to drive progress in the prevention, diagnosis, and treatment of injuries; enhance medical protocols; and further improve the way football is taught and played. The NFL, in conjunction with AWS Professional Services, delivered an EC2 Image Builder pipeline for automating the production of Docker images. Following similar practices from the Digital Athlete Program, this post demonstrates how to deploy an automated Image Builder pipeline.

“AWS Professional Services faced unique environment constraints, but was able to deliver a modular pipeline solution leveraging EC2 Image Builder. The framework serves as a foundation to create hardened images for future use cases. The team also provided documentation and knowledge transfer sessions to ensure our team was set up to successfully manage the solution.”—Joseph Steinke, Director, Data Solutions Architect, National Football League

A common scenario you may face is how to build Docker images that can be utilized throughout your organization. You may already have existing processes that you’re looking to modernize. You may be looking for a streamlined, managed approach so you can reduce the overhead of operating your own workflows. Additionally, if you’re new to containers, you may be seeking an end-to-end process you can use to deploy containerized workloads. In either case, there is a need for a modern, streamlined approach to centralize the configuration and distribution of Docker images. This post demonstrates how to build an end-to-end workflow for building secure Docker images.

Image Builder now offers a managed service for building Docker images. With Image Builder, you can automatically produce new up-to-date container images and publish them to specified Amazon Elastic Container Registry (Amazon ECR) repositories after running stipulated tests. You don’t need to worry about the underlying infrastructure. Instead, you can focus simply on your container configuration and use the AWS tools to manage and distribute your images. In this post, we walk through the process of building a Docker image and deploying the image to Amazon ECR, share some security best practices, and demonstrate deploying a Docker image to Amazon Elastic Container Service (Amazon ECS). Additionally, we dive deep into building Docker images following modern principles.

The project we create in this post addresses a use case in which an organization needs an automated workflow for building, distributing, and deploying Docker images. With Image Builder, we build and deploy Docker images, and then locally test the image that our Image Builder pipeline created.

 

Solution Overview

The following diagram illustrates our solution architecture.

Figure: Show the architecture of the Docker EC2 Image Builder Pipeline

 

We configure the Image Builder pipeline with AWS CloudFormation. Then we use Amazon Simple Storage Service (Amazon S3) as our source for the pipeline. This means that when we want to update the pipeline with a new Dockerfile, we have to update the source S3 bucket. The pipeline assumes an AWS Identity and Access Management (IAM) role that we generate later in the post. When the pipeline is run, it pulls the latest Dockerfile configuration from Amazon S3, builds a Docker image, and deploys the image to Amazon ECR. Finally, we use AWS Copilot to deploy our Docker image to Amazon ECS. For more information about Copilot, see Applications.

The style in which the Dockerfile application code was written is a personal preference. For more information, see Best practices for writing Dockerfiles.

 

Overview of AWS services

For this post, we use the following services:

  • EC2 Image Builder – Image Builder is a fully managed AWS service that makes it easy to automate the creation, management, and deployment of customized, secure, and up-to-date server images that are pre-installed and pre-configured with software and settings to meet specific IT standards.
  • Amazon ECR – Amazon ECR is an AWS managed container image registry service that is secure, scalable, and reliable.
  • CodeCommit – AWS CodeCommit is a fully managed source control service that hosts secure Git-based repositories.
  • AWS KMS – AWS Key Management Service (AWS KMS) is a fully managed service for creating and managing cryptographic keys. These keys are natively integrated with most AWS services. You use a KMS key in this post to encrypt resources.
  • Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service utilized for storing and encrypting data. We use Amazon S3 to store our configuration files.
  • AWS CloudFormation – AWS CloudFormation allows you to use domain-specific languages or simple text files to model and provision, in an automated and secure manner, all the resources needed for your applications across all Regions and accounts. You can deploy AWS resources in a safe, repeatable manner, and automate the provisioning of infrastructure.

 

Prerequisites

To provision the pipeline deployment, you must have the following prerequisites:

 

CloudFormation templates

You use the following CloudFormation templates to deploy several resources:

  • vpc.yml – Contains all the core networking configuration. It deploys the VPC, two private subnets, two public subnets, and the route tables. The private subnets utilize a NAT gateway to communicate to the internet. The public subnets have full outbound access to the internet gateway.
  • kms.yml – Contains the AWS Key Management Service (AWS KMS) configuration that we use for encrypting resources. The KMS key policy is also configured in this template.
  • s3-iam-config.yml – Contains the S3 bucket and IAM roles we use with our Image Builder pipeline.
  • docker-image-builder.yml – Contains the configuration for the Image Builder pipeline that we use to build Docker images.

 

Docker Overview

Containerizing an application comes with many benefits. By containerizing an application, the application is decoupled from the underlying infrastructure, greater consistency is gained across environments, and the application can now be deployed in a loosely coupled microservice model. The lightweight nature of containers enables teams to spend less time configuring their application and more time building features that create value for their customers. To achieve these great benefits, you need reliable resources to centralize the creation and distribution of your container images. Additionally, you need to understand container fundamentals. Let’s start by reviewing a Docker base image.

In this post, we follow the multi-stage pattern for building our Docker image. With this approach, we can selectively copy artifacts from one phase to another. This allows you to remove anything not critical to the application’s function in the final image. Let’s walk through some of the logic we put into our Docker image to optimize performance and security.

Let’s begin by looking at lines 15-25 of the Dockerfile. Here, we are pulling down the latest amazon/aws-cli Docker image. We are leveraging this image so that we can utilize IAM credentials to clone our CodeCommit repository. In lines 15-24, we install and configure Git. Finally, in line 25, we clone our application code from our repository.

In this next section, we set environment variables, install packages, unpack tar files, and set up a custom Java Runtime Environment (JRE). Amazon Corretto is a no-cost, multi-platform, production-ready distribution of the Open Java Development Kit (OpenJDK). One important distinction to make here is how we are utilizing RUN and ADD in the Dockerfile. By configuring our own custom JRE we can remove unnecessary modules from our image. One of our goals with building Docker images is to keep them lightweight, which is why we are taking the extra steps to ensure that we don’t add any unnecessary configuration.

Let’s take a look at the next section of the Dockerfile. Now that we have all the packages that we require, we create a working directory where we install our demo app. After the application code is pulled down from CodeCommit, we use Maven to build our artifact.

In the following code snippet, we use FROM to begin a new stage in our build. Notice that we are using the same base as our first stage. If objects on the disk/filesystem in the first stage stay the same, the previous stage cache can be reused. Using this pattern can greatly reduce build time.

Docker images have a single unique digest. This is a SHA-256 value and is known as the immutable identifier for the image. When changes are made to your image, through a Dockerfile update for example, a new image with a new immutable identifier is generated. Pinning the immutable identifier prevents unexpected behaviors in code due to a change or update. You can also prevent man-in-the-middle attacks by adopting this pattern. Additionally, using a SHA can mitigate the risk of having to rely on mutable tags that can be applied or changed to the wrong image by mistake. You can use the following command to check that no unintended changes occurred.

docker images <input_container_image_id> --digests
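To make the digest pinning described above easy to enforce, a script can check whether an image reference is pinned to a digest instead of a mutable tag. This is a minimal sketch, not something from the original Dockerfile; the helper name is_digest_pinned is hypothetical.

```shell
# Hypothetical helper: succeed only if an image reference ($1) is pinned by digest,
# that is, it ends in @sha256: followed by exactly 64 lowercase hex characters.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}
```

A reference such as amazon/aws-cli@sha256:<digest> passes the check, while amazon/aws-cli:latest does not. A check like this could run in a pipeline step against each FROM line before a build.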

Lastly, we configure our final stage, in which we create a user and group to manage our application inside the container. As this user, we copy the binaries created from our first stage. With this pattern, you can clearly see the benefit of using stages when building Docker images. Finally, we use EXPOSE to note the port that should be published for the container, and we define our ENTRYPOINT, which is the instruction we use to run our container.

 

Deploying the CloudFormation templates

To deploy your templates, complete the following steps:

1. Create a directory where we store all of our demo code by running the following from your terminal:

mkdir awsblogrepo && cd awsblogrepo

 

2. Clone the source code repository found in the following location:

git clone https://github.com/aws-samples/build-and-deploy-docker-images-to-aws-using-ec2-image-builder.git

You now use the AWS CLI to deploy the CloudFormation templates. Make sure to leave the CloudFormation template names as written in this post.

 

3. Deploy the VPC CloudFormation template:

aws cloudformation create-stack \
--stack-name vpc-config \
--template-body file://templates/vpc.yml \
--parameters file://parameters/vpc-params.json  \
--capabilities CAPABILITY_IAM \
--region us-east-1

The output should look like the following code:

{
    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/vpc-config/12e90fe0-76c9-11eb-9284-12717722e021"
}
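The create-stack command returns as soon as CloudFormation accepts the request, not when the stack is ready. Because later parameter files pass earlier stack names (for example, NetworkStackName and KMSStackName), it is reasonable to assume that each stack should finish creating before the next one is deployed. The sketch below shows one way to wait; the helper name wait_for_stack is hypothetical.

```shell
# Hypothetical helper: block until a stack ($1) finishes creating in us-east-1,
# then print its final status. Returns non-zero if the waiter fails.
wait_for_stack() {
  aws cloudformation wait stack-create-complete \
    --stack-name "$1" --region us-east-1 || return 1
  aws cloudformation describe-stacks \
    --stack-name "$1" --region us-east-1 \
    --query "Stacks[0].StackStatus" --output text
}
```

For example, wait_for_stack vpc-config should print CREATE_COMPLETE once the VPC stack is ready.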

 

4. Open the parameters/kms-params.json file and update the UserARN parameter with your account ID:

[
  {
      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
  },
  {
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::<input_your_account_id>:root"
  }
]

 

5. Deploy the KMS key CloudFormation template:

aws cloudformation create-stack \
--stack-name kms-config \
--template-body file://templates/kms.yml \
--parameters file://parameters/kms-params.json \
--capabilities CAPABILITY_IAM \
--region us-east-1

The output should look like the following:

{
    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/kms-config/66a663d0-777d-11eb-ad2b-0e84b19d341f"
}

 

6. Open the parameters/s3-iam-config.json file and update the DemoConfigS3BucketName parameter to a unique name of your choosing:

[
  {
    "ParameterKey" : "Environment",
    "ParameterValue" : "dev"
  },
  {
    "ParameterKey": "NetworkStackName",
    "ParameterValue" : "vpc-config"
  },
  {
    "ParameterKey" : "EC2InstanceRoleName",
    "ParameterValue" : "EC2InstanceRole"
  },
  {
    "ParameterKey" : "DemoConfigS3BucketName",
    "ParameterValue" : "<input_your_unique_bucket_name>"
  },
  {
    "ParameterKey" : "KMSStackName",
    "ParameterValue" : "kms-config"
  }
]

 

7. Deploy the IAM role configuration template:

aws cloudformation create-stack \
--stack-name s3-iam-config \
--template-body file://templates/s3-iam-config.yml \
--parameters file://parameters/s3-iam-config.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The output should look like the following:

{
    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/s3-iam-config/8b69c270-7782-11eb-a85c-0ead09d00613"
}

 

8. Open the parameters/kms-params.json file:

[
  {
      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
  },
  {
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::123456789012:root"
  }
]

 

9. Add the following values as a comma-separated list to the UserARN parameter key. Make sure to provide your AWS account ID:

arn:aws:iam::<input_your_aws_account_id>:role/EC2ImageBuilderRole

When finished, the file should look similar to the following:

[
  {
      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
  },
  {
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::123456789012:role/EC2ImageBuilderRole,arn:aws:iam::123456789012:root"
  }
]
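Since the UserARN value is easy to mistype, it can be generated from the account ID instead. The following is a small sketch, assuming the two principals shown above are the only ones required; the helper name user_arn_value is hypothetical.

```shell
# Hypothetical helper: print the comma-separated UserARN value for an account ID ($1).
# AWS account IDs are exactly 12 digits, so anything else is rejected.
user_arn_value() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "account ID must be exactly 12 digits" >&2; return 1 ;;
  esac
  printf 'arn:aws:iam::%s:role/EC2ImageBuilderRole,arn:aws:iam::%s:root\n' "$1" "$1"
}
```

Running user_arn_value with your account ID prints a string you can paste into the ParameterValue field of parameters/kms-params.json.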

Now that the AWS KMS parameter file has been updated, you update the AWS KMS CloudFormation stack.

 

10. Run the following command to update the kms-config stack:

aws cloudformation update-stack \
--stack-name kms-config \
--template-body file://templates/kms.yml \
--parameters file://parameters/kms-params.json \
--capabilities CAPABILITY_IAM \
--region us-east-1

The output should look like the following:

{
    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/kms-config/66a663d0-777d-11eb-ad2b-0e84b19d341f"
}

 

11. Open the parameters/docker-image-builder-params.json file and update the ImageBuilderBucketName parameter to the bucket name you generated earlier:

[
  {
    "ParameterKey": "Environment",
    "ParameterValue": "dev"
  },
  {
      "ParameterKey": "ImageBuilderBucketName",
      "ParameterValue": "<input_your_s3_bucket_name>"
  },
  {
      "ParameterKey": "NetworkStackName",
      "ParameterValue": "vpc-config"
  },
  {
      "ParameterKey": "KMSStackName",
      "ParameterValue": "kms-config"
  },
  {
      "ParameterKey": "S3ConfigStackName",
      "ParameterValue": "s3-iam-config"
  },
  {
      "ParameterKey": "ECRName",
      "ParameterValue": "demo-ecr"
  }
]

 

12. Run the following commands to upload the Dockerfile and component file to Amazon S3. Make sure to update the S3 bucket name with the name you generated earlier:

aws s3 cp java/Dockerfile s3://<input_your_bucket_name>/Dockerfile && \
aws s3 cp components/component.yml s3://<input_your_bucket_name>/component.yml

The output should look like the following:

upload: java/Dockerfile to s3://demo12345/Dockerfile
upload: components/component.yml to s3://demo12345/component.yml

 

13. Deploy the docker-image-builder.yml template:

aws cloudformation create-stack \
--stack-name docker-image-builder-config \
--template-body file://templates/docker-image-builder.yml \
--parameters file://parameters/docker-image-builder-params.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The output should look like the following:

{
    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/docker-image-builder-config/24317190-76f4-11eb-b879-0afa5528cb21"
}

 

Configure the Repository

You use AWS CodeCommit as your source control repository. Complete the following steps to configure the CodeCommit repository:

 

1. On the CodeCommit console, choose Repositories.

 

2. Locate your repository and under Clone URL, choose HTTPS.

Figure: Shows DemoRepo CodeCommit Repository

You clone this repository in the build directory you created when deploying the CloudFormation templates.

 

3. In your terminal, paste the Git URL from the previous step and clone the repository:

git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/DemoRepo

 

4. Now let’s create and push your main branch:

cd DemoRepo
git checkout -b main
touch initial.txt
git add . && git commit -m "Initial commit"
git push -u origin main

 

5. On the Repositories page of the CodeCommit console, choose DemoRepo.

The following screenshot shows that we have created our main branch and pushed our first commit to our repository.

Figure: Shows the DemoRepo main branch

 

6. Back in your terminal, create a new feature branch:

git checkout -b feature/configure-repo

 

7. Create the build directories:

mkdir templates parameters java components

You now copy over the configuration files from the cloned GitHub repository to our CodeCommit repository.

 

8. Run the following command from the awsblogrepo directory you created earlier:

cp -r build-and-deploy-docker-images-to-aws-using-ec2-image-builder/* DemoRepo/

 

9. Change into the DemoRepo directory, then commit and push your changes:

cd DemoRepo
git add . && git commit -m "Copying config files into source control."
git push --set-upstream origin feature/configure-repo

 

10. On the CodeCommit console, navigate to DemoRepo.

Figure: Shows the DemoRepo CodeCommit Repository

 

11. In the navigation pane, under Repositories, choose Branches.

Figure: Shows the DemoRepo’s code

 

12. Select the feature/configure-repo branch.

Figure: Shows the DemoRepo’s branches

 

13. Choose Create pull request.

Figure: Shows the DemoRepo code

 

14. For Title, enter Repository Configuration.

 

15. For Description, enter a brief description.

 

16. Choose Create pull request.

Figure: Shows a pull request for DemoRepo

 

17. Choose Merge to merge the pull request.

Figure: Shows merge for DemoRepo pull request

Now that all the code is copied into your CodeCommit repository, you can build an image using the Image Builder pipeline.

 

EC2 Image Builder Deep Dive

With Image Builder, you can build and deploy Docker images to your AWS account. Let’s look at how your Image Builder pipeline is configured.

A recipe defines the source image to use as your starting point to create a new image, along with the set of components that you add to customize your image and verify that everything is working as expected. Take note of the ParentImage property. Here, you declare that the parent image your pipeline pulls is the latest Amazon Linux image. This enables organizations to define approved images for downstream use by development teams. Having better control over which Docker images development teams use improves an organization’s security posture while keeping the tools developers need readily available. The DockerfileTemplateUri property refers to the location of the Dockerfile that your Image Builder pipeline is deploying. Take some time to review the configuration.

 

Run the Image Builder Pipeline

Now you build a Docker image by running the pipeline.

1. Update your account ID and run the following command:

aws imagebuilder start-image-pipeline-execution \
--image-pipeline-arn arn:aws:imagebuilder:us-east-1:<input_your_aws_account_id>:image-pipeline/docker-image-builder-config-docker-java-container

The output should look like the following:

{
    "requestId": "87931a2e-cd74-44e9-9be1-948fec0776aa",
    "clientToken": "e0f710be-0776-43ea-a6d7-c10137a554bf",
    "imageBuildVersionArn": "arn:aws:imagebuilder:us-east-1:123456789012:image/docker-image-builder-config-container-recipe/1.0.0/1"
}

 

2. On the Image Builder console, choose the docker-image-builder-config-docker-java-container pipeline.

Figure: Shows EC2 Image Builder Pipeline status

At the bottom of the page, a new Docker image is building.

 

3. Wait until the image status becomes Available.

Figure: Shows docker image building in EC2 Image Builder console

 

4. On the Amazon ECR console, open demo-java-ib.

The Docker image has been successfully created, tagged, and deployed to Amazon ECR from the Image Builder pipeline.

Figure: Shows demo-java-ib image in ECR

 

Test the Docker Image Locally

1. On the Amazon ECR console, open demo-java-ib.

 

2. Copy the image URI.

 

3. Run the following commands to authenticate to your ECR repository:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <input_your_account_id>.dkr.ecr.us-east-1.amazonaws.com

 

4. Run the following command in your terminal, and update the Amazon ECR URI with the content you copied from the previous step:

docker pull <input_ecr_image_uri>

You should see output similar to the following:

1.0.0-80: Pulling from demo-java-ib
596ba82af5aa: Pull complete 
6f476912a053: Pull complete 
3e7162a86ef8: Pull complete 
ec7d8bb8d044: Pull complete 
Digest: sha256:14668cda786aa496f406062ce07087d66a14a7022023091e9b953aae0bdf2634
Status: Downloaded newer image for 123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-java-ib:1.0.0-1
123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-java-ib:1.0.0-1

 

5. Run the following command from your terminal:

docker image ls

You should see output similar to the following:

REPOSITORY                                                  TAG        IMAGE ID       CREATED          SIZE
123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-java-ib   1.0.0-1   ac75e982863c   34 minutes ago   47.3MB

 

6. Run the following command from your terminal using the IMAGE ID value from the previous output:

docker run -dp 8090:8090 --name java_hello_world -it <docker_image_id> sh

You should see an output similar to the following:

49ea3a278639252058b55ab80c71245d9f00a3e1933a8249d627ce18c3f59ab1

 

7. Test your container by running the following command:

curl localhost:8090

You should see an output similar to the following:

Hello World!

 

8. Now that you have verified that your container is working properly, you can stop your container. Run the following command from your terminal:

docker stop java_hello_world

 

Conclusion

In this article, we showed how to leverage AWS services to automate the creation, management, and distribution of Docker Images. We walked through how to configure EC2 Image Builder to create and distribute Docker images. Finally, we built a Docker image using our EC2 Image Builder pipeline and tested the image locally. Thank you for reading!

 

 

 

Joe Keating is a Modernization Architect in Professional Services at Amazon Web Services. He works with AWS customers to design and implement a variety of solutions in the AWS Cloud. Joe enjoys cooking with a glass or two of wine and achieving mediocrity on the golf course.

 

 

 

Virginia Chu is a Sr. Cloud Infrastructure Architect in Professional Services at Amazon Web Services. She works with enterprise-scale customers around the globe to design and implement a variety of solutions in the AWS Cloud.

 

 

 

BK works as a Senior Security Architect with AWS Professional Services. He loves to solve security problems for his customers and help them feel comfortable within AWS. Outside of work, BK loves to play computer games and go on long drives.

Scaling Zabbix with containers

Post Syndicated from Robert Silva original https://blog.zabbix.com/scaling-zabbix-with-containers/13155/

This post explains a new approach to running Zabbix in High Availability and discusses the challenges of implementing Zabbix with containers, Docker Swarm, GitLab, and CI/CD.

Contents

I. Zabbix project requirements (0:33)
II. New approach (3:06)
III. Compose file and Deploy (8:08)
IV. Notes (16:32)
V. Gitlab CI/CD (20:34)
VI. Benefits of the architecture (24:57)
VII. Questions & Answers (25:53)

Zabbix project requirements

The first time using Docker was a challenge. The Zabbix environment needed to meet the following requirements:

  • to monitor more than 3,000 NVPS;
  • to be fault-tolerant;
  • to be resilient;
  • to scale the environment horizontally.

There are five ways to install Zabbix — using packages, compiling, Docker, cloud, or appliance.

We used virtual machines or physical servers to install Zabbix directly on the operating system. In this scenario, it is necessary to install the operating system and update it to improve performance. Then you need to install Zabbix and configure backups of the configuration files and the database.

However, with such an installation, when a service becomes unavailable because Zabbix Server or the Zabbix frontend is down, the usual solution is human intervention: restarting the service or the server, creating a new instance, or restoring the backup.

Still, we should not need to assign a specialist to solve such issues manually. The services must be able to restore themselves.

To create a more intelligent environment, we can use some standard solutions — Corosync and Pacemaker. However, there are better solutions for High Availability.

New approach

Zabbix can be deployed using advanced technologies, such as:

  • Docker,
  • Docker Swarm,
  • Reverse Proxy,
  • GIT,
  • CI/CD.

Initially, the instance was divided into various components.

Initial architecture

HAProxy

HAProxy is responsible for receiving incoming connections and directing them to the nodes of the Docker Swarm cluster. Each attempt to access the Zabbix frontend is sent to HAProxy, which detects the nodes where the service is listening and redirects the request there.

Accessing the frontend.domain

The request is sent to the HAProxy address, and HAProxy checks which nodes are available. If a node is unavailable, HAProxy stops sending requests to it.
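The availability check described here can be sketched as a small Python model. This is only an illustration of the idea, not HAProxy's implementation: the Backend class and its method names are hypothetical, plain round-robin selection is assumed, and the server names simply mirror the DOCKERHOST entries used in this post's haproxy.cfg.

```python
from typing import List, Optional


class Backend:
    """Toy model of a load-balancer backend that skips unavailable servers."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)
        self.healthy = set(self.servers)  # servers whose health check passes
        self._next = 0                    # round-robin cursor

    def mark_down(self, server: str) -> None:
        # Called when a health check fails: stop routing to this server.
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        # Called when a health check succeeds again.
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> Optional[str]:
        # Try each server at most once, starting at the cursor position.
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        return None  # no server is currently available


backend = Backend(["DOCKERHOST1", "DOCKERHOST2", "DOCKERHOST3"])
backend.mark_down("DOCKERHOST2")
print([backend.pick() for _ in range(4)])
```

With DOCKERHOST2 marked down, requests alternate between the two remaining servers until a later check marks it up again.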

HAProxy configuration file (haproxy.cfg)

When you configure load balancing using HAProxy, two types of nodes need to be defined: frontend and backend. Here, the traefik service is used as an example.

HAProxy listens for connections on the frontend node. In the frontend, we configure the port that receives connections and associate a backend with it.

frontend traefik
    mode http
    bind 0.0.0.0:80
    option forwardfor
    monitor-uri /health
    default_backend backend_traefik

HAProxy forwards requests to the backend nodes. In the backend, we define which services are using the traefik service, the check mode, the servers running the application, and the port to listen on.

backend backend_traefik
    mode http
    cookie Zabbix prefix
    server DOCKERHOST1 10.250.6.52:8080 cookie DOCKERHOST1 check
    server DOCKERHOST2 10.250.6.53:8080 cookie DOCKERHOST2 check
    server DOCKERHOST3 10.250.6.54:8080 cookie DOCKERHOST3 check
    stats admin if TRUE
    option tcp-check

We also can define where the Zabbix Server can run. Here, we have only one Zabbix Server container running.

frontend zabbix_server
    mode tcp
    bind 0.0.0.0:10051
    default_backend backend_zabbix_server

backend backend_zabbix_server
    mode tcp
    server DOCKERHOST1 10.250.6.52:10051 check
    server DOCKERHOST2 10.250.6.53:10051 check
    server DOCKERHOST3 10.250.6.54:10051 check
    stats admin if TRUE
    option tcp-check

NFS Server

NFS Server is responsible for storing the mapped files in the containers.

After installing the packages, you need to run the following commands to configure the NFS Server and NFS Client:

NFS Server

mkdir /data/data-docker
vim /etc/exports
# Add the following line to /etc/exports:
/data/data-docker/ *(rw,sync,no_root_squash,no_subtree_check)

NFS Client

vim /etc/fstab
# Add the following line to /etc/fstab, prefixed with the NFS server address:
:/data/data-docker /mnt/data-docker nfs defaults 0 0

Hosts Docker and Docker Swarm

Hosts Docker and Docker Swarm are responsible for running and orchestrating the containers.

A swarm consists of one or more nodes. Nodes can be of two types:

  • Managers that are responsible for managing the cluster and can perform workloads.
  • Workers that are responsible for performing the services or the loads.

Reverse Proxy

Reverse Proxy, another essential component of this architecture, is responsible for receiving HTTP and HTTPS connections, identifying their destinations, and redirecting them to the responsible containers.

Reverse Proxy can be implemented using nginx or traefik.

In this example, we have three containers running traefik. After receiving a connection from HAProxy, traefik looks up the destination container and forwards the request to it.

Compose file and Deploy

The Compose file (./docker-compose.yml) is a YAML file that defines services, networks, and volumes. In this file, we determine which Zabbix Server image is used, which network the container connects to, what the service names are, and other necessary service settings.

Reverse Proxy

Here is an example of configuring Reverse Proxy using traefik.

traefik:
  image: traefik:v2.2.8
  deploy:
    placement:
      constraints:
        - node.role == manager
    replicas: 1
    restart_policy:
      condition: on-failure
    labels:
      # Dashboard traefik
      - "traefik.enable=true"
      - "traefik.http.services.justAdummyService.loadbalancer.server.port=1337"
      - "traefik.http.routers.traefik.tls=true"
      - "traefik.http.routers.traefik.rule=Host(`zabbix-traefik.mydomain`)"
      - "traefik.http.routers.traefik.service=api@internal"

where:

traefik: — the name of the service (in the first line).
image: — here, we can define which image we can use.
deploy: — rules for creating the deploy.
constraints: — a place of deployment.
replicas: — how many replicas we can create for this service.
restart_policy: — which policy to use if the service has a problem.
labels: — defining labels for traefik, including the rules for calling the service.

Then we can define how to configure authentication for the dashboard and how to redirect all HTTP connections to HTTPS.

      # Auth Dashboard
      - "traefik.http.routers.traefik.middlewares=traefik-auth"
      - "traefik.http.middlewares.traefik-auth.basicauth.users=admin:"
      # Redirect all HTTP to HTTPS permanently
      - "traefik.http.routers.http_catchall.rule=HostRegexp(`{any:.+}`)"
      - "traefik.http.routers.http_catchall.entrypoints=web"
      - "traefik.http.routers.http_catchall.middlewares=https_redirect"
      - "traefik.http.middlewares.https_redirect.redirectscheme.scheme=https"
      - "traefik.http.middlewares.https_redirect.redirectscheme.permanent=true"

Finally, we define the command to be executed after the container is started.

  command:
    - "--api=true"
    - "--log.level=INFO"
    - "--providers.docker.endpoint=unix:///var/run/docker.sock"
    - "--providers.docker.swarmMode=true"
    - "--providers.docker.exposedbydefault=false"
    - "--providers.file.directory=/etc/traefik/dynamic"
    - "--entrypoints.web.address=:80"
    - "--entrypoints.websecure.address=:443"

Zabbix Server

Zabbix Server configuration can be defined in this environment — the name of the Zabbix Server, image, OS, etc.

zabbix-server:
  image: zabbix/zabbix-server-mysql:centos-5.0-latest
  env_file:
    - ./envs/zabbix-server/common.env
  networks:
    - "monitoring-network"
  volumes:
    - /mnt/data-docker/zabbix-server/externalscripts:/usr/lib/zabbix/externalscripts:ro
    - /mnt/data-docker/zabbix-server/alertscripts:/usr/lib/zabbix/alertscripts:ro
  ports:
    - "10051:10051"
  deploy:
    <<: *template-deploy
    labels:
      - "traefik.enable=false"

In this case, we use the Zabbix 5.0 image. In the environment file, we can define, for instance, the database address, the database username, the number of pollers to start, the paths for external and alert scripts, and other options.

In this example, we use two volumes, one for external scripts and one for alert scripts, both of which must be stored on the NFS Server.

For the Zabbix Server, traefik is not enabled.

Frontend

For the frontend, we have another option, for instance, using the Zabbix image.

zabbix-frontend:
  image: zabbix/zabbix-web-nginx-mysql:alpine-5.0.1
  env_file:
    - ./envs/zabbix-frontend/common.env
  networks:
    - "monitoring-network"
  deploy:
    <<: *template-deploy
    replicas: 5
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.zabbix-frontend.tls=true"
      - "traefik.http.routers.zabbix-frontend.rule=Host(`frontend.domain`)"
      - "traefik.http.routers.zabbix-frontend.entrypoints=web"
      - "traefik.http.routers.zabbix-frontend.entrypoints=websecure"
      - "traefik.http.services.zabbix-frontend.loadbalancer.server.port=8080"

Here, 5 replicas means that we start 5 Zabbix frontends. This can be used for more extensive environments, and it also means that we have 5 containers and 5 connections.

To access the frontend, we use the ‘frontend.domain’ name. If we use a different name, access to the frontend will not be available.

The load balancer server port defines the port on which the container listens, which is 8080 for the official Zabbix frontend image.

Deploy

Up to now, deployment has been done manually. You needed to connect to one of the servers with the Docker Swarm manager role, enter the NFS directory, and deploy the service:

# docker stack deploy -c docker-compose.yaml zabbix

where -c defines the compose file’s name and ‘zabbix’ is the name of the stack.

Notes

Docker Image

Typically, the official Docker images from Zabbix are used. However, for the Zabbix Server and Zabbix Proxy, this is not enough. In production environments, additional components are needed, such as scripts and ODBC drivers to monitor databases. You should learn to work with Docker and to create custom images.

Networks

When creating environments using Docker, you should be careful. The Docker environment has some internal networks, which can conflict with the physical network. So, it is necessary to change the default networks: the Docker overlay network and the Docker bridge network.

Custom image

Example of customizing the Zabbix image to install ODBC drivers.

ARG ZABBIX_BASE=centos 
ARG ZABBIX_VERSION=5.0.3 
FROM zabbix/zabbix-proxy-sqlite3:${ZABBIX_BASE}-${ZABBIX_VERSION}
ENV ORACLE_HOME=/usr/lib/oracle/12.2/client64
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/oracle/12.2/client64/lib
ENV PATH=$PATH:/usr/lib/oracle/12.2/client64/lib

Then we install ODBC drivers. This script allows for using ODBC drivers for Oracle, MySQL, etc.

# Install ODBC 
COPY ./drivers-oracle-12.2.0.1.0 /root/ 
COPY odbc.sh /root 
RUN chmod +x /root/odbc.sh && \ 
/root/odbc.sh

Then we install Python packages.

# Install Python3 
COPY requirements.txt /requirements.txt
WORKDIR /
RUN yum install -y epel-release && \ 
yum search python3 && \ 
yum install -y python36 python36-pip && \ 
python3 -m pip install -r requirements.txt
# Install SNMP 
RUN yum install -y net-snmp-utils net-snmp wget vim telnet traceroute

With this image, we can monitor databases, network devices, HTTP connections, etc.

To complete the image customization, we need to:

  1. build the image,
  2. push to the registry,
  3. deploy the services.

This process is performed manually and should be automated.
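As a hedged sketch of how these three manual steps could be scripted before moving to a CI/CD pipeline, the following Python helper assembles the corresponding Docker CLI commands. The function names, the registry address registry.example.com, and the image name are assumptions made for illustration; only docker build, docker push, and docker stack deploy come from the workflow described in this post.

```python
import subprocess
from typing import Callable, List


def release_commands(image: str, tag: str, compose_file: str, stack: str) -> List[List[str]]:
    """Return the build, push, and deploy commands as argument lists."""
    ref = f"{image}:{tag}"
    return [
        ["docker", "build", "-t", ref, "."],                       # 1. build the image
        ["docker", "push", ref],                                   # 2. push to the registry
        ["docker", "stack", "deploy", "-c", compose_file, stack],  # 3. deploy the services
    ]


def run_release(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        runner(argv, check=True)


# Hypothetical image name and registry; adjust to your environment.
commands = release_commands(
    "registry.example.com/zabbix-proxy-custom", "5.0.3", "docker-compose.yaml", "zabbix"
)
for argv in commands:
    print(" ".join(argv))
```

Calling run_release(commands) on a Swarm manager node would execute the steps in order. The runner parameter exists only so the sequence can be exercised without Docker installed.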

Gitlab CI/CD

With CI/CD, you don’t need to run the process manually to create the image and deploy the services.

1. Create a repository for each component.

  • Zabbix Server
  • Frontend
  • Zabbix Proxy

2. Enable pipelines.
3. Create .gitlab-ci.yml.

Creating .gitlab-ci.yml file

Benefits of the architecture

  • If any Zabbix component stops, Docker Swarm will automatically start a new service/container.
  • We don’t need to connect to the terminal to start the environment.
  • Simple deployment.
  • Simple administration.

Questions & Answers

Question. Can such a Docker approach be used in extremely large environments?

Answer. Docker Swarm is already used to monitor extremely large environments with over 90,000 and over 50 proxies.

Question. Do you think it’s possible to set up a similar environment with Kubernetes?

Answer. I think it is possible, though scaling Zabbix with Kubernetes is more complex than with Docker Swarm. 

Evolving Container Security With Linux User Namespaces

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/evolving-container-security-with-linux-user-namespaces-afbe3308c082

By Fabio Kung, Sargun Dhillon, Andrew Spyker, Kyle, Rob Gulewich, Nabil Schear, Andrew Leung, Daniel Muino, and Manas Alekar

As previously discussed on the Netflix Tech Blog, Titus is the Netflix container orchestration system. It runs a wide variety of workloads from various parts of the company — everything from the frontend API for netflix.com, to machine learning training workloads, to video encoders. In Titus, the hosts that workloads run on are abstracted from our users. The Titus platform maintains large pools of homogenous node capacity to run user workloads, and the Titus scheduler places workloads. This abstraction allows the compute team to influence the reliability, efficiency, and operability of the fleet via the scheduler. The hosts that run workloads are called Titus “agents.” In this post, we describe how Titus agents leverage user namespaces to improve the overall security of the Titus agent fleet.

Titus’s Multi-Tenant Clusters

The Titus agent fleet appears to users as a homogenous pool of capacity. Titus internally employs a cellular bulkhead architecture for scalability, so the fleet is composed of multiple cells. Many bulkhead architectures partition their cells on tenants, where a tenant is defined as a team and their collection of applications. We do not take this approach, and instead, we partition our cells to balance load. We do this for reliability, scalability, and efficiency reasons.

Titus is a multi-tenant system, allowing multiple teams and users to run workloads on the system, and ensuring they can all co-exist while still providing guarantees about security and performance. Much of this comes down to isolation, which comes in multiple forms. These forms include performance isolation (ensuring workloads do not degrade one another’s performance), capacity isolation (ensuring that a given tenant can acquire resources when they ask for them), fault isolation (ensuring that the failure of a part of the system doesn’t cause the whole system to fail), and security isolation (ensuring that the compromise of one tenant’s workload does not affect the security of other tenants). This post focuses on our approaches to security isolation.

Secure Multi-tenancy

One of Titus’s biggest concerns with multi-tenancy is security isolation. We want to allow different kinds of containers from different tenants to run on the same instance. Security isolation in containers has been a contentious topic. Despite the risks, we’ve chosen to leverage containers as part of our security boundary. To offset the risks brought about by the container security boundary, we employ some additional protections.

The building blocks of multi-tenancy are Linux namespaces, the very technology that makes LXC, Docker, and other kinds of containers possible. For example, the PID namespace makes it so that a process can only see PIDs in its own namespace, and therefore cannot send kill signals to random processes on the host. In addition to the default Docker namespaces (mount, network, UTS, IPC, and PID), we employ user namespaces for added layers of isolation. Unfortunately, these default namespace boundaries are not sufficient to prevent container escape, as seen in CVEs like CVE-2015-2925. These vulnerabilities arise due to the complexity of interactions between namespaces, a large number of historical decisions during kernel development, and leaky abstractions like the proc filesystem in Linux. Composing these security isolation primitives correctly is difficult, so we’ve looked to other layers for additional protection.

Running many different workloads multi-tenant on a host necessitates the prevention of lateral movement, a technique in which the attacker compromises a single piece of software running in a container on the system, and uses that to compromise other containers on the same system. To mitigate this, we run containers as unprivileged users, making it so that users cannot use “root.” This is important because, in Linux, the privileges of UID 0 (root) do not come from the mere fact that the user is root, but from capabilities. These capabilities are tied to the current process’s credentials. Capabilities can be added via privilege escalation (e.g., sudo, file capabilities) or removed (e.g., setuid, or switching namespaces). Various capabilities control what the root user can do. For example, the CAP_SYS_BOOT capability controls the ability of a given user to reboot the machine. There are also more common capabilities that are granted to users like CAP_NET_RAW, which allows a process the ability to open raw sockets. A user can automatically have capabilities added when they execute specific files via file capabilities. For example, on a stock Ubuntu system, the ping command needs CAP_NET_RAW:

One of the most powerful capabilities in Linux is CAP_SYS_ADMIN, which is effectively equivalent to having superuser access. It gives the user the ability to do everything from mounting arbitrary filesystems, to accessing tracepoints that can expose vital information about the Linux kernel. Other powerful capabilities include CAP_CHOWN and CAP_DAC_OVERRIDE, which grant the capability to manipulate file permissions.

In the kernel, you’ll often see capability checks spread throughout the code, which looks something like this:

Notice this function doesn’t check if the user is root, but if the task has the CAP_SYS_ADMIN capability before allowing it to execute.

Docker takes the approach of using an allow-list to define which capabilities a container receives. These can be extended or attenuated by the user. Even the default capabilities that are defined in the Docker profile can be abused in certain situations. When we looked into running workloads as unprivileged users without many of these capabilities, we found that it was a non-starter. Various pieces of software used elevated capabilities for FUSE, low-level packet monitoring, and performance tracing amongst other use cases. Programs will usually start with capabilities, perform any activities that require those capabilities, and then “drop” them when the process no longer needs them.

User Namespaces

Fortunately, Linux has a solution — User Namespaces. Let’s go back to that kernel code example earlier. The pcrlock function called the capable function to determine whether or not the task was capable. This function is defined as:

This checks if the task has this capability relative to the init_user_ns. The init_user_ns is the namespace that processes are initially spawned in, as it’s the only user namespace that exists at kernel startup time. User namespaces are a mechanism to split up the init_user_ns UID space. The interface to set up the mappings is via a “uid_map” and “gid_map” that are exposed via /proc. The mapping looks something like this:

This allows UIDs in user-namespaced containers to be mapped to host UIDs. A variety of translations occur, but from the container’s perspective, everything is from the perspective of the UID ranges (otherwise known as extents) that are mapped. This is powerful in a few ways:

  1. It allows you to make certain UIDs off-limits to the container — if a UID is not mapped in the user namespace to a real UID, and you try to examine a file on disk with it, it will show up as overflowuid / overflowgid, a UID and GID specified in /proc/sys to indicate that it cannot be mapped into the current working space. Also, the container cannot setuid to a UID that can access files owned by that “outside uid.”
  2. From the user namespace’s perspective, the container’s root user appears to be UID 0, and the container can use the entire range of UIDs that are mapped into that namespace.
  3. Kernel subsystems can then proceed to call ns_capable with the specific user namespace that is tied to the resource. Many capability checks are now done to a user namespace that is relative to the resource being manipulated. This, in turn, allows processes to exercise certain privileges without having any privileges in the init user namespace. Even if the mapping is the same across many different namespaces, capability checks are still done relative to a specific user namespace.
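The extent-based translation described above can be modeled in a few lines of Python. This is a simplified sketch, not kernel code: the Extent type and function names are hypothetical, the example mapping "0 100000 65536" is an assumed value rather than one from our fleet, and 65534 is used as the overflow UID because it is the usual default of /proc/sys/kernel/overflowuid.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    inside_start: int   # first UID as seen inside the user namespace
    outside_start: int  # first UID as seen from the parent namespace
    length: int         # number of consecutive UIDs covered


OVERFLOW_UID = 65534  # assumed default of /proc/sys/kernel/overflowuid


def parse_uid_map(text: str) -> List[Extent]:
    """Parse the three-column format exposed by /proc/<pid>/uid_map."""
    return [Extent(*map(int, line.split())) for line in text.splitlines() if line.strip()]


def to_host_uid(extents: List[Extent], container_uid: int) -> Optional[int]:
    """Map a UID inside the namespace to the outside UID, or None if unmapped."""
    for e in extents:
        if e.inside_start <= container_uid < e.inside_start + e.length:
            return e.outside_start + (container_uid - e.inside_start)
    return None


def to_container_uid(extents: List[Extent], host_uid: int) -> int:
    """Map an outside UID into the namespace; unmapped UIDs show up as the overflow UID."""
    for e in extents:
        if e.outside_start <= host_uid < e.outside_start + e.length:
            return e.inside_start + (host_uid - e.outside_start)
    return OVERFLOW_UID


extents = parse_uid_map("0 100000 65536\n")
print(to_host_uid(extents, 0))       # container root as seen from outside
print(to_container_uid(extents, 0))  # host root as seen from inside the container
```

In this model, container root corresponds to an unprivileged outside UID, and a file owned by host root appears inside the container as owned by the overflow UID.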

One critical aspect of understanding how permissions work is that every namespace belongs to a specific user namespace. For example, let’s look at the UTS namespace, which is responsible for controlling the hostname:

The namespace has a relationship with a particular user namespace. The ability for a user to manipulate the hostname is based on whether or not the process has the appropriate capability in that user namespace.
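To make the relationship between a resource, its owning user namespace, and a capability check more concrete, here is a deliberately simplified Python model. It is a sketch under stated assumptions, not the kernel's implementation: all class and function names are hypothetical, UIDs are treated as plain host-wide integers, and only the two rules described in this post are modeled (capabilities held in the process's own user namespace, and the creator of a child namespace being capable within it).

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass(eq=False)
class UserNamespace:
    name: str
    parent: Optional["UserNamespace"] = None
    owner_uid: int = 0  # UID of the process that created this namespace


@dataclass(eq=False)
class Process:
    euid: int
    user_ns: UserNamespace
    effective_caps: Set[str] = field(default_factory=set)


@dataclass(eq=False)
class UtsNamespace:
    owner: UserNamespace  # every non-user namespace belongs to a user namespace
    hostname: str


def ns_capable(proc: Process, target_ns: UserNamespace, cap: str) -> bool:
    """Check a capability relative to target_ns, walking up towards the root."""
    ns: Optional[UserNamespace] = target_ns
    while ns is not None:
        if ns is proc.user_ns:
            return cap in proc.effective_caps
        # The creator of a namespace is capable in it from the parent namespace.
        if ns.parent is proc.user_ns and ns.owner_uid == proc.euid:
            return True
        ns = ns.parent
    return False  # target_ns is not the process's namespace or a descendant of it


def set_hostname(proc: Process, uts: UtsNamespace, name: str) -> bool:
    """Allow the change only with CAP_SYS_ADMIN in the UTS namespace's owner."""
    if not ns_capable(proc, uts.owner, "CAP_SYS_ADMIN"):
        return False
    uts.hostname = name
    return True


init_ns = UserNamespace("init")
host_uts = UtsNamespace(owner=init_ns, hostname="host")
child_ns = UserNamespace("child", parent=init_ns, owner_uid=1000)
container_root = Process(euid=1000, user_ns=child_ns, effective_caps={"CAP_SYS_ADMIN"})
print(set_hostname(container_root, host_uts, "foo"))                       # host UTS namespace
print(set_hostname(container_root, UtsNamespace(child_ns, "host"), "foo")) # own UTS namespace
```

The design point is that the check takes the namespace owning the resource as an argument, so holding CAP_SYS_ADMIN in a child user namespace grants nothing over resources owned by the initial one.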

Let’s Get Into It

We can examine how the interaction of namespaces and users works ourselves. To set the hostname in the UTS namespace, you need to have CAP_SYS_ADMIN in its user namespace. We can see this in action here, where an unprivileged process doesn’t have permission to set the hostname:

The reason for this is that the process does not have CAP_SYS_ADMIN. According to /proc/self/status, the effective capability set of this process is empty:
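As a rough illustration of how such a capability set can be read, the following Python sketch extracts the CapEff field from the text of /proc/<pid>/status and decodes the bitmask. The function names are hypothetical, and the name table is intentionally partial: it lists only the capabilities mentioned in this post, using their bit numbers from the Linux capability header, and labels anything else by bit number.

```python
from typing import List

# Partial table: bit number -> capability name (only those discussed in this post).
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
    22: "CAP_SYS_BOOT",
}


def parse_cap_eff(status_text: str) -> int:
    """Return the effective capability mask from the text of a status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)  # the field is a hexadecimal bitmask
    raise ValueError("no CapEff line found")


def decode_caps(mask: int) -> List[str]:
    """List the capabilities whose bits are set in the mask."""
    return [CAP_NAMES.get(bit, f"cap_{bit}") for bit in range(64) if mask & (1 << bit)]


# Hand-written sample input in the status-file format, not captured output.
sample = "Name:\tbash\nCapEff:\t0000000000000000\n"
print(decode_caps(parse_cap_eff(sample)))
```

On a Linux host, passing the contents of /proc/self/status to parse_cap_eff applies the same decoding to the current process.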

Now, let’s try to set up a user namespace, and see what happens:

Immediately, you’ll notice the command prompt says the current user is root, and that the id command agrees. Can we set the hostname now?

We still cannot set the hostname. This is because the process is still in the initial UTS namespace. Let’s see if we can unshare the UTS namespace, and set the hostname:

This is now successful, and the process is in an isolated UTS namespace with the hostname “foo.” This is because the process now has all of the capabilities that a traditional root user would have, except they are relative to the new user namespace we created:

If we inspect this process from the outside, we can see that the process still runs as the unprivileged user, and the hostname in the original outside namespace hasn’t changed:

From here, we can do all sorts of things, like mount filesystems, create other new namespaces, and in fact, we can create an entire container environment. Notice how no privilege escalation mechanism was used to perform any of these actions. This approach is what some people refer to as “rootless containers.”

Road to Implementation

We began work to enable user namespaces in early 2017. At the time we had a naive model that was simpler. This simplicity was possible because we were running without user namespaces:

This approach mirrored the process layout and boundaries of contemporary container orchestration systems. We had a shared metrics daemon on the machine that reached in and polled metrics from the container. User access was done by exposing an SSH daemon, and automatically doing nsenter on the user’s behalf to drop them into the container. To expose files to the container we would use bind mounts. The same mechanism was used to expose configuration, such as secrets.

This had the benefit that much of our software could be installed in the host namespace, and only manage files in that namespace. The container runtime management system (Titus) was then responsible for configuring Docker to expose the right files to the container via bind mounts. In addition to that, we could use our standard metrics daemons on the host.

Although this model was easy to reason about and write software for, it had several shortcomings that we addressed by shifting everything to running inside of the container’s unprivileged user namespace. The first shortcoming was that all of the host daemons now needed to be aware of the UID translation, and perform the proper setuid or chown calls to transition across the container boundary. Second, each of these transitions represented a security risk. If the SSH daemon only partially transitioned into the container namespace by changing into the container’s pid namespace, it would leave its /proc accessible. This could then be used by a malicious attacker to escape.

With user namespaces, we can improve our security posture and reduce the complexity of the system by running those daemons in the container’s unprivileged user namespace, which removes the need to cross the namespace boundaries. In turn, this removes the need to correctly implement a cross-namespace transition mechanism, thus reducing the risk of introducing container escapes.

We did this by moving aspects of the container runtime environment into the container. For example, we run an SSH daemon per container and a metrics daemon per container. These run inside of the namespaces of the container, and they have the same capabilities and lifecycle as the workloads in the container. We call this model “System Services” — one can think of it as a primordial version of pods. By the end of 2018, we had moved all of our containers to run in unprivileged user namespaces successfully.

Why is this useful?

This may seem like another level of indirection that just introduces complexity, but instead, it allows us to leverage an extremely useful concept — “unprivileged containers.” In unprivileged containers, the root user starts from a baseline in which they don’t automatically have access to the entire system. This means that DAC, MAC, and seccomp policies are now an extra layer of defense against accessing privileged aspects of the system — not the only layer. As new privileges are added, we do not have to add them to an exclusion list. This allows our users to write software where they can control low-level system details in their own containers, rather than forcing all of the complexity up into the container runtime.

Use Case: FUSE

Netflix internally uses a purpose-built FUSE filesystem called MezzFS. The purpose of this filesystem is to provide access to our content for a variety of encoding tools. Most of these encoding tools are designed to interact with the POSIX filesystem API. Our Media Cloud Engineering team wanted to leverage containers for a new platform they were building, called Archer. Archer, in turn, uses MezzFS, which needs FUSE, and at the time, FUSE required that the user have CAP_SYS_ADMIN in the initial user namespace. To accommodate the use case from our internal partner, we had to run them in a dedicated cluster where they could run privileged containers.

In 2017, we worked with our partner, Kinvolk, to have patches added to the Linux kernel that allowed users to safely use FUSE from non-init user namespaces. They were able to successfully upstream these patches, and we’ve been using them in production. From our users’ perspective, we were able to seamlessly move them into an unprivileged environment that was more secure. This simplified operations, as this workload was no longer considered exceptional and could run alongside every other workload in the general node pool. In turn, this gave the media encoding team access to a massive amount of compute capacity from the shared clusters, and better reliability due to the homogeneous nature of the deployment.

Use Case: Unintended Privileges

Many CVEs related to granting containers unintended privileges have been released in the past few years:

CVE-2020-15257: Privilege escalation in containerd

CVE-2019-5736: Privilege escalation via overwriting host runc binary

CVE-2018-10892: Access to /proc/acpi, allowing an attacker to modify hardware configuration

There will certainly be more vulnerabilities in the future, as is to be expected in any complex, quickly evolving system. We already use the default settings offered by Docker, such as AppArmor and seccomp, but by adding user namespaces, we can achieve a superior defense-in-depth security model. These CVEs did not affect our infrastructure because we were using user namespaces for all of our containers. The attenuation of capabilities in the init user namespace performed as intended and stopped these attacks.

The Future

There are still many parts of the kernel that are receiving support for user namespaces or enhancements that make user namespaces easier to use. Much of the remaining work is focused on filesystems and on container orchestration systems themselves. Some of these changes are slated for upcoming kernel releases. Work is being done to add unprivileged mounts to overlayfs, allowing for nested container builds in a user namespace with layers. Work is also under way to make the Linux kernel VFS layer natively understand ID translation. This will allow user namespaces with different ID mappings to access the same underlying filesystem by shifting UIDs through a bind mount. Our partners at Kinvolk are also working on bringing user namespaces to Kubernetes.

Today, a variety of container runtimes support user namespaces. Docker can set up machine-wide UID mappings with separate user namespaces per container, as outlined in their docs. OCI-compliant runtimes such as containerd / runc, Podman, and systemd-nspawn support user namespaces. Various container orchestration engines, such as Nomad and Docker Swarm, also support user namespaces via their underlying container runtimes.
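As a rough illustration of the Docker option mentioned above, the sketch below prints a daemon.json fragment that turns on Docker's userns-remap setting and parses a subordinate ID range in the name:start:count form used by /etc/subuid and /etc/subgid. The dockremap entry and its range are hypothetical values for illustration; the ranges on a real host depend on how it was provisioned.

```python
import json
from typing import NamedTuple


class SubordinateRange(NamedTuple):
    owner: str
    start: int
    count: int


def parse_subid_line(line: str) -> SubordinateRange:
    """Parse one 'name:start:count' entry as used in /etc/subuid and /etc/subgid."""
    owner, start, count = line.strip().split(":")
    return SubordinateRange(owner, int(start), int(count))


# daemon.json fragment asking Docker to remap containers to its default remap user.
daemon_config = json.dumps({"userns-remap": "default"}, indent=2)

# Hypothetical subordinate range assigned to that user.
remap_range = parse_subid_line("dockremap:100000:65536")

print(daemon_config)
# First and last host UID available to remapped containers in this example.
print(remap_range.start, remap_range.start + remap_range.count - 1)  # 100000 165535
```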

As part of our move to Kubernetes, Netflix has been working with Kinvolk on getting user namespaces to work under Kubernetes. You can follow this work via the KEP discussion here, and Kinvolk has more information about running user namespaces under Kubernetes on their blog. We look forward to evolving container security together with the Kubernetes community.


Evolving Container Security With Linux User Namespaces was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Reducing Docker image build time on AWS CodeBuild using an external cache

Post Syndicated from Camillo Anania original https://aws.amazon.com/blogs/devops/reducing-docker-image-build-time-on-aws-codebuild-using-an-external-cache/

Containerized solutions simplify creating, deploying, and running applications, and automated CI/CD pipelines continuously rebuild, test, and deploy those applications when new changes are committed. It’s important that these pipelines run as quickly as possible so that you get early feedback and can release faster.

AWS CodeBuild supports local caching, which makes it possible to persist intermediate build artifacts, like a Docker layer cache, locally on the build host and reuse them in subsequent runs. The CodeBuild local cache is maintained on the host on a best-effort basis, so your build runs may not hit the cache as often as you would like.

A typical Docker image is built from several intermediate layers that are constructed during the initial image build process on a host. These intermediate layers are reused if found valid in any subsequent image rebuild; doing so speeds up the build process considerably because the Docker engine doesn’t need to rebuild the whole image if the layers in the cache are still valid.
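The reuse rule described above can be sketched as a small simulation. This is a simplified model rather than Docker's actual implementation: each layer's cache key is assumed to depend only on the instruction text and on the key of the layer before it, whereas the real engine also takes into account things like the contents of copied files. The instruction strings are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def layer_keys(instructions: List[str]) -> List[str]:
    """Derive one cache key per instruction, chained to the parent layer's key."""
    keys = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys


def build(instructions: List[str], cache: Dict[str, str]) -> Tuple[int, int]:
    """Simulate a build; return (layers reused, layers rebuilt) and update the cache."""
    reused = rebuilt = 0
    for instruction, key in zip(instructions, layer_keys(instructions)):
        if key in cache:
            reused += 1
        else:
            cache[key] = instruction  # stand-in for actually running the instruction
            rebuilt += 1
    return reused, rebuilt


dockerfile = ["FROM base", "RUN install-a", "RUN install-b", "COPY . /app"]
cache: Dict[str, str] = {}

print(build(dockerfile, cache))  # (0, 4): cold cache, every layer is built
print(build(dockerfile, cache))  # (4, 0): nothing changed, every layer is reused

# Changing the third instruction invalidates that layer and every layer after it.
changed = dockerfile[:2] + ["RUN install-b && echo changed"] + dockerfile[3:]
print(build(changed, cache))     # (2, 2)
```

Because each key is chained to its parent, a change in the middle of the file invalidates every later layer even when the later instructions are textually identical.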

This post shows how to implement a simple, effective, and durable external Docker layer cache for CodeBuild to significantly reduce image build runtime.

Solution overview

The following diagram illustrates the high-level architecture of this solution. We describe implementing each stage in more detail in the following paragraphs.

CodeBuildExternalCacheDiagram

In a modern software engineering approach built around CI/CD practices, whenever specific events happen, such as when an application code change is merged, you need to rebuild, test, and eventually deploy the application. Assuming the application is containerized with Docker, the build process entails rebuilding one or multiple Docker images. The environment for this rebuild is on CodeBuild, which is a fully managed build service in the cloud. CodeBuild spins up a new environment to accommodate build requests and runs a sequence of actions defined in its build specification.

Because each CodeBuild instance is an independent environment, build artifacts can’t be persisted in the host indefinitely. The native CodeBuild local caching feature allows you to persist a cache for a limited time so that immediate subsequent builds can benefit from it. Native local caching is performed at best effort and can’t be relied on when multiple builds are triggered at different times. This solution describes using an external persistent cache that you can reuse across builds and is valid at any time.

After the first build of a Docker image is complete, the image is tagged and pushed to Amazon Elastic Container Registry (Amazon ECR). In each subsequent build, the image is pulled from Amazon ECR and the Docker build process is forced to use it as cache for its next build iteration of the image. Finally, the newly produced image is pushed back to Amazon ECR.

In the following paragraphs, we explain the solution and walk you through an example implementation. The solution rebuilds the publicly available Amazon Linux 2 Standard 3.0 image, which is an optimized image that you can use with CodeBuild.

Creating a policy and service role

The first step is to create an AWS Identity and Access Management (IAM) policy and service role for CodeBuild with the minimum set of permissions to perform the job.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. Provide the following policy in JSON format:
    CodeBuild Docker Cache Policy:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecr:GetAuthorizationToken",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:GetRepositoryPolicy",
                    "ecr:DescribeRepositories",
                    "ecr:ListImages",
                    "ecr:DescribeImages",
                    "ecr:BatchGetImage",
                    "ecr:ListTagsForResource",
                    "ecr:DescribeImageScanFindings",
                    "ecr:InitiateLayerUpload",
                    "ecr:UploadLayerPart",
                    "ecr:CompleteLayerUpload",
                    "ecr:PutImage"
                ],
                "Resource": "*"
            }
        ]
    }
  4. In the Review policy section, enter a name (for example, CodeBuildDockerCachePolicy).
  5. Choose Create policy.
  6. Choose Roles on the navigation pane.
  7. Choose Create role.
  8. Keep AWS service as the type of role and choose CodeBuild from the list of services.
  9. Choose Next.
  10. Search for and add the policy you created.
  11. Review the role and enter a name (for example, CodeBuildDockerCacheRole).
  12. Choose Create role.
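The policy above grants its actions on every resource. As a possible tightening, the following sketch generates a variant in which only ecr:GetAuthorizationToken keeps "Resource": "*" (that action does not accept a repository-level resource) and the remaining actions are scoped to a single repository ARN. The account ID and region are hypothetical placeholders, and the assumption that every other action in the list accepts a repository ARN should be checked against the IAM documentation before use.

```python
import json

# Actions from the policy above, minus the authorization-token request.
REPOSITORY_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:GetRepositoryPolicy",
    "ecr:DescribeRepositories",
    "ecr:ListImages",
    "ecr:DescribeImages",
    "ecr:BatchGetImage",
    "ecr:ListTagsForResource",
    "ecr:DescribeImageScanFindings",
    "ecr:InitiateLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:CompleteLayerUpload",
    "ecr:PutImage",
]


def scoped_cache_policy(account_id: str, region: str, repository: str) -> dict:
    """Build a policy document limited to one ECR repository (illustrative only)."""
    repo_arn = f"arn:aws:ecr:{region}:{account_id}:repository/{repository}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The token request cannot be limited to a single repository.
                "Effect": "Allow",
                "Action": ["ecr:GetAuthorizationToken"],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": REPOSITORY_ACTIONS,
                "Resource": repo_arn,
            },
        ],
    }


# Hypothetical account ID and region; substitute your own values.
policy = scoped_cache_policy("123456789012", "eu-west-2", "amazon_linux_codebuild_image")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the policy editor in place of the broader document, provided the repository name matches the one created in the next section.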

Creating an Amazon ECR repository

In this step, we create an Amazon ECR repository to store the built Docker images.

  1. On the Amazon ECR console, choose Create repository.
  2. Enter a name (for example, amazon_linux_codebuild_image).
  3. Choose Create repository.

Configuring a CodeBuild project

You now configure the CodeBuild project that builds the Docker image and configures its cache to speed up the process.

  1. On the CodeBuild console, choose Create build project.
  2. Enter a name (for example, SampleDockerCacheProject).
  3. For Source provider, choose GitHub.
  4. For Repository, select Public repository.
  5. For Repository URL, enter https://github.com/aws/aws-codebuild-docker-images.
    CodeBuildGitHubSourceConfiguration
  6. In the Environment section, for Environment image, select Managed image.
  7. For Operating system, choose Amazon Linux 2.
  8. For Runtime(s), choose Standard.
  9. For Image, enter aws/codebuild/amazonlinux2-x86_64-standard:3.0.
  10. For Image version, choose Always use the latest image for this runtime version.
  11. For Environment type, choose Linux.
  12. For Privileged, select Enable this flag if you want to build Docker images or want your builds to get elevated privileges.
  13. For Service role, select Existing service role.
  14. For Role ARN, enter the ARN for the service role you created (CodeBuildDockerCacheRole).
  15. Select Allow AWS CodeBuild to modify this service role so it can be used with this build project.
    CodeBuildEnvironmentConfiguration
  16. In the Buildspec section, select Insert build commands.
  17. Choose Switch to editor.
  18. Enter the following build specification (substitute account-ID and region).
    version: 0.2

    env:
      variables:
        # Substitute account-ID and region with your own values.
        CONTAINER_REPOSITORY_URL: account-ID.dkr.ecr.region.amazonaws.com/amazon_linux_codebuild_image
        TAG_NAME: latest

    phases:
      install:
        runtime-versions:
          docker: 19

      pre_build:
        commands:
          # Authenticate with Amazon ECR, then pull the previous image if it exists.
          - $(aws ecr get-login --no-include-email)
          - docker pull $CONTAINER_REPOSITORY_URL:$TAG_NAME || true

      build:
        commands:
          - cd ./al2/x86_64/standard/1.0
          # Use the pulled image as the layer cache for the rebuild.
          - docker build --cache-from $CONTAINER_REPOSITORY_URL:$TAG_NAME --tag $CONTAINER_REPOSITORY_URL:$TAG_NAME .

      post_build:
        commands:
          - docker push $CONTAINER_REPOSITORY_URL
  19. Choose Create the project.

The provided build specification instructs CodeBuild to do the following:

  • Use the Docker 19 runtime to run the build. The following process doesn’t work reliably with Docker versions lower than 19.
  • Authenticate with Amazon ECR and pull the image you want to rebuild if it exists (on the first run, this image doesn’t exist).
  • Run the image rebuild, using the --cache-from parameter to force Docker to consider the image pulled in the previous step as cache.
  • When the image rebuild is complete, push it to Amazon ECR.
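The pull, build, and push sequence in the list above can also be expressed as a small helper that assembles the repository URL and the Docker commands. This is only a sketch: the function names, account ID, and region are hypothetical, the authentication step is left out, and the push names the tag explicitly, which is a small deviation from the build specification.

```python
from typing import List


def repository_url(account_id: str, region: str, repository: str) -> str:
    """Compose the Amazon ECR repository URL in the form used by the buildspec."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}"


def cached_build_commands(repo_url: str, tag: str = "latest", context: str = ".") -> List[str]:
    """Return the shell commands for one cache-assisted rebuild, in order."""
    image = f"{repo_url}:{tag}"
    return [
        # Tolerate a missing image so that the very first run does not fail.
        f"docker pull {image} || true",
        # Offer the pulled image to the builder as a source of cached layers.
        f"docker build --cache-from {image} --tag {image} {context}",
        # Publish the result so that the next run can use it as cache.
        f"docker push {image}",
    ]


# Hypothetical account ID and region.
url = repository_url("123456789012", "eu-west-2", "amazon_linux_codebuild_image")
for command in cached_build_commands(url):
    print(command)
```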

Testing the solution

The solution is fully configured, so we can proceed to evaluate its behavior.

For the first run, we record a runtime of approximately 39 minutes. The build doesn’t use any cache and the docker pull in the pre-build stage fails to find the image we indicate, as expected (the || true statement at the end of the command line guarantees that the CodeBuild instance doesn’t stop because the docker pull failed).

The second run pulls the previously built image before starting the rebuild and completes in approximately 6 minutes, most of which is spent downloading the image from Amazon ECR (which is almost 5 GB).

We trigger another run after simulating a change halfway through the Dockerfile (addition of an echo command to the statement at line 291 of the Dockerfile). Docker still reuses the layers in the cache until the point of the changed statement and then rebuilds from scratch the remaining layers described in the Dockerfile. The runtime was approximately 31 minutes; the overhead of downloading the whole image first partially offsets the advantages of using it as cache.

Note that the image in this use case is considerably large; on average, projects deal with smaller images that introduce less overhead. Furthermore, the previous run had the built-in CodeBuild feature for caching Docker layers on a best-effort basis disabled; enabling it provides further efficiency because the docker pull specified in the pre-build stage doesn’t have to download the image if the one available locally matches the one on Amazon ECR.
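To put the three approximate runtimes reported in this section side by side, the following snippet computes the reduction of each cached run relative to the uncached first run. The minute values are the ones quoted above; the helper name is an assumption for illustration.

```python
def reduction_percent(baseline_minutes: float, run_minutes: float) -> float:
    """Percentage of build time saved relative to the uncached baseline."""
    return round(100 * (baseline_minutes - run_minutes) / baseline_minutes, 1)


baseline = 39  # first run, no cache available

print(reduction_percent(baseline, 6))   # 84.6: unchanged Dockerfile, full cache hit
print(reduction_percent(baseline, 31))  # 20.5: change halfway through the Dockerfile
```

The second figure reflects both the layers that had to be rebuilt after the change and the time spent downloading the cached image first.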

Cleaning up

When you’re finished testing, you should deprovision the following resources to avoid incurring further charges and to keep the account free of unused resources:

  • The amazon_linux_codebuild_image Amazon ECR repository and its images;
  • The SampleDockerCacheProject CodeBuild project;
  • The CodeBuildDockerCachePolicy policy and the CodeBuildDockerCacheRole role.

Conclusion

In this post, we reviewed a simple and effective solution to implement a durable external cache for Docker on CodeBuild. The solution provides significant improvements in the execution time of the Docker build process on CodeBuild and is general enough to accommodate the majority of use cases, including multi-stage builds.

The approach works in synergy with the built-in CodeBuild feature of caching Docker layers at best effort, and we recommend using it for further improvements. Shorter build processes translate to lower compute costs and a shorter development lifecycle, so features are released faster and at a lower cost.

About the Author

Camillo Anania is a Global DevOps Consultant with AWS Professional Services, London, UK.

James Jacob is a Global DevOps Consultant with AWS Professional Services, London, UK.