Tag Archives: Containers

Run large-scale simulations with AWS Batch multi-container jobs

2024-03-25 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/run-large-scale-simulations-with-aws-batch-multi-container-jobs/

Industries like automotive, robotics, and finance are increasingly implementing computational workloads like simulations, machine learning (ML) model training, and big data analytics to improve their products. For example, automakers rely on simulations to test autonomous driving features, robotics companies train ML algorithms to enhance robot perception capabilities, and financial firms run in-depth analyses to better manage risk, process transactions, and detect fraud.

Some of these workloads, including simulations, are especially complicated to run due to their diversity of components and intensive computational requirements. A driving simulation, for instance, involves generating 3D virtual environments, vehicle sensor data, vehicle dynamics controlling car behavior, and more. A robotics simulation might test hundreds of autonomous delivery robots interacting with each other and other systems in a massive warehouse environment.

AWS Batch is a fully managed service that can help you run batch workloads across a range of AWS compute offerings, including Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, and Amazon EC2 Spot or On-Demand Instances. Traditionally, AWS Batch only allowed single-container jobs and required extra steps to merge all components into a monolithic container. It also did not allow using separate “sidecar” containers, which are auxiliary containers that complement the main application by providing additional services like data logging. This additional effort required coordination across multiple teams, such as software development, IT operations, and quality assurance (QA), because any code change meant rebuilding the entire container.

Now, AWS Batch offers multi-container jobs, making it easier and faster to run large-scale simulations in areas like autonomous vehicles and robotics. These workloads are usually divided between the simulation itself and the system under test (also known as an agent) that interacts with the simulation. These two components are often developed and optimized by different teams. With the ability to run multiple containers per job, you get the advanced scaling, scheduling, and cost optimization offered by AWS Batch, and you can use modular containers representing different components like 3D environments, robot sensors, or monitoring sidecars. In fact, customers such as IPG Automotive, MORAI, and Robotec.ai are already using AWS Batch multi-container jobs to run their simulation software in the cloud.

Let’s see how this works in practice using a simplified example and have some fun trying to solve a maze.

Building a Simulation Running on Containers
In production, you will probably use existing simulation software. For this post, I built a simplified version of an agent/model simulation. If you’re not interested in code details, you can skip this section and go straight to how to configure AWS Batch.

For this simulation, the world to explore is a randomly generated 2D maze. The agent has the task to explore the maze to find a key and then reach the exit. In a way, it is a classic example of pathfinding problems with three locations.

Here’s a sample map of a maze where I highlighted the start (S), end (E), and key (K) locations.

The separation of agent and model into two separate containers allows different teams to work on each of them separately. Each team can focus on improving their own part, for example, to add details to the simulation or to find better strategies for how the agent explores the maze.

Here’s the code of the maze model (app.py). I used Python for both examples. The model exposes a REST API that the agent can use to move around the maze and know if it has found the key and reached the exit. The maze model uses Flask for the REST API.

import json
import random
from flask import Flask, request, Response

ready = False

# How map data is stored inside a maze
# with size (width x height) = (4 x 3)
#
#    012345678
# 0: +-+-+ +-+
# 1: | |   | |
# 2: +-+ +-+-+
# 3: | |   | |
# 4: +-+-+ +-+
# 5: | | | | |
# 6: +-+-+-+-+
# 7: Not used

class WrongDirection(Exception):
    pass

class Maze:
    UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
    OPEN, WALL = 0, 1
    

    @staticmethod
    def distance(p1, p2):
        (x1, y1) = p1
        (x2, y2) = p2
        return abs(y2-y1) + abs(x2-x1)


    @staticmethod
    def random_dir():
        return random.randrange(4)


    @staticmethod
    def go_dir(x, y, d):
        if d == Maze.UP:
            return (x, y - 1)
        elif d == Maze.RIGHT:
            return (x + 1, y)
        elif d == Maze.DOWN:
            return (x, y + 1)
        elif d == Maze.LEFT:
            return (x - 1, y)
        else:
            raise WrongDirection(f"Direction: {d}")


    def __init__(self, width, height):
        self.width = width
        self.height = height        
        self.generate()
        

    def area(self):
        return self.width * self.height
        

    def min_lenght(self):
        return self.area() / 5
    

    def min_distance(self):
        return (self.width + self.height) / 5
    

    def get_pos_dir(self, x, y, d):
        if d == Maze.UP:
            return self.maze[y][2 * x + 1]
        elif d == Maze.RIGHT:
            return self.maze[y][2 * x + 2]
        elif d == Maze.DOWN:
            return self.maze[y + 1][2 * x + 1]
        elif d ==  Maze.LEFT:
            return self.maze[y][2 * x]
        else:
            raise WrongDirection(f"Direction: {d}")


    def set_pos_dir(self, x, y, d, v):
        if d == Maze.UP:
            self.maze[y][2 * x + 1] = v
        elif d == Maze.RIGHT:
            self.maze[y][2 * x + 2] = v
        elif d == Maze.DOWN:
            self.maze[y + 1][2 * x + 1] = v
        elif d ==  Maze.LEFT:
            self.maze[y][2 * x] = v
        else:
            WrongDirection(f"Direction: {d}  Value: {v}")


    def is_inside(self, x, y):
        return 0 <= y < self.height and 0 <= x < self.width


    def generate(self):
        self.maze = []
        # Close all borders
        for y in range(0, self.height + 1):
            self.maze.append([Maze.WALL] * (2 * self.width + 1))
        # Get a random starting point on one of the borders
        if random.random() < 0.5:
            sx = random.randrange(self.width)
            if random.random() < 0.5:
                sy = 0
                self.set_pos_dir(sx, sy, Maze.UP, Maze.OPEN)
            else:
                sy = self.height - 1
                self.set_pos_dir(sx, sy, Maze.DOWN, Maze.OPEN)
        else:
            sy = random.randrange(self.height)
            if random.random() < 0.5:
                sx = 0
                self.set_pos_dir(sx, sy, Maze.LEFT, Maze.OPEN)
            else:
                sx = self.width - 1
                self.set_pos_dir(sx, sy, Maze.RIGHT, Maze.OPEN)
        self.start = (sx, sy)
        been = [self.start]
        pos = -1
        solved = False
        generate_status = 0
        old_generate_status = 0                    
        while len(been) < self.area():
            (x, y) = been[pos]
            sd = Maze.random_dir()
            for nd in range(4):
                d = (sd + nd) % 4
                if self.get_pos_dir(x, y, d) != Maze.WALL:
                    continue
                (nx, ny) = Maze.go_dir(x, y, d)
                if (nx, ny) in been:
                    continue
                if self.is_inside(nx, ny):
                    self.set_pos_dir(x, y, d, Maze.OPEN)
                    been.append((nx, ny))
                    pos = -1
                    generate_status = len(been) / self.area()
                    if generate_status - old_generate_status > 0.1:
                        old_generate_status = generate_status
                        print(f"{generate_status * 100:.2f}%")
                    break
                elif solved or len(been) < self.min_lenght():
                    continue
                else:
                    self.set_pos_dir(x, y, d, Maze.OPEN)
                    self.end = (x, y)
                    solved = True
                    pos = -1 - random.randrange(len(been))
                    break
            else:
                pos -= 1
                if pos < -len(been):
                    pos = -1
                    
        self.key = None
        while(self.key == None):
            kx = random.randrange(self.width)
            ky = random.randrange(self.height)
            if (Maze.distance(self.start, (kx,ky)) > self.min_distance()
                and Maze.distance(self.end, (kx,ky)) > self.min_distance()):
                self.key = (kx, ky)


    def get_label(self, x, y):
        if (x, y) == self.start:
            c = 'S'
        elif (x, y) == self.end:
            c = 'E'
        elif (x, y) == self.key:
            c = 'K'
        else:
            c = ' '
        return c

                    
    def map(self, moves=[]):
        map = ''
        for py in range(self.height * 2 + 1):
            row = ''
            for px in range(self.width * 2 + 1):
                x = int(px / 2)
                y = int(py / 2)
                if py % 2 == 0: #Even rows
                    if px % 2 == 0:
                        c = '+'
                    else:
                        v = self.get_pos_dir(x, y, self.UP)
                        if v == Maze.OPEN:
                            c = ' '
                        elif v == Maze.WALL:
                            c = '-'
                else: # Odd rows
                    if px % 2 == 0:
                        v = self.get_pos_dir(x, y, self.LEFT)
                        if v == Maze.OPEN:
                            c = ' '
                        elif v == Maze.WALL:
                            c = '|'
                    else:
                        c = self.get_label(x, y)
                        if c == ' ' and [x, y] in moves:
                            c = '*'
                row += c
            map += row + '\n'
        return map


app = Flask(__name__)

@app.route('/')
def hello_maze():
    return "<p>Hello, Maze!</p>"

@app.route('/maze/map', methods=['GET', 'POST'])
def maze_map():
    if not ready:
        return Response(status=503, retry_after=10)
    if request.method == 'GET':
        return '<pre>' + maze.map() + '</pre>'
    else:
        moves = request.get_json()
        return maze.map(moves)

@app.route('/maze/start')
def maze_start():
    if not ready:
        return Response(status=503, retry_after=10)
    start = { 'x': maze.start[0], 'y': maze.start[1] }
    return json.dumps(start)

@app.route('/maze/size')
def maze_size():
    if not ready:
        return Response(status=503, retry_after=10)
    size = { 'width': maze.width, 'height': maze.height }
    return json.dumps(size)

@app.route('/maze/pos/<int:y>/<int:x>')
def maze_pos(y, x):
    if not ready:
        return Response(status=503, retry_after=10)
    pos = {
        'here': maze.get_label(x, y),
        'up': maze.get_pos_dir(x, y, Maze.UP),
        'down': maze.get_pos_dir(x, y, Maze.DOWN),
        'left': maze.get_pos_dir(x, y, Maze.LEFT),
        'right': maze.get_pos_dir(x, y, Maze.RIGHT),

    }
    return json.dumps(pos)


WIDTH = 80
HEIGHT = 20
maze = Maze(WIDTH, HEIGHT)
ready = True

The only requirement for the maze model (in requirements.txt) is the Flask module.

To create a container image running the maze model, I use this Dockerfile.

FROM --platform=linux/amd64 public.ecr.aws/docker/library/python:3.12-alpine

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

CMD [ "python3", "-m" , "flask", "run", "--host=0.0.0.0", "--port=5555"]

Here’s the code for the agent (agent.py). First, the agent asks the model for the size of the maze and the starting position. Then, it applies its own strategy to explore and solve the maze. In this implementation, the agent chooses its route at random, trying to avoid following the same path more than once.

import random
import requests
from requests.adapters import HTTPAdapter, Retry

HOST = '127.0.0.1'
PORT = 5555

BASE_URL = f"http://{HOST}:{PORT}/maze"

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
OPEN, WALL = 0, 1

s = requests.Session()

retries = Retry(total=10,
                backoff_factor=1)

s.mount('http://', HTTPAdapter(max_retries=retries))

r = s.get(f"{BASE_URL}/size")
size = r.json()
print('SIZE', size)

r = s.get(f"{BASE_URL}/start")
start = r.json()
print('START', start)

y = start['y']
x = start['x']

found_key = False
been = set((x, y))
moves = [(x, y)]
moves_stack = [(x, y)]

while True:
    r = s.get(f"{BASE_URL}/pos/{y}/{x}")
    pos = r.json()
    if pos['here'] == 'K' and not found_key:
        print(f"({x}, {y}) key found")
        found_key = True
        been = set((x, y))
        moves_stack = [(x, y)]
    if pos['here'] == 'E' and found_key:
        print(f"({x}, {y}) exit")
        break
    dirs = list(range(4))
    random.shuffle(dirs)
    for d in dirs:
        nx, ny = x, y
        if d == UP and pos['up'] == 0:
            ny -= 1
        if d == RIGHT and pos['right'] == 0:
            nx += 1
        if d == DOWN and pos['down'] == 0:
            ny += 1
        if d == LEFT and pos['left'] == 0:
            nx -= 1 

        if nx < 0 or nx >= size['width'] or ny < 0 or ny >= size['height']:
            continue

        if (nx, ny) in been:
            continue

        x, y = nx, ny
        been.add((x, y))
        moves.append((x, y))
        moves_stack.append((x, y))
        break
    else:
        if len(moves_stack) > 0:
            x, y = moves_stack.pop()
        else:
            print("No moves left")
            break

print(f"Solution length: {len(moves)}")
print(moves)

r = s.post(f'{BASE_URL}/map', json=moves)

print(r.text)

s.close()

The only dependency of the agent (in requirements.txt) is the requests module.

This is the Dockerfile I use to create a container image for the agent.

FROM --platform=linux/amd64 public.ecr.aws/docker/library/python:3.12-alpine

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

CMD [ "python3", "agent.py"]

You can easily run this simplified version of a simulation locally, but the cloud allows you to run it at larger scale (for example, with a much bigger and more detailed maze) and to test multiple agents to find the best strategy to use. In a real-world scenario, the improvements to the agent would then be implemented into a physical device such as a self-driving car or a robot vacuum cleaner.

Running a simulation using multi-container jobs
To run a job with AWS Batch, I need to configure three resources:

The compute environment in which to run the job
The job queue in which to submit the job
The job definition describing how to run the job, including the container images to use

In the AWS Batch console, I choose Compute environments from the navigation pane and then Create. Now, I have the choice of using Fargate, Amazon EC2, or Amazon EKS. Fargate allows me to closely match the resource requirements that I specify in the job definitions. However, simulations usually require access to a large but static amount of resources and use GPUs to accelerate computations. For this reason, I select Amazon EC2.

I select the Managed orchestration type so that AWS Batch can scale and configure the EC2 instances for me. Then, I enter a name for the compute environment and select the service-linked role (that AWS Batch created for me previously) and the instance role that is used by the ECS container agent (running on the EC2 instances) to make calls to the AWS API on my behalf. I choose Next.

In the Instance configuration settings, I choose the size and type of the EC2 instances. For example, I can select instance types that have GPUs or use the Graviton processor. I do not have specific requirements and leave all the settings to their default values. For Network configuration, the console already selected my default VPC and the default security group. In the final step, I review all configurations and complete the creation of the compute environment.

Now, I choose Job queues from the navigation pane and then Create. Then, I select the same orchestration type I used for the compute environment (Amazon EC2). In the Job queue configuration, I enter a name for the job queue. In the Connected compute environments dropdown, I select the compute environment I just created and complete the creation of the queue.

I choose Job definitions from the navigation pane and then Create. As before, I select Amazon EC2 for the orchestration type.

To use more than one container, I disable the Use legacy containerProperties structure option and move to the next step. By default, the console creates a legacy single-container job definition if there’s already a legacy job definition in the account. That’s my case. For accounts without legacy job definitions, the console has this option disabled.

I enter a name for the job definition. Then, I have to think about which permissions this job requires. The container images I want to use for this job are stored in Amazon ECR private repositories. To allow AWS Batch to download these images to the compute environment, in the Task properties section, I select an Execution role that gives read-only access to the ECR repositories. I don’t need to configure a Task role because the simulation code is not calling AWS APIs. For example, if my code was uploading results to an Amazon Simple Storage Service (Amazon S3) bucket, I could select here a role giving permissions to do so.

In the next step, I configure the two containers used by this job. The first one is the maze-model. I enter the name and the image location. Here, I can specify the resource requirements of the container in terms of vCPUs, memory, and GPUs. This is similar to configuring containers for an ECS task.

I add a second container for the agent and enter name, image location, and resource requirements as before. Because the agent needs to access the maze as soon as it starts, I use the Dependencies section to add a container dependency. I select maze-model for the container name and START as the condition. If I don’t add this dependency, the agent container can fail before the maze-model container is running and able to respond. Because both containers are flagged as essential in this job definition, the overall job would terminate with a failure.

I review all configurations and complete the job definition. Now, I can start a job.

In the Jobs section of the navigation pane, I submit a new job. I enter a name and select the job queue and the job definition I just created.

In the next steps, I don’t need to override any configuration and create the job. After a few minutes, the job has succeeded, and I have access to the logs of the two containers.

The agent solved the maze, and I can get all the details from the logs. Here’s the output of the job to see how the agent started, picked up the key, and then found the exit.

SIZE {'width': 80, 'height': 20}
START {'x': 0, 'y': 18}
(32, 2) key found
(79, 16) exit
Solution length: 437
[(0, 18), (1, 18), (0, 18), ..., (79, 14), (79, 15), (79, 16)]

In the map, the red asterisks (*) follow the path used by the agent between the start (S), key (K), and exit (E) locations.

Increasing observability with a sidecar container
When running complex jobs using multiple components, it helps to have more visibility into what these components are doing. For example, if there is an error or a performance problem, this information can help you find where and what the issue is.

To instrument my application, I use AWS Distro for OpenTelemetry:

First, I update the agent and maze-model containers to use Python auto-instrumentation as described in this article. There are similar tutorials for other platforms, such as Go, Java, and JavaScript. To get information specific to my application, I have the option to manually instrument the code.
Then, I add a sidecar container using the AWS Distro for OpenTelemetry Collector from the Amazon ECR Public Gallery. The collector container will receive telemetry data from the other containers and send traces to AWS X-Ray and metrics to Amazon CloudWatch, Amazon Managed Service for Prometheus, or self-managed Prometheus. OpenTelemetry makes it easier to share data with multiple monitoring and observability platforms.

Using telemetry data collected in this way, I can set up dashboards (for example, using CloudWatch or Amazon Managed Grafana) and alarms (with CloudWatch or Prometheus) that help me better understand what is happening and reduce the time to solve an issue. More generally, a sidecar container can help integrate telemetry data from AWS Batch jobs with your monitoring and observability platforms.

Things to know
AWS Batch support for multi-container jobs is available today in the AWS Management Console, AWS Command Line Interface (AWS CLI), and AWS SDKs in all AWS Regions where Batch is offered. For more information, see the AWS Services by Region list.

There is no additional cost for using multi-container jobs with AWS Batch. In fact, there is no additional charge for using AWS Batch. You only pay for the AWS resources you create to store and run your application, such as EC2 instances and Fargate containers. To optimize your costs, you can use Reserved Instances, Savings Plan, EC2 Spot Instances, and Fargate in your compute environments.

Using multi-container jobs accelerates development times by reducing job preparation efforts and eliminates the need for custom tooling to merge the work of multiple teams into a single container. It also simplifies DevOps by defining clear component responsibilities so that teams can quickly identify and fix issues in their own areas of expertise without distraction.

To learn more, see how to set up multi-container jobs in the AWS Batch User Guide.

— Danilo

Generative AI Infrastructure at AWS

2024-01-31 Betsy Chernoff

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/generative-ai-infrastructure-at-aws/

Building and training generative artificial intelligence (AI) models, as well as predicting and providing accurate and insightful outputs requires a significant amount of infrastructure.

There’s a lot of data that goes into generating the high-quality synthetic text, images, and other media outputs that large-language models (LLMs), as well as foundational models (FMs), create. To start, the data set generally has somewhere around one billion variables present in the model that it was trained on (also known as parameters). To process that massive amount of data (think: petabytes), it can take hundreds of hardware accelerators (which are incorporated into purpose-built ML silicon or GPUs).

Given how much data is required for an effective LLM, it becomes costly and inefficient if an organization can’t access the data for these models as quickly as their GPUs/ML silicon are processing it. Selecting infrastructure for generative AI workloads impacts everything from cost to performance to sustainability goals to the ease of use. To successfully run training and inference for FMs organizations need:

Price-performant accelerated computing (including the latest GPUs and dedicated ML Silicon) to power large generative AI workloads.
High-performance and low-latency cloud storage that’s built to keep accelerators highly utilized.
The most performant and cutting-edge technologies, networking, and systems to support the infrastructure for a generative AI workload.
The ability to build with cloud services that can provide seamless integration across generative AI applications, tools, and infrastructure.

Overview of compute, storage, & networking for generative AI

Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing portfolio (including instances powered by GPUs and purpose-built ML silicon) offers the broadest choice of accelerators to power generative AI workloads.

To keep the accelerators highly utilized, they need constant access to data for processing. AWS provides this fast data transfer from storage (up to hundreds of GBs/TBs of data throughput) with Amazon FSx for Lustre and Amazon S3.

Accelerated computing instances combined with differentiated AWS technologies such as the AWS Nitro System, up to 3,200 Gbps of Elastic Fabric Adapter (EFA) networking, as well as exascale computing with Amazon EC2 UltraClusters helps to deliver the most performant infrastructure for generative AI workloads.

Coupled with other managed services such as Amazon SageMaker HyperPod and Amazon Elastic Kubernetes Service (Amazon EKS), these instances provide developers with the industry’s best platform for building and deploying generative AI applications.

This blog post will focus on highlighting announcements across Amazon EC2 instances, storage, and networking that are centered around generative AI.

AWS compute enhancements for generative AI workloads

Training large FMs requires extensive compute resources and because every project is different, a broad set of options are needed so that organization of all sizes can iterate faster, train more models, and increase accuracy. In 2023, there were a lot of launches across the AWS compute category that supported both training and inference workloads for generative AI.

One of those launches, Amazon EC2 Trn1n instances, doubled the network bandwidth (compared to Trn1 instances) to 1600 Gbps of Elastic Fabric Adapter (EFA). That increased bandwidth delivers up to 20% faster time-to-train relative to Trn1 for training network-intensive generative AI models, such as LLMs and mixture of experts (MoE).

Watashiha offers an innovative and interactive AI chatbot service, “OGIRI AI,” which uses LLMs to incorporate humor and offer a more relevant and conversational experience to their customers. “This requires us to pre-train and fine-tune these models frequently. We pre-trained a GPT-based Japanese model on the EC2 Trn1.32xlarge instance, leveraging tensor and data parallelism,” said Yohei Kobashi, CTO, Watashiha, K.K. “The training was completed within 28 days at a 33% cost reduction over our previous GPU based infrastructure. As our models rapidly continue to grow in complexity, we are looking forward to Trn1n instances which has double the network bandwidth of Trn1 to speed up training of larger models.”

AWS continues to advance its infrastructure for generative AI workloads, and recently announced that Trainium2 accelerators are also coming soon. These accelerators are designed to deliver up to 4x faster training than first generation Trainium chips and will be able to be deployed in EC2 UltraClusters of up to 100,000 chips, making it possible to train FMs and LLMs in a fraction of the time, while improving energy efficiency up to 2x.

AWS has continued to invest in GPU infrastructure over the years, too. To date, NVIDIA has deployed 2 million GPUs on AWS, across the Ampere and Grace Hopper GPU generations. That’s 3 zetaflops, or 3,000 exascale super computers. Most recently, AWS announced the Amazon EC2 P5 Instances that are designed for time-sensitive, large-scale training workloads that use NVIDIA CUDA or CuDNN and are powered by NVIDIA H100 Tensor Core GPUs. They help you accelerate your time to solution by up to 4x compared to previous-generation GPU-based EC2 instances, and reduce cost to train ML models by up to 40%. P5 instances help you iterate on your solutions at a faster pace and get to market more quickly.

And to offer easy and predictable access to highly sought-after GPU compute capacity, AWS launched Amazon EC2 Capacity Blocks for ML. This is the first consumption model from a major cloud provider that lets you reserve GPUs for future use (up to 500 deployed in EC2 UltraClusters) to run short duration ML workloads.

AWS is also simplifying training with Amazon SageMaker HyperPod, which automates more of the processes required for high-scale fault-tolerant distributed training (e.g., configuring distributed training libraries, scaling training workloads across thousands of accelerators, detecting and repairing faulty instances), speeding up training by as much as 40%. Customers like Perplexity AI elastically scale beyond hundreds of GPUs and minimize their downtime with SageMaker HyperPod.

Deep-learning inference is another example of how AWS is continuing its cloud infrastructure innovations, including the low-cost, high-performance Amazon EC2 Inf2 instances powered by AWS Inferentia2. These instances are designed to run high-performance deep-learning inference applications at scale globally. They are the most cost-effective and energy-efficient option on Amazon EC2 for deploying the latest innovations in generative AI.

Another example is with Amazon SageMaker, which helps you deploy multiple models to the same instance so you can share compute resources—reducing inference cost by 50%. SageMaker also actively monitors instances that are processing inference requests and intelligently routes requests based on which instances are available—achieving 20% lower inference latency (on average).

AWS invests heavily in the tools for generative AI workloads. For AWS ML silicon, AWS has focused on AWS Neuron, the software development kit (SDK) that helps customers get the maximum performance from Trainium and Inferentia. Neuron supports the most popular publicly available models, including Llama 2 from Meta, MPT from Databricks, Mistral from mistral.ai, and Stable Diffusion from Stability AI, as well as 93 of the top 100 models on the popular model repository Hugging Face. It plugs into ML frameworks like PyTorch and TensorFlow, and support for JAX is coming early this year. It’s designed to make it easy for AWS customers to switch from their existing model training and inference pipelines to Trainium and Inferentia with just a few lines of code.

Cloud storage on AWS enhancements for generative AI

Another way AWS is accelerating the training and inference pipelines is with improvements to storage performance—which is not only critical when thinking about the most common ML tasks (like loading training data into a large cluster of GPUs/accelerators), but also for checkpointing and serving inference requests. AWS announced several improvements to accelerate the speed of storage requests and reduce the idle time of your compute resources—which allows you to run generative AI workloads faster and more efficiently.

To gather more accurate predictions, generative AI workloads are using larger and larger datasets that require high-performant storage at scale to handle the sheer volume in of data.

With Amazon S3 Express One Zone a new storage class purpose-built to high-performance and low-latency object storage for an organizations most frequently accessed data, making it ideal for request-intensive operations like ML training and inference. Amazon S3 Express One Zone is the lowest-latency cloud object storage available, with data access speed up to 10x faster and request costs up to 50% lower than Amazon S3 Standard, from any AWS Availability Zone within an AWS Region.

AWS continues to optimize data access speeds for ML frameworks too. Recently, Amazon S3 Connector for PyTorch launched, which loads training data up to 40% faster than with the existing PyTorch connectors to Amazon S3. While most customers can meet their training and inference requirements using Mountpoint for Amazon S3 or Amazon S3 Connector for PyTorch, some are also building and managing their own custom data loaders. To deliver the fastest data transfer speeds between Amazon S3, and Amazon EC2 Trn1, P4d, and P5 instances, AWS recently announced the ability to automatically accelerate Amazon S3 data transfer in the AWS Command Line Interface (AWS CLI) and Python SDK. Now, training jobs download training data from Amazon S3 up to 3x faster and customers like Scenario are already seeing great results, with a 5x throughput improvement to model download times without writing a single line of code.

To meet the changing performance requirements that training generative AI workloads can require, Amazon FSx for Lustre announced throughput scaling on-demand. This is particularly useful for model training because it enables you to adjust the throughput tier of your file systems to meet these requirements with greater agility and lower cost.

EC2 networking enhancements for generative AI

Last year, AWS introduced EC2 UltraCluster 2.0, a flatter and wider network fabric that’s optimized specifically for the P5 instance and future ML accelerators. It allows us to reduce latency by 16% and supports up to 20,000 GPUs, with up to 10x the overall bandwidth. In a traditional cluster architecture, as clusters get physically bigger, latency will also generally increase. But, with UltraCluster 2.0, AWS is increasing the size while reducing latency, and that’s exciting.

AWS is also continuing to help you make your network more efficient. Take for example a recent launch with Amazon EC2 Instance Topology API. It gives you an inside look at the proximity between your instances, so you can place jobs strategically. Optimized job scheduling means faster processing for distributed workloads. Moving jobs that exchange data the most frequently to the same physical location in a cluster can eliminate multiple hops in the data path. As models push boundaries, this type of software innovation is key to getting the most out of your hardware.

In addition to Amazon Q (a generative AI powered assistant from AWS), AWS also launched Amazon Q networking troubleshooting (preview).

You can ask Amazon Q to assist you in troubleshooting network connectivity issues caused by network misconfiguration in your current AWS account. For this capability, Amazon Q works with Amazon VPC Reachability Analyzer to check your connections and inspect your network configuration to identify potential issues. With Amazon Q network troubleshooting, you can ask questions about your network in conversational English—for example, you can ask, “why can’t I SSH to my server,” or “why is my website not accessible”.

Conclusion

AWS is bringing customers even more choice for their infrastructure, including price-performant, sustainability focused, and ease-of-use options. Last year, AWS capabilities across this stack solidified our commitment to meeting the customer focus and goal of: Making generative AI accessible to customers of all sizes and technical abilities so they can get to reinventing and transforming what is possible.

Additional resources

For more information on AWS generative AI Infrastructure, go to the AWS Machine Learning Infrastructure page.
For more information on how AWS is building in the cloud with Generative AI across applications, tools, and infrastructure, go to the blog, “Welcome to a New Era of Building in the Cloud with Generative AI on AWS”.

How to use AWS Secrets Manager and ABAC for enhanced secrets management in Amazon EKS

2024-01-09 Nima Fotouhi

Post Syndicated from Nima Fotouhi original https://aws.amazon.com/blogs/security/how-to-use-aws-secrets-manager-and-abac-for-enhanced-secrets-management-in-amazon-eks/

In this post, we show you how to apply attribute-based access control (ABAC) while you store and manage your Amazon Elastic Kubernetes Services (Amazon EKS) workload secrets in AWS Secrets Manager, and then retrieve them by integrating Secrets Manager with Amazon EKS using External Secrets Operator to define more fine-grained and dynamic AWS Identity and Access Management (IAM) permission policies for accessing secrets.

It’s common to manage numerous workloads in an EKS cluster, each necessitating access to a distinct set of secrets. You can verify adherence to the principle of least privilege by creating separate permission policies for each workload to restrict their access. To scale and reduce overhead, Amazon Web Services (AWS) recommends using ABAC to manage workloads’ access to secrets. ABAC helps reduce the number of permission policies needed to scale with your environment.

What is ABAC?

In IAM, a traditional authorization approach is known as role-based access control (RBAC). RBAC sets permissions based on a person’s job function, commonly known as IAM roles. To enforce RBAC in IAM, distinct policies for various job roles are created. As a best practice, only the minimum permissions required for a specific role are granted (principle of least privilege), which is achieved by specifying the resources that the role can access. A limitation of the RBAC model is its lack of flexibility. Whenever new resources are introduced, you must modify policies to permit access to the newly added resources.

Attribute-based access control (ABAC) is an approach to authorization that assigns permissions in accordance with attributes, which in the context of AWS are referred to as tags. You create and add tags to your IAM resources. You then create and configure ABAC policies to permit operations requested by a principal when there’s a match between the tags of the principal and the resource. When a principal uses temporary credentials to make a request, its associated tags come from session tags, incoming transitive sessions tags, and IAM tags. The principal’s IAM tags are persistent, but session tags, and incoming transitive session tags are temporary and set when the principal assumes an IAM role. Note that AWS tags are attached to AWS resources, whereas session tags are only valid for the current session and expire with the session.

How External Secrets Operator works

External Secrets Operator (ESO) is a Kubernetes operator that integrates external secret management systems including Secrets Manager with Kubernetes. ESO provides Kubernetes custom resources to extend Kubernetes and integrate it with Secrets Manager. It fetches secrets and makes them available to other Kubernetes resources by creating Kubernetes Secrets. At a basic level, you need to create an ESO SecretStore resource and one or more ESO ExternalSecret resources. The SecretStore resource specifies how to access the external secret management system (Secrets Manager) and allows you to define ABAC related properties (for example, session tags and transitive tags).

You declare what data (secret) to fetch and how the data should be transformed and saved as a Kubernetes Secret in the ExternalSecret resource. The following figure shows an overview of the process for creating Kubernetes Secrets. Later in this post, we review the steps in more detail.

Figure 1: ESO process

How to use ESO for ABAC

Before creating any ESO resources, you must make sure that the operator has sufficient permissions to access Secrets Manager. ESO offers multiple ways to authenticate to AWS. For the purpose of this solution, you will use the controller’s pod identity. To implement this method, you configure the ESO service account to assume an IAM role for service accounts (IRSA), which is used by ESO to make requests to AWS.

To adhere to the principle of least privilege and verify that each Kubernetes workload can access only its designated secrets, you will use ABAC policies. As we mentioned, tags are the attributes used for ABAC in the context of AWS. For example, principal and secret tags can be compared to create ABAC policies to deny or allow access to secrets. Secret tags are static tags assigned to secrets symbolizing the workload consuming the secret. On the other hand, principal (requester) tags are dynamically modified, incorporating workload specific tags. The only viable option to dynamically modifying principal tags is to use session tags and incoming transitive session tags. However, as of this writing, there is no way to add session and transitive tags when assuming an IRSA. The workaround for this issue is role chaining and passing session tags when assuming downstream roles. ESO offers role chaining, meaning that you can refer to one or more IAM roles with access to Secrets Manager in the SecretStore resource definition, and ESO will chain them with its IRSA to access secrets. It also allows you to define session tags and transitive tags to be passed when ESO assumes the IAM roles with its primary IRSA. The ability to pass session tags allows you to implement ABAC and compare principal tags (including session tags) with secret tags every time ESO sends a request to Secrets Manager to fetch a secret. The following figure shows ESO authentication process with role chaining in one Kubernetes namespace.

Figure 2: ESO AWS authentication process with role chaining (single namespace)

Architecture overview

Let’s review implementing ABAC with a real-world example. When you have multiple workloads and services in your Amazon EKS cluster, each service is deployed in its own unique namespace, and service secrets are stored in Secrets Manager and tagged with a service name (key=service, value=service name). The following figure shows the required resources to implement ABAC with EKS and Secrets Manager.

Figure 3: Amazon EKS secrets management with ABAC

Prerequisites

An AWS account
Permission to create IAM resources
Permission to create Secrets Manager secrets
Permissions to manage KMS keys for Secrets Manager secrets
An existing Amazon EKS cluster
A user who can modify your Kubernetes cluster
AWS Command Line Interface (AWS CLI) and kubectl installed
Helm and eksctl installed

Deploy the solution

Begin by installing ESO:

From a terminal where you usually run your helm commands, run the following helm command to add an ESO helm repository.
```
helm repo add external-secrets https://charts.external-secrets.io
```

Install ESO using the following helm command in a terminal that has access to your target Amazon EKS cluster:

helm install external-secrets \
   external-secrets/external-secrets \
    -n external-secrets \
    --create-namespace \
   --set installCRDs=true

To verify ESO installation, run the following command. Make sure you pass the same namespace as the one you used when installing ESO:
```
kubectl get pods -n external-secrets
```

See the ESO Getting started documentation page for more information on other installation methods, installation options, and how to uninstall ESO.

Create an IAM role to access Secrets Manager secrets

You must create an IAM role with access to Secrets Manager secrets. Start by creating a customer managed policy to attach to your role. Your policy should allow reading secrets from Secrets Manager. The following example shows a policy that you can create for your role:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",k
			"Action": [
				"kms:ListKeys",
				"kms:ListAliases",
				"secretsmanager:ListSecrets"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"kms:Decrypt",
				"kms:DescribeKey"
			],
			"Resource": <KMS Key ARN>
		},
		{
			"Effect": "Allow",
			"Action": [ 
				"secretsmanager:GetSecretValue",
				"secretsmanager:DescribeSecret",
				"secretsmanager:ListSecretVersionIds"
			],
			"Resource": "*",
			"Condition": {
				"StringEquals": {
					"secretsmanager:ResourceTag/ekssecret": "${aws:PrincipalTag/ekssecret}"
				}
			}
		}
	]
}

Consider the following in this policy:

Secrets Manager uses an AWS managed key for Secrets Manager by default to encrypt your secrets. It’s recommended to specify another encryption key during secret creation and have separate keys for separate workloads. Modify the resource element of the second policy statement and replace <KMS Key ARN> with the KMS key ARNs used to encrypt your secrets. If you use the default key to encrypt your secrets, you can remove this statement.
The policy statement conditionally allows access to all secrets. The condition element permits access only when the value of the principal tag, identified by the key service, matches the value of the secret tag with the same key. You can include multiple conditions (in separate statements) to match multiple tags.

After you create your policy, follow the guide for Creating IAM roles to create your role, attaching the policy you created. Use the default value for your role’s trust relationship for now, you will update the trust relationship in the next step. Note the role’s ARN after creation.

Create an IAM role for the ESO service account

Use eksctl to create the IAM role for the ESO service account (IRSA). Before creating the role, you must create an IAM policy. ESO IRSA only needs permission to assume the Secrets Manager access role that you created in the previous step.

Use the following example of an IAM policy that you can create. Replace <Secrets Manager Access Role ARN> with the ARN of the role you created in the previous step and follow creating a customer managed policy to create the policy. After creating the policy, note the policy ARN.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Resource": "<Secrets Manager Access Role ARN>"
        }
    ]
}

Next, run the following command to get the account name of the ESO service. You will see a list of service accounts, pick the one that has the same name as your helm release, in this example, the service account is external-secrets.
```
kubectl get serviceaccounts -n external-secrets
```
Next, create an IRSA and configure an ESO service account to assume the role. Run the following command to create a new role and associate it with the ESO service account. Replace the variables in brackets (<example>) with your specific information:
```
eksctl create iamserviceaccount --name <ESO service account> \
--namespace <ESO namespace> --cluster <cluster name> \
--role-name <IRSA name> --override-existing-serviceaccounts \
--attach-policy-arn <policy arn you created earlier> --approve
```
You can validate the operation by following the steps listed in Configuring a Kubernetes service account to assume an IAM role. Note that you had to pass the ‑‑override-existing-serviceaccounts argument because the ESO service account was already created.
After you’ve validated the operation, run the following command to retrieve the IRSA ARN (replace <IRSA name> with the name you used in the previous step):
```
aws iam get-role --role-name <IRSA name> --query Role.Arn
```

Modify the trust relationship of the role you created previously and limit it to your newly created IRSA. The following should resemble your trust relationship. Replace <IRSA Arn> with the IRSA ARN returned in the previous step:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS ACCOUNT ID>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "<IRSA Arn>"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<IRSA Arn>"
            },
            "Action": "sts:TagSession",
            "Condition": {
                "StringLike": {
                    "aws:RequestTag/ekssecret": "*"
                }
            }
        }
    ]
}

Note that you will be using session tags to implement ABAC. When using session tags, trust policies for all roles connected to the identity provider (IdP) passing the tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.

Moreover, the condition block of the second statement limits ESO’s ability to pass session tags with the key name ekssecret. We’re using this condition to verify that the ESO role can only create session tags used for accessing secrets manager, and doesn’t gain the ability to set principal tags that might be used for any other purpose. This way, you’re creating a namespace to help prevent further privilege escalations or escapes.

Create secrets in Secrets Manager

You can create two secrets in Secrets Manager and tag them.

Follow the steps in Create an AWS Secrets Manager secret to create two secrets named service1_secret and service2_secret. Add the following tags to your secrets:
- service1_secret:
  - key=ekssecret, value=service1
- service2_secret:
  - key=ekssecret, value=service2
Run the following command to verify both secrets are created and tagged properly:
```
aws secretsmanager list-secrets --query 'SecretList[*].{Name:Name, Tags:Tags}'
```

Create ESO objects in your cluster

Create two namespaces in your cluster:

❯ kubectl create ns service1-ns
❯ kubectl create ns service2-ns

Assume that service1-ns hosts service1 and service2-ns hosts service2. After creating the namespaces for your services, verify that each service is restricted to accessing secrets that are tagged with a specific key-value pair. In this example the key should be ekssecret and the value should match the name of the corresponding service. This means that service1 should only have access to service1_secret, while service2 should only have access to service2_secret. Next, declare session tags in SecretStore object definitions.

Edit the following command snippet using the text editor of your choice and replace every instance of <Secrets Manager Access Role ARN> with the ARN of the IAM role you created earlier to access Secrets Manager secrets. Copy and paste the edited command in your terminal and run it to create a .yaml file in your working directory that contains the SecretStore definitions. Make sure to change the AWS Region to reflect the Region of your Secrets Manager.

cat > secretstore.yml <<EOF
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secretsmanager
  namespace: service1-ns
spec:
  provider:
    aws:
      service: SecretsManager
      role: <Secrets Manager Access Role ARN>
      region: us-west-2
      sessionTags:
        - key: ekssecret
          value: service1
---
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secretsmanager
  namespace: service2-ns
spec:
  provider:
    aws:
      service: SecretsManager
      role: <Secrets Manager Access Role ARN>
      region: us-west-2
      sessionTags:
        - key: ekssecret
          value: service2
EOF

Create SecretStore objects by running the following command:
```
kubectl apply -f secretstore.yml
```
Validate object creation by running the following command:
```
kubectl describe secretstores.external-secrets.io -A
```
Check the status and events section for each object and make sure the store is validated.

Next, create two ExternalSecret objects requesting service1_secret and service2_secret. Copy and paste the following command in your terminal and run it. The command will create a .yaml file in your working directory that contains ExternalSecret definitions.

cat > exrternalsecret.yml <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: service1-es1
  namespace: service1-ns
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: service1-ns-secret1
    creationPolicy: Owner
  data:
  - secretKey: service1_secret
    remoteRef:
      key: "service1_secret"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: service2-es2
  namespace: service2-ns
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: service1-ns-secret2
    creationPolicy: Owner
  data:
  - secretKey: service2_secret
    remoteRef:
      key: "service2_secret"
EOF

Run the following command to create objects:
```
kubectl apply -f exrternalsecret.yml
```
Verify the objects are created by running following command:
```
kubectl get externalsecrets.external-secrets.io -A
```
Each ExternalSecret object should create a Kubernetes secret in the same namespace it was created in. Kubernetes secrets are accessible to services in the same namespace. To demonstrate that both Service A and Service B has access to their secrets, run the following command.
```
kubectl get secrets -A
```

You should see service1-ns-secret1 created in service1-ns namespace which is accessible to Service 1, and service1-ns-secret2 created in service2-ns which is accessible to Service2.

Try creating an ExternalSecrets object in service1-ns referencing service2_secret. Notice that your object shows SecretSyncedError status. This is the expected behavior, because ESO passes different session tags for ExternalSecret objects in each namespace, and when the tag where key is ekssecret doesn’t match the secret tag with the same key, the request will be rejected.

What about AWS Secrets and Configuration Provider (ASCP)?

Amazon offers a capability called AWS Secrets and Configuration Provider (ASCP), which allows applications to consume secrets directly from external stores, including Secrets Manager, without modifying the application code. ASCP is actively maintained by AWS, which makes sure that it remains up to date and aligned with the latest features introduced in Secrets Manager. See How to use AWS Secrets & Configuration Provider with your Kubernetes Secrets Store CSI driver to learn more about how to use ASCP to retrieve secrets from Secrets Manager.

Today, customers who use AWS Fargate with Amazon EKS can’t use the ASCP method due to the incompatibility of daemonsets on Fargate. Kubernetes also doesn’t provide a mechanism to add specific claims to JSON web tokens (JWT) used to assume IAM roles. Today, when using ASCP in Kubernetes, which assumes IAM roles through IAM roles for service accounts (IRSA), there’s a constraint in appending session tags during the IRSA assumption due to JWT claim restrictions, limiting the ability to implement ABAC.

With ESO, you can create Kubernetes Secrets and have your pods retrieve secrets from them instead of directly mounting secrets as volumes in your pods. ESO is also capable of using its controller pod’s IRSA to retrieve secrets, so you don’t need to set up IRSA for each pod. You can also role chain and specify secondary roles to be assumed by ESO IRSA and pass session tags to be used with ABAC policies. ESO’s role chaining and ABAC capabilities help decrease the number of IAM roles required for secrets retrieval. See Leverage AWS secrets stores from EKS Fargate with External Secrets Operator on the AWS Containers blog to learn how to use ESO on an EKS Fargate cluster to consume secrets stored in Secrets Manager.

Conclusion

In this blog post, we walked you through how to implement ABAC with Amazon EKS and Secrets Manager using External Secrets Operator. Implementing ABAC allows you to create a single IAM role for accessing Secrets Manager secrets while implementing granular permissions. ABAC also decreases your team’s overhead and reduces the risk of misconfigurations. With ABAC, you require fewer policies and don’t need to update existing policies to allow access to new services and workloads.

If you have feedback about this post, submit comments in the Comments section below.

Best Practices to help secure your container image build pipeline by using AWS Signer

2023-12-22 Jorge Castillo

Post Syndicated from Jorge Castillo original https://aws.amazon.com/blogs/security/best-practices-to-help-secure-your-container-image-build-pipeline-by-using-aws-signer/

AWS Signer is a fully managed code-signing service to help ensure the trust and integrity of your code. It helps you verify that the code comes from a trusted source and that an unauthorized party has not accessed it. AWS Signer manages code signing certificates and public and private keys, which can reduce the overhead of your public key infrastructure (PKI) management. It also provides a set of features to simplify lifecycle management of your keys and certificates so that you can focus on signing and verifying your code.

In June 2023, AWS announced Container Image Signing with AWS Signer and Amazon EKS, a new capability that gives you native AWS support for signing and verifying container images stored in Amazon Elastic Container Registry (Amazon ECR).

Containers and AWS Lambda functions are popular serverless compute solutions for applications built on the cloud. By using AWS Signer, you can verify that the software running in these workloads originates from a trusted source.

In this blog post, you will learn about the benefits of code signing for software security, governance, and compliance needs. Flexible continuous integration and continuous delivery (CI/CD) integration, management of signing identities, and native integration with other AWS services can help you simplify code security through automation.

Background

Code signing is an important part of the software supply chain. It helps ensure that the code is unaltered and comes from an approved source.

To automate software development workflows, organizations often implement a CI/CD pipeline to push, test, and deploy code effectively. You can integrate code signing into the workflow to help prevent untrusted code from being deployed, as shown in Figure 1. Code signing in the pipeline can provide you with different types of information, depending on how you decide to use the functionality. For example, you can integrate code signing into the build stage to attest that the code was scanned for vulnerabilities, had its software bill of materials (SBOM) approved internally, and underwent unit and integration testing. You can also use code signing to verify who has pushed or published the code, such as a developer, team, or organization. You can verify each of these steps separately by including multiple signing stages in the pipeline. For more information on the value provided by container image signing, see Cryptographic Signing for Containers.

Figure 1: Security IN the pipeline

In the following section, we will walk you through a simple implementation of image signing and its verification for Amazon Elastic Kubernetes Service (Amazon EKS) deployment. The signature attests that the container image went through the pipeline and came from a trusted source. You can use this process in more complex scenarios by adding multiple AWS CodeBuild code signing stages that make use of various AWS Signer signing profiles.

Services and tools

In this section, we discuss the various AWS services and third-party tools that you need for this solution.

CI/CD services

For the CI/CD pipeline, you will use the following AWS services:

AWS CodePipeline — a fully managed continuous delivery service that you can use to automate your release pipelines for fast and reliable application and infrastructure updates.
AWS CodeCommit — a fully managed source control service that hosts secure Git-based repositories.
AWS Signer — a fully managed code-signing service that you can use to help ensure the trust and integrity of your code.
AWS CodeBuild — A fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy.

Container services

You will use the following AWS services for containers for this walkthrough:

Amazon EKS — a managed Kubernetes service to run Kubernetes in the AWS Cloud and on-premises data centers.
Amazon ECR — a fully managed container registry for high-performance hosting, so that you can reliably deploy application images and artifacts anywhere.

Verification tools

The following are publicly available sign verification tools that we integrated into the pipeline for this post, but you could integrate other tools that meet your specific requirements.

Notation — A publicly available Notary project within the Cloud Native Computing Foundation (CNCF). With contributions from AWS and others, Notary is an open standard and client implementation that allows for vendor-specific plugins for key management and other integrations. AWS Signer manages signing keys, key rotation, and PKI management for you, and is integrated with Notation through a curated plugin that provides a simple client-based workflow.
Kyverno — A publicly available policy engine that is designed for Kubernetes.

Solution overview

Figure 2: Solution architecture

Here’s how the solution works, as shown in Figure 2:

Developers push Dockerfiles and application code to CodeCommit. Each push to CodeCommit starts a pipeline hosted on CodePipeline.
CodeBuild packages the build, containerizes the application, and stores the image in the ECR registry.
CodeBuild retrieves a specific version of the image that was previously pushed to Amazon ECR. AWS Signer and Notation sign the image by using the signing profile established previously, as shown in more detail in Figure 3.

Figure 3: Signing images described

AWS Signer and Notation verify the signed image version and then deploy it to an Amazon EKS cluster.

If the image has not previously been signed correctly, the CodeBuild log displays an output similar to the following:

Error: signature verification failed: no signature is associated with "<AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server@<DIGEST>" , make sure the artifact was signed successfully

If there is a signature mismatch, the CodeBuild log displays an output similar to the following:

Error: signature verification failed for all the signatures associated with <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server@<DIGEST>

Kyverno verifies the container image signature for use in the Amazon EKS cluster.
Figure 4 shows steps 4 and 5 in more detail.

Figure 4: Verification of image signature for Kubernetes

Prerequisites

Before getting started, make sure that you have the following prerequisites in place:

An Amazon EKS cluster provisioned.
An Amazon ECR repository for your container images.
A CodeCommit repository with your application code. For more information, see Create an AWS CodeCommit repository.
A CodePipeline pipeline deployed with the CodeCommit repository as the code source and four CodeBuild stages: Build, ApplicationSigning, ApplicationDeployment, and VerifyContainerSign. The CI/CD pipeline should look like that in Figure 5.

Figure 5: CI/CD pipeline with CodePipeline

Walkthrough

You can create a signing profile by using the AWS Command Line Interface (AWS CLI), AWS Management Console or the AWS Signer API. In this section, we’ll walk you through how to sign the image by using the AWS CLI.

To sign the image (AWS CLI)

Create a signing profile for each identity.

# Create an AWS Signer signing profile with default validity period
$ aws signer put-signing-profile \
    --profile-name build_signer \
    --platform-id Notation-OCI-SHA384-ECDSA

Sign the image from the CodeBuild build—your buildspec.yaml configuration file should look like the following:

version: 0.2

phases:
  pre_build:
    commands:
      - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
      - REPOSITORY_URI=$AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=${COMMIT_HASH:=latest}
      - DIGEST=$(docker manifest inspect $AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server:$IMAGE_TAG -v | jq -r '.Descriptor.digest')
      - echo $DIGEST
      
      - wget https://d2hvyiie56hcat.cloudfront.net/linux/amd64/installer/rpm/latest/aws-signer-notation-cli_amd64.rpm
      - sudo rpm -U aws-signer-notation-cli_amd64.rpm
      - notation version
      - notation plugin ls
  build:
    commands:
      - notation sign $REPOSITORY_URI@$DIGEST --plugin com.amazonaws.signer.notation.plugin --id arn:aws:signer: $AWS_REGION:$AWS_ACCOUNT_ID:/signing-profiles/notation_container_signing
      - notation inspect $AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server@$DIGEST
      - notation verify $AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server@$DIGEST
  post_build:
    commands:
      - printf '[{"name":"hello-server","imageUri":"%s"}]' $REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json
artifacts:
    files: imagedefinitions.json

The commands in the buildspec.yaml configuration file do the following:

Sign you in to Amazon ECR to work with the Docker images.
Reference the specific image that will be signed by using the commit hash (or another versioning strategy that your organization uses). This gets the digest.
Sign the container image by using the notation sign command. This command uses the container image digest, instead of the image tag.
Install the Notation CLI. In this example, you use the installer for Linux. For a list of installers for various operating systems, see the AWS Signer Developer Guide,
Sign the image by using the notation sign command.
Inspect the signed image to make sure that it was signed successfully by using the notation inspect command.

To verify the signed image, use the notation verify command. The output should look similar to the following:

Successfully verified signature for <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server@<DIGEST>

(Optional) For troubleshooting, print the notation policy from the pipeline itself to check that it’s working as expected by running the notation policy show command:

notation policy show

For this, include the command in the pre_build phase after the notation version command in the buildspec.yaml configuration file.

After the notation policy show command runs, CodeBuild logs should display an output similar to the following:

{
  "version": "1.0",
  "trustPolicies": [
    {
      "name": "aws-signer-tp",
      "registryScopes": [
      "<AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server"
      ],
      "signatureVerification": {
        "level": "strict"
      },
      "trustStores": [
        "signingAuthority:aws-signer-ts"
      ],
      "trustedIdentities": [
        "arn:aws:signer:<AWS_REGION>:<AWS_ACCOUNT_ID>:/signing-profiles/notation_test"
      ]
    }
  ]
}

To verify the image in Kubernetes, set up both Kyverno and the Kyverno-notation-AWS Signer in your EKS cluster. To get started with Kyverno and the Kyverno-notation-AWS Signer solution, see the installation instructions.

After you install Kyverno and Kyverno-notation-AWS Signer, verify that the controller is running—the STATUS should show Running:

$ kubectl get pods -n kyverno-notation-aws -w

NAME                                    READY   STATUS    RESTARTS   AGE
kyverno-notation-aws-75b7ddbcfc-kxwjh   1/1     Running   0          6h58m

Configure the CodeBuild buildspec.yaml configuration file to verify that the images deployed in the cluster have been previously signed. You can use the following code to configure the buildspec.yaml file.

version: 0.2

phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws --version
      - REPOSITORY_URI=${REPO_ECR}
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=${COMMIT_HASH:=latest}
      - curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
      - curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
      - echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
      — chmod +x kubectl
      - mv ./kubectl /usr/local/bin/kubectl
      - kubectl version --client
  build:
    commands:
      - echo Build started on `date`
      - aws eks update-kubeconfig -—name ${EKS_NAME} —-region ${AWS_DEFAULT_REGION}
      - echo Deploying Application
      - sed -i '/image:\ image/image:\ '\"${REPOSITORY_URI}:${IMAGE_TAG}\"'/g' deployment.yaml
      - kubectl apply -f deployment.yaml 
      - KYVERNO_NOTATION_POD=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" -n kyverno-notation-aws)
      - STATUS=$(kubectl logs --tail=1 kyverno-notation-aws-75b7ddbcfc-kxwjh -n kyverno-notation-aws | grep $IMAGE_TAG | grep ERROR)
      - |
        if [[ $STATUS ]]; then
          echo "There is an error"
          exit 1
        else
          echo "No Error"
        fi
  post_build:
    commands:
      - printf '[{"name":"hello-server","imageUri":"%s"}]' $REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json
artifacts:
    files: imagedefinitions.json

The commands in the buildspec.yaml configuration file do the following:

Set up the environment variables, such as the ECR repository URI and the Commit hash, to build the image tag. The kubectl tool will use this later to reference the container image that will be deployed with the Kubernetes objects.
Use kubectl to connect to the EKS cluster and insert the container image reference in the deployment.yaml file.
After the container is deployed, you can observe the kyverno-notation-aws controller and access its logs. You can check if the deployed image is signed. If the logs contain an error, stop the pipeline run with an error code, do a rollback to a previous version, or delete the deployment if you detect that the image isn’t signed.

Decommission the AWS resources

If you no longer need the resources that you provisioned for this post, complete the following steps to delete them.

To clean up the resources

Delete the EKS cluster and delete the ECR image.
Delete the IAM roles and policies that you used for the configuration of IAM roles for service accounts.
Revoke the AWS Signer signing profile that you created and used for the signing process by running the following command in the AWS CLI:
```
$ aws signer revoke-signing-profile
```

Delete signatures from the Amazon ECR repository. Make sure to replace <AWS_ACCOUNT_ID> and <AWS_REGION> with your own information.

# Use oras CLI, with Amazon ECR Docker Credential Helper, to delete signature
$ oras manifest delete <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/pause@sha256:ca78e5f730f9a789ef8c63bb55275ac12dfb9e8099e6a0a64375d8a95ed501c4

Note: Using the ORAS project’s oras client, you can delete signatures and other reference type artifacts. It implements deletion by first removing the reference from an index, and then deleting the manifest.

Conclusion

In this post, you learned how to implement container image signing in a CI/CD pipeline by using AWS services such as CodePipeline, CodeBuild, Amazon ECR, and AWS Signer along with publicly available tools such as Notary and Kyverno. By implementing mandatory image signing in your pipelines, you can confirm that only validated and authorized container images are deployed to production. Automating the signing process and signature verification is vital to help securely deploy containers at scale. You also learned how to verify signed images both during deployment and at runtime in Kubernetes. This post provides valuable insights for anyone looking to add image signing capabilities to their CI/CD pipelines on AWS to provide supply chain security assurances. The combination of AWS managed services and publicly available tools provides a robust implementation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Let’s Architect! Security in software architectures

2023-08-16 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-security-in-software-architectures/

Security is fundamental for each product and service you are building with. Whether you are working on the back-end or the data and machine learning components of a system, the solution should be securely built.

In 2022, we discussed security in our post Let’s Architect! Architecting for Security. Today, we take a closer look at general security practices for your cloud workloads to secure both networks and applications, with a mix of resources to show you how to architect for security using the services offered by Amazon Web Services (AWS).

In this edition of Let’s Architect!, we share some practices for protecting your workloads from the most common attacks, introduce the Zero Trust principle (you can learn how AWS itself is implementing it!), plus how to move to containers and/or alternative approaches for managing your secrets.

A deep dive on the current security threat landscape with AWS

This session from AWS re:Invent, security engineers guide you through the most common threat vectors and vulnerabilities that AWS customers faced in 2022. For each possible threat, you can learn how it’s implemented by attackers, the weaknesses attackers tend to leverage, and the solutions offered by AWS to avert these security issues. We describe this as fundamental architecting for security: this implies adopting suitable services to protect your workloads, as well as follow architectural practices for security.

Take me to this re:Invent 2022 session!

Statistics about common attacks and how they can be launched

Zero Trust: Enough talk, let’s build better security

What is Zero Trust? It is a security model that produces higher security outcomes compared with the traditional network perimeter model.

How does Zero Trust work in practice, and how can you start adopting it? This AWS re:Invent 2022 session defines the Zero Trust models and explains how to implement one. You can learn how it is used within AWS, as well as how any architecture can be built with these pillars in mind. Furthermore, there is a practical use case to show you how Delphix put Zero Trust into production.

Take me to this re:Invent 2022 session!

AWS implements the Zero Trust principle for managing interactions across different services

A deep dive into container security on AWS

Nowadays, it’s vital to have a thorough understanding of a container’s underlying security layers. AWS services, like Amazon Elastic Kubernetes Service and Amazon Elastic Container Service, have harnessed these Linux security-layer protections, keeping a sharp focus on the principle of least privilege. This approach significantly minimizes the potential attack surface by limiting the permissions and privileges of processes, thus upholding the integrity of the system.

This re:Inforce 2023 session discusses best practices for securing containers for your distributed systems.

Take me to this re:Inforce 2023 session!

Fundamentals and best practices to secure containers

Migrating your secrets to AWS Secrets Manager

Secrets play a critical role in providing access to confidential systems and resources. Ensuring the secure and consistent management of these secrets, however, presents a challenge for many organizations.

Anti-patterns observed in numerous organizational secrets management systems include sharing plaintext secrets via unsecured means, such as emails or messaging apps, which can allow application developers to view these secrets in plaintext or even neglect to rotate secrets regularly. This detailed guidance walks you through the steps of discovering and classifying secrets, plus explains the implementation and migration processes involved in transferring secrets to AWS Secrets Manager.

Take me to this AWS Security Blog post!

An organization’s perspectives and responsibilities when building a secrets management solution

Conclusion

We’re glad you joined our conversation on building secure architectures! Join us in a couple of weeks when we’ll talk about cost optimization on AWS.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

Deploy container applications in a multicloud environment using Amazon CodeCatalyst

2023-07-31 Pawan Shrivastava

Post Syndicated from Pawan Shrivastava original https://aws.amazon.com/blogs/devops/deploy-container-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

In the previous post of this blog series, we saw how organizations can deploy workloads to virtual machines (VMs) in a hybrid and multicloud environment. This post shows how organizations can address the requirement of deploying containers, and containerized applications to hybrid and multicloud platforms using Amazon CodeCatalyst. CodeCatalyst is an integrated DevOps service which enables development teams to collaborate on code, and build, test, and deploy applications with continuous integration and continuous delivery (CI/CD) tools.

One prominent scenario where multicloud container deployment is useful is when organizations want to leverage AWS’ broadest and deepest set of Artificial Intelligence (AI) and Machine Learning (ML) capabilities by developing and training AI/ML models in AWS using Amazon SageMaker, and deploying the model package to a Kubernetes platform on other cloud platforms, such as Azure Kubernetes Service (AKS) for inference. As shown in this workshop for operationalizing the machine learning pipeline, we can train an AI/ML model, push it to Amazon Elastic Container Registry (ECR) as an image, and later deploy the model as a container application.

Scenario description

The solution described in the post covers the following steps:

Setup Amazon CodeCatalyst environment.
Create a Dockerfile along with a manifest for the application, and a repository in Amazon ECR.
Create an Azure service principal which has permissions to deploy resources to Azure Kubernetes Service (AKS), and store the credentials securely in Amazon CodeCatalyst secret.
Create a CodeCatalyst workflow to build, test, and deploy the containerized application to AKS cluster using Github Actions.

The architecture diagram for the scenario is shown in Figure 1.

Solution architecture diagram

Figure 1 – Solution Architecture

Solution Walkthrough

This section shows how to set up the environment, and deploy a HTML application to an AKS cluster.

Setup Amazon ECR and GitHub code repository

Create a new Amazon ECR and a code repository. In this case we’re using GitHub as the repository but you can create a source repository in CodeCatalyst or you can choose to link an existing source repository hosted by another service if that service is supported by an installed extension. Then follow the application and Docker image creation steps outlined in Step 1 in the environment creation process in exposing Multiple Applications on Amazon EKS. Create a file named manifest.yaml as shown, and map the “image” parameter to the URL of the Amazon ECR repository created above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multicloud-container-deployment-app
  labels:
    app: multicloud-container-deployment-app
spec:
  selector:
    matchLabels:
      app: multicloud-container-deployment-app
  replicas: 2
  template:
    metadata:
      labels:
        app: multicloud-container-deployment-app
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": linux
      containers:
      - name: ecs-web-page-container
        image: <aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository>
        imagePullPolicy: Always
        ports:
            - containerPort: 80
        resources:
          limits:
            memory: "100Mi"
            cpu: "200m"
      imagePullSecrets:
          - name: ecrsecret
---
apiVersion: v1
kind: Service
metadata:
  name: multicloud-container-deployment-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: multicloud-container-deployment-app

Push the files to Github code repository. The multicloud-container-app github repository should look similar to Figure 2 below

Files in multicloud container app github repository

Figure 2 – Files in Github repository

Configure Azure Kubernetes Service (AKS) cluster to pull private images from ECR repository

Pull the docker images from a private ECR repository to your AKS cluster by running the following command. This setup is required during the azure/k8s-deploy Github Actions in the CI/CD workflow. Authenticate Docker to an Amazon ECR registry with get-login-password by using aws ecr get-login-password. Run the following command in a shell where AWS CLI is configured, and is used to connect to the AKS cluster. This creates a secret called ecrsecret, which is used to pull an image from the private ECR repository.

kubectl create secret docker-registry ecrsecret\
 --docker-server=<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository>\
 --docker-username=AWS\
 --docker-password= $(aws ecr get-login-password --region us-west-2)

Provide ECR URI in the variable “–docker-server =”.

CodeCatalyst setup

Follow these steps to set up CodeCatalyst environment:

Create a CodeCatalyst space and associated AWS account.
Create a CodeCatalyst project.
A CodeCatalyst environment connected to the AWS account, where the ECR repository is configured.

Configure access to the AKS cluster

In this solution, we use three GitHub Actions – azure/login, azure/aks-set-context and azure/k8s-deploy – to login, set the AKS cluster, and deploy the manifest file to the AKS cluster respectively. For the Github Actions to access the Azure environment, they require credentials associated with an Azure Service Principal.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Create the Service principal by running the following command in the azure cloud shell:

az ad sp create-for-rbac \
    --name "ghActionHTMLapplication" \
    --scope /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP> \
    --role Contributor \
    --sdk-auth

The command generates a JSON output (shown in Figure 3), which is stored in CodeCatalyst secret called AZURE_CREDENTIALS. This credential is used by azure/login Github Actions.

JSON output stored in AZURE-CREDENTIALS secret

Figure 3 – JSON output

Configure secrets inside CodeCatalyst Project

Create three secrets CLUSTER_NAME (Name of AKS cluster), RESOURCE_GROUP(Name of Azure resource group) and AZURE_CREDENTIALS(described in the previous step) as described in the working with secret document. The secrets are shown in Figure 4.

Secrets in CodeCatalyst

Figure 4 – CodeCatalyst Secrets

CodeCatalyst CI/CD Workflow

To create a new CodeCatalyst workflow, select CI/CD from the navigation on the left and select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.

Create CodeCatalyst CI/CD workflow

Figure 5 – Create CodeCatalyst CI/CD workflow

Add “Push to Amazon ECR” Action

Add the Push to Amazon ECR action, and configure the environment where you created the ECR repository as shown in Figure 6. Refer to adding an action to learn how to add CodeCatalyst action.

Create ‘Push to ECR’ CodeCatalyst Action

Figure 6 – Create ‘Push to ECR’ Action

Select the Configuration tab and specify the configurations as shown in Figure7.

Configure ‘Push to ECR’ CodeCatalyst Action

Figure 7 – Configure ‘Push to ECR’ Action

Configure the Deploy action

1. Add a GitHub action for deploying to AKS as shown in Figure 8.

Github action to deploy to AKS

Figure 8 – Github action to deploy to AKS

2. Configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Install Azure CLI
  run: pip install azure-cli
- name: Azure login
  id: login
  uses: azure/[email protected]
  with:
    creds: ${Secrets.AZURE_CREDENTIALS}
- name: Set AKS context
  id: set-context
  uses: azure/aks-set-context@v3
  with:
    resource-group: ${Secrets.RESOURCE_GROUP}
    cluster-name: ${Secrets.CLUSTER_NAME}
- name: Setup kubectl
  id: install-kubectl
  uses: azure/setup-kubectl@v3
- name: Deploy to AKS
  id: deploy-aks
  uses: Azure/k8s-deploy@v4
  with:
    namespace: default
    manifests: manifest.yaml
    pull-images: true

Github action configuration for deploying application to AKS

Figure 9 – Github action configuration

3. The workflow is now ready and can be validated by choosing ‘Validate’ and then saved to the repository by choosing ‘Commit’.
We have implemented an automated CI/CD workflow that builds the container image of the application (refer Figure 10), pushes the image to ECR, and deploys the application to AKS cluster. This CI/CD workflow is triggered as application code is pushed to the repository.

Automated CI/CD workflow

Figure 10 – Automated CI/CD workflow

Test the deployment

When the HTML application runs, Kubernetes exposes the application using a public facing load balancer. To find the external IP of the load balancer, connect to the AKS cluster and run the following command:

kubectl get service multicloud-container-deployment-service

The output of the above command should look like the image in Figure 11.

Output of kubectl get service command

Figure 11 – Output of kubectl get service

Paste the External IP into a browser to see the running HTML application as shown in Figure 12.

HTML application running successfully in AKS

Figure 12 – Application running in AKS

Cleanup

If you have been following along with the workflow described in the post, you should delete the resources you deployed so you do not continue to incur charges. First, delete the Amazon ECR repository using the AWS console. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project. There’s no cost associated with the CodeCatalyst project and you can continue using it. Finally, if you deployed the application on a new AKS cluster, delete the cluster from the Azure console. In case you deployed the application to an existing AKS cluster, run the following commands to delete the application resources.

kubectl delete deployment multicloud-container-deployment-app
kubectl delete services multicloud-container-deployment-service

Conclusion

In summary, this post showed how Amazon CodeCatalyst can help organizations deploy containerized workloads in a hybrid and multicloud environment. It demonstrated in detail how to set up and configure Amazon CodeCatalyst to deploy a containerized application to Azure Kubernetes Service, leveraging a CodeCatalyst workflow, and GitHub Actions. Learn more and get started with your Amazon CodeCatalyst journey!

If you have any questions or feedback, leave them in the comments section.

About Authors

AWS Fargate Enables Faster Container Startup using Seekable OCI

2023-07-17 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-fargate-enables-faster-container-startup-using-seekable-oci/

While developing with containers is becoming an increasingly popular way for deploying and scaling applications, there are still areas where improvements can be made. One of the main issues with scaling containerized applications is the long startup time, especially during scale up when newer instances need to be added. This issue can have a negative impact on the customer experience, for example when a website needs to scale out to serve additional traffic.

A research paper shows that container image downloads account for 76 percent of container startup time, but on average only 6.4 percent of the data is needed for the container to start doing useful work. Starting and scaling out containerized applications requires downloading container images from a remote container registry. This may introduce a non-trivial latency, as the entire image must be downloaded and unpacked before the applications can be started.

One solution to this problem is lazy loading (also known as asynchronous loading) container images. This approach downloads data from the container registry in parallel with the application startup, such as stargz-snapshotter, a project that aims to improve the overall container start time.

Last year, we introduced Seekable OCI (SOCI), a technology open sourced by Amazon Web Services (AWS) that enables container runtimes to implement lazy loading the container image to start applications faster without modifying the container images. As part of that effort, we open sourced SOCI Snapshotter, a snapshotter plugin that enables lazy loading with SOCI in containerd.

AWS Fargate Support for SOCI
Today, I’m excited to share that AWS Fargate now supports Seekable OCI (SOCI), which helps applications deploy and scale out faster by enabling containers to start without waiting to download the entire container image. At launch, this new capability is available for Amazon Elastic Container Service (Amazon ECS) applications running on AWS Fargate.

Here’s a quick look to show how AWS Fargate support for SOCI works:

SOCI works by creating an index (SOCI index) of the files within an existing container image. This index is a key enabler to launching containers faster, providing the capability to extract an individual file from a container image without having to download the entire image. Your applications no longer need to wait to complete pulling and unpacking a container image before your applications start running. This allows you to deploy and scale out applications more quickly and reduce the rollout time for application updates.

A SOCI index is generated and stored separately from the container images. This means that your container images don’t need to be converted to use SOCI, therefore not breaking secure hash algorithm (SHA)-based security, such as container image signing. The index is then stored in the registry alongside the container image. At release, AWS Fargate support for SOCI works with Amazon Elastic Container Registry (Amazon ECR).

When you use Amazon ECS with AWS Fargate to run your SOCI-indexed containerized images, AWS Fargate automatically detects if a SOCI index for the image exists and starts the container without waiting for the entire image to be pulled. This also means that AWS Fargate will still continue to run container images that don’t have SOCI indexes.

Let’s Get Started
There are two ways to create SOCI indexes for container images.

Use AWS SOCI Index Builder – AWS SOCI Index Builder is a serverless solution for indexing container images in the AWS Cloud. This AWS CloudFormation stack deploys an Amazon EventBridge rule to identify Amazon ECR action events and invoke an AWS Lambda function to match the defined filter. Then, another AWS Lambda function generates and pushes SOCI indexes to repositories in the Amazon ECR registry.
Create SOCI indexes manually – This approach provides more flexibility on in how the SOCI indexes are created, including for existing container images in Amazon ECR repositories. To create SOCI indexes, you can use the soci CLI provided by the soci-snapshotter project.

The AWS SOCI Index Builder provides you with an automated process to get started and build SOCI indexes for your container images. The sociCLI provides you with more flexibility around index generation and the ability to natively integrate index generation in your CI/CD pipelines.

In this article, I manually generate SOCI indexes using the soci CLI from the soci-snapshotter project.

Create a Repository and Push Container Images
First, I create an Amazon ECR repository called pytorch-socifor my container image using AWS CLI.

$ aws ecr create-repository --region us-east-1 --repository-name pytorch-soci

I keep the Amazon ECR URI output and define it as a variable to make it easier for me to refer to the repository in the next step.

$ ECRSOCIURI=xyz.dkr.ecr.us-east-1.amazonaws.com/pytorch-soci:latest

For the sample application, I use a PyTorch training (CPU-based) container image from AWS Deep Learning Containers. I use the nerdctl CLI to pull the container image because, by default, the Docker Engine stores the container image in the Docker Engine image store, not the containerd image store.

$ SAMPLE_IMAGE="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04" 
$ aws ecr get-login-password --region us-east-1 | sudo nerdctl login --username AWS --password-stdin xyz.dkr.ecr.ap-southeast-1.amazonaws.com
$ sudo nerdctl pull --platform linux/amd64 $SAMPLE_IMAGE

Then, I tag the container image for the repository that I created in the previous step.

$ sudo nerdctl tag $SAMPLE_IMAGE $ECRSOCIURI

Next, I need to push the container image into the ECR repository.

$ sudo nerdctl push $ECRSOCIURI

At this point, my container image is already in my Amazon ECR repository.

Create SOCI Indexes
Next, I need to create SOCI index.

A SOCI index is an artifact that enables lazy loading of container images. A SOCI index consists of 1) a SOCI index manifest and 2) a set of zTOCs. The following image illustrates the components in a SOCI index manifest, and how it refers to a container image manifest.

The SOCI index manifest contains the list of zTOCs and a reference to the image for which the manifest was generated. A zTOC, or table of contents for compressed data, consists of two parts:

TOC, a table of contents containing file metadata and the corresponding offset in the decompressed TAR archive.
zInfo, a collection of checkpoints representing the state of the compression engine at various points in the layer.

To learn more about the concept and term, please visit soci-snapshotter Terminology page.

Before I can create SOCI indexes, I need to install the sociCLI. To learn more about how to install the soci, visit Getting Started with soci-snapshotter.

To create SOCI indexes, I use the soci create command.

$ sudo soci create $ECRSOCIURI
layer sha256:4c6ec688ebe374ea7d89ce967576d221a177ebd2c02ca9f053197f954102e30b -> ztoc skipped
layer sha256:ab09082b308205f9bf973c4b887132374f34ec64b923deef7e2f7ea1a34c1dad -> ztoc skipped
layer sha256:cd413555f0d1643e96fe0d4da7f5ed5e8dc9c6004b0731a0a810acab381d8c61 -> ztoc skipped
layer sha256:eee85b8a173b8fde0e319d42ae4adb7990ed2a0ce97ca5563cf85f529879a301 -> ztoc skipped
layer sha256:3a1b659108d7aaa52a58355c7f5704fcd6ab1b348ec9b61da925f3c3affa7efc -> ztoc skipped
layer sha256:d8f520dcac6d926130409c7b3a8f77aea639642ba1347359aaf81a8b43ce1f99 -> ztoc skipped
layer sha256:d75d26599d366ecd2aa1bfa72926948ce821815f89604b6a0a49cfca100570a0 -> ztoc skipped
layer sha256:a429d26ed72a85a6588f4b2af0049ae75761dac1bb8ba8017b8830878fb51124 -> ztoc skipped
layer sha256:5bebf55933a382e053394e285accaecb1dec9e215a5c7da0b9962a2d09a579bc -> ztoc skipped
layer sha256:5dfa26c6b9c9d1ccbcb1eaa65befa376805d9324174ac580ca76fdedc3575f54 -> ztoc skipped
layer sha256:0ba7bf18aa406cb7dc372ac732de222b04d1c824ff1705d8900831c3d1361ff5 -> ztoc skipped
layer sha256:4007a89234b4f56c03e6831dc220550d2e5fba935d9f5f5bcea64857ac4f4888 -> ztoc sha256:0b4d78c856b7e9e3d507ac6ba64e2e2468997639608ef43c088637f379bb47e4
layer sha256:089632f60d8cfe243c5bc355a77401c9a8d2f415d730f00f6f91d44bb96c251b -> ztoc sha256:f6a16d3d07326fe3bddbdb1aab5fbd4e924ec357b4292a6933158cc7cc33605b
layer sha256:f18dd99041c3095ade3d5013a61a00eeab8b878ba9be8545c2eabfbca3f3a7f3 -> ztoc sha256:95d7966c964dabb54cb110a1a8373d7b88cfc479336d473f6ba0f275afa629dd
layer sha256:69e1edcfbd217582677d4636de8be2a25a24775469d677664c8714ed64f557c3 -> ztoc sha256:ac0e18bd39d398917942c4b87ac75b90240df1e5cb13999869158877b400b865

From the above output, I can see that sociCLI created zTOCs for four layers, which and this means only these four layers will be lazily pulled and the other container image layers will be downloaded in full before the container image starts. This is because there is less of a launch time impact in lazy loading very small container image layers. However, you can configure this behavior using the --min-layer-size flag when you run soci create.

Verify and Push SOCI Indexes
The soci CLI also provides several commands that can help you to review the SOCI Indexes that have been generated.

To see a list of all index manifests, I can run the following command.

$ sudo soci index list

DIGEST                                                                     SIZE    IMAGE REF                                                                                   PLATFORM       MEDIA TYPE                                    CREATED
sha256:ea5c3489622d4e97d4ad5e300c8482c3d30b2be44a12c68779776014b15c5822    1931    xyz.dkr.ecr.us-east-1.amazonaws.com/pytorch-soci:latest                                     linux/amd64    application/vnd.oci.image.manifest.v1+json    10m4s ago
sha256:ea5c3489622d4e97d4ad5e300c8482c3d30b2be44a12c68779776014b15c5822    1931    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04    linux/amd64    application/vnd.oci.image.manifest.v1+json    10m4s ago

While optional, if I need to see the list of zTOC, I can use the following command.

$ sudo soci ztoc list
DIGEST                                                                     SIZE        LAYER DIGEST
sha256:0b4d78c856b7e9e3d507ac6ba64e2e2468997639608ef43c088637f379bb47e4    2038072     sha256:4007a89234b4f56c03e6831dc220550d2e5fba935d9f5f5bcea64857ac4f4888
sha256:95d7966c964dabb54cb110a1a8373d7b88cfc479336d473f6ba0f275afa629dd    11442416    sha256:f18dd99041c3095ade3d5013a61a00eeab8b878ba9be8545c2eabfbca3f3a7f3
sha256:ac0e18bd39d398917942c4b87ac75b90240df1e5cb13999869158877b400b865    36277264    sha256:69e1edcfbd217582677d4636de8be2a25a24775469d677664c8714ed64f557c3
sha256:f6a16d3d07326fe3bddbdb1aab5fbd4e924ec357b4292a6933158cc7cc33605b    10152696    sha256:089632f60d8cfe243c5bc355a77401c9a8d2f415d730f00f6f91d44bb96c251b

This series of zTOCs contains all of the information that SOCI needs to find a given file in a layer. To review the zTOC for each layer, I can use one of the digest sums from the preceding output and use the following command.

$ sudo soci ztoc info sha256:0b4d78c856b7e9e3d507ac6ba64e2e2468997639608ef43c088637f379bb47e4
{
  "version": "0.9",
  "build_tool": "AWS SOCI CLI v0.1",
  "size": 2038072,
  "span_size": 4194304,
  "num_spans": 33,
  "num_files": 5552,
  "num_multi_span_files": 26,
  "files": [
    {
      "filename": "bin/",
      "offset": 512,
      "size": 0,
      "type": "dir",
      "start_span": 0,
      "end_span": 0
    },
    {
      "filename": "bin/bash",
      "offset": 1024,
      "size": 1037528,
      "type": "reg",
      "start_span": 0,
      "end_span": 0
    }

---Trimmed for brevity---

Now, I need to use the following command to push all SOCI-related artifacts into the Amazon ECR.

$ PASSWORD=$(aws ecr get-login-password --region us-east-1)
$ sudo soci push --user AWS:$PASSWORD $ECRSOCIURI

If I go to my Amazon ECR repository, I can verify the index is created. Here, I can see that two additional objects are listed alongside my container image: a SOCI Index and an Image index. The image index allows AWS Fargate to look up SOCI indexes associated with my container image.

Understanding SOCI Performance
The main objective of SOCI is to minimize the required time to start containerized applications. To measure the performance of AWS Fargate lazy loading container images using SOCI, I need to understand how long it takes for my container images to start with SOCI and without SOCI.

To understand the duration needed for each container image to start, I can use metrics available from the DescribeTasks API on Amazon ECS. The first metric is createdAt, the timestamp for the time when the task was created and entered the PENDING state. The second metric is startedAt, the time when the task transitioned from the PENDING state to the RUNNING state.

For this, I have created another Amazon ECR repository using the same container image but without generating a SOCI index, called pytorch-without-soci. If I compare these container images, I have two additional objects in pytorch-soci(an image index and a SOCI index) that don’t exist in pytorch-without-soci.

Deploy and Run Applications
To run the applications, I have created an Amazon ECS cluster called demo-pytorch-soci-cluster, a VPC and the required ECS task execution role. If you’re new to Amazon ECS, you can follow Getting started with Amazon ECS to be more familiar with how to deploy and run your containerized applications.

Now, let’s deploy and run both the container images with FARGATE as the launch type. I deﬁne five tasks for each pytorch-sociand pytorch-without-soci.

$ aws ecs \ 
    --region us-east-1 \ 
    run-task \ 
    --count 5 \ 
    --launch-type FARGATE \ 
    --task-definition arn:aws:ecs:us-east-1:XYZ:task-definition/pytorch-soci \ 
    --cluster socidemo 

$ aws ecs \ 
    --region us-east-1 \ 
    run-task \ 
    --count 5 \ 
    --launch-type FARGATE \ 
    --task-definition arn:aws:ecs:us-east-1:XYZ:task-definition/pytorch-without-soci \ 
    --cluster socidemo

After a few minutes, there are 10 running tasks on my ECS cluster.

After verifying that all my tasks are running, I run the following script to get two metrics: createdAt and startedAt.

#!/bin/bash
CLUSTER=<CLUSTER_NAME>
TASKDEF=<TASK_DEFINITION>
REGION="us-east-1"
TASKS=$(aws ecs list-tasks \
    --cluster $CLUSTER \
    --family $TASKDEF \
    --region $REGION \
    --query 'taskArns[*]' \
    --output text)

aws ecs describe-tasks \
    --tasks $TASKS \
    --region $REGION \
    --cluster $CLUSTER \
    --query "tasks[] | reverse(sort_by(@, &createdAt)) | [].[{startedAt: startedAt, createdAt: createdAt, taskArn: taskArn}]" \
    --output table

Running the above command for the container image without SOCI indexes — pytorch-without-soci— produces following output:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                                   DescribeTasks                                                                                   |
+----------------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------+
|             createdAt            |             startedAt             |                                                  taskArn                                                   |
+----------------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------+
|  2023-07-07T17:43:59.233000+00:00|  2023-07-07T17:46:09.856000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/dcdf19b6e66444aeb3bc607a3114fae0   |
|  2023-07-07T17:43:59.233000+00:00|  2023-07-07T17:46:09.459000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/9178b75c98ee4c4e8d9c681ddb26f2ca   |
|  2023-07-07T17:43:59.233000+00:00|  2023-07-07T17:46:21.645000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/7da51e036c414cbab7690409ce08cc99   |
|  2023-07-07T17:43:59.233000+00:00|  2023-07-07T17:46:00.606000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/5ee8f48194874e6dbba75a5ef753cad2   |
|  2023-07-07T17:43:59.233000+00:00|  2023-07-07T17:46:02.461000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/58531a9e94ed44deb5377fa997caec36   |
+----------------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------+

From the average aggregated delta time (between startedAt and createdAt) for each task, the pytorch-without-soci (without SOCI indexes) successfully ran after 129 seconds.

Next, I’m running same command but for pytorch-sociwhich comes with SOCI indexes.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                                   DescribeTasks                                                                                   |
+----------------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------+
|             createdAt            |             startedAt             |                                                  taskArn                                                   |
+----------------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------+
|  2023-07-07T17:43:53.318000+00:00|  2023-07-07T17:44:51.076000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/c57d8cff6033494b97f6fd0e1b797b8f   |
|  2023-07-07T17:43:53.318000+00:00|  2023-07-07T17:44:52.212000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/6d168f9e99324a59bd6e28de36289456   |
|  2023-07-07T17:43:53.318000+00:00|  2023-07-07T17:45:05.443000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/4bdc43b4c1f84f8d9d40dbd1a41645da   |
|  2023-07-07T17:43:53.318000+00:00|  2023-07-07T17:44:50.618000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/43ea53ea84154d5aa90f8fdd7414c6df   |
|  2023-07-07T17:43:53.318000+00:00|  2023-07-07T17:44:50.777000+00:00 |  arn:aws:ecs:ap-southeast-1:xyz:task/demo-pytorch-soci-cluster/0731bea30d42449e9006a5d8902756d5   |
+----------------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------+

Here, I see my container image with SOCI-enabled — pytorch-soci — was started 60 seconds after being created.

This means that running my sample application with SOCI indexes on AWS Fargate is approximately 50 percent faster compared to running without SOCI indexes.

It’s recommended to benchmark the startup and scaling-out time of your application with and without SOCI. This helps you to have a better understanding of how your application behaves and if your applications benefit from AWS Fargate support for SOCI.

Customer Voices
During the private preview period, we heard lots of feedback from our customers about AWS Fargate support for SOCI. Here’s what our customers say:

Autodesk provides critical design, make, and operate software solutions across the architecture, engineering, construction, manufacturing, media, and entertainment industries. “SOCI has given us a 50% improvement in startup performance for our time-sensitive simulation workloads running on Amazon ECS with AWS Fargate. This allows our application to scale out faster, enabling us to quickly serve increased user demand and save on costs by reducing idle compute capacity. The AWS Partner Solution for creating the SOCI index is easy to configure and deploy.” – Boaz Brudner, Head of Innovyze SaaS Engineering, AI and Architecture, Autodesk.

Flywire is a global payments enablement and software company, on a mission to deliver the world’s most important and complex payments. “We run multi-step deployment pipelines on Amazon ECS with AWS Fargate which can take several minutes to complete. With SOCI, the total pipeline duration is reduced by over 50% without making any changes to our applications, or the deployment process. This allowed us to drastically reduce the rollout time for our application updates. For some of our larger images of over 750MB, SOCI improved the task startup time by more than 60%.”, Samuel Burgos, Sr. Cloud Security Engineer, Flywire.

Virtuoso is a leading software corporation that makes functional UI and end-to-end testing software. “SOCI has helped us reduce the lag between demand and availability of compute. We have very bursty workloads which our customers expect to start as fast as possible. SOCI helps our ECS tasks spin-up 40% faster, allowing us to quickly scale our application and reduce the pool of idle compute capacity, enabling us to deliver value more efficiently. Setting up SOCI was really easy. We opted to use the quick-start AWS Partner’s solution with which we could leave our build and deployment pipelines untouched.”, Mathew Hall, Head of Site Reliability Engineering, Virtuoso.

Things to Know
Availability — AWS Fargate support for SOCI is available in all AWS Regions where Amazon ECS, AWS Fargate, and Amazon ECR are available.

Pricing — AWS Fargate support for SOCI is available at no additional cost and you will only be charged for storing the SOCI indexes in Amazon ECR.

Get Started — Learn more about benefits and how to get started on the AWS Fargate Support for SOCI page.

Happy building.
— Donnie

Let’s Architect! Multi-tenant SaaS architectures

2023-06-07 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-multi-tenant-saas-architectures/

In a multi-tenant architecture multiple instances of an application run on a shared infrastructure. With this type of approach, each tenant is isolated from others, typically through logical separation, while utilizing a shared infrastructure. This allows multiple tenants to use the same application and maintain their data security, privacy, and customization requirements.

Understanding architectural patterns for multi-tenancy has become crucial for architects and developers aiming to deliver scalable, secure, and cost-effective solutions. Isolating tenant data is a fundamental responsibility for Software as a Service (SaaS) providers. In this edition of Let’s Architect!, we talk about comprehensive exploration of multi-tenant architectures, covering various aspects, such as SaaS microservices, SaaS serverless, SaaS EKS, and an insightful whitepaper.

SaaS microservices deep dive: Simplifying multi-tenant development

In this session, Michael Beardsley, Principal Solutions Architect at AWS, takes a deep dive into the realm of multi-tenant microservices, exploring various patterns and strategies that enable the seamless implementation of multi-tenant microservices, all while ensuring that additional complexity is not imposed upon the SaaS builders. He shares practical patterns to simplify the development process by addressing crucial aspect, such as authorization, data access, tenant isolation, metrics, billing, logging, and a plethora of other considerations; this is irrespective of the chosen compute platform (like Amazon Elastic Container Service, Amazon Elastic Kubernetes Service [Amazon EKS], or AWS Lambda) or database solution.

There is another session available that highlights specific techniques and architecture strategies that can directly impact the success of a SaaS business. If you’re interested in learning more about optimizing multi-tenant SaaS architecture, this session is a great opportunity.

Take me to this video!

SaaS multi-tenant microservices

Building a Multi-Tenant SaaS Solution Using AWS Serverless Services

In this AWS Partner Network (APN) Blog post, you will explore a reference solution that presents a comprehensive perspective on a functional multi-tenant serverless SaaS environment. This solution effectively showcases various essential components required to construct a multi-tenant SaaS solution using serverless services, including onboarding processes, tenant isolation mechanisms, data partitioning techniques, a tenant deployment pipeline, and robust observability measures.

By delving into these aspects, you can gain valuable insights into the architecture and design considerations involved in creating a successful multi-tenant SaaS solution.

Take me to this AWS APN blogpost!

Tenant registration flow

Amazon EKS SaaS deep dive: A multi-tenant EKS SaaS solution

In this re:Invent 2021 presentation, Tod Golding, Principal Partner Solutions Architect, chats about a SaaS reference solution that addresses fundamental multi-tenant considerations, examining its approach to core SaaS topics, including tenant isolation, identity, onboarding, tenant administration, and data partitioning. The goal is to explore an Amazon EKS SaaS architecture through the lens of working code and highlight the key architectural strategies that were used in this reference environment.

There is also valuable information available on Github regarding EKS multi-tenancy. Exploring the Github repositories related to EKS multi-tenancy can provide further insights, resources, and practical examples for implementing multi-tenant architectures on EKS. This presentation is an engaging way to dive deeper into this topic and gain a more comprehensive understanding of best practices and real-world implementations.

Take me to this video!

Tenant deployment model

Saas Storage Strategies

Storage represents a challenging aspect of building and delivering multi-tenant software solutions. There are different strategies that can be used to partition tenant data, each with a unique set of trade-offs for implementing separation between tenants. This whitepaper covers different storage models for multi-tenancy; in particular, you can learn about the:

Silo model (data from the tenant is fully isolated)
Pool model (all the tenants use the same database and table)
Bridge model (single database but a different table for each tenant)

For each of these models, the whitepaper describes in detail how they can be implemented, as well as the different trade-offs in terms of isolation and agility. You can also discover how these tenancy models can be implemented specifically on databases, such as Amazon DynamoDB and Amazon Relational Database Service, thus covering both NoSQL and SQL scenarios.

Take me to this whitepaper!

Partitioning model tradeoffs

See you next time!

Thanks for joining our conversation on multi-tenant SaaS architectures! Next time, we’ll talk about open-source technologies.

To find all the blogs from this series, you can check out the Let’s Architect! list of content on the AWS Architecture Blog.

AWS Week in Review – Amazon Security Lake Now GA, New Actions on AWS Fault Injection Simulator, and More – June 5, 2023

2023-06-05 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/aws-week-in-review-amazon-security-lake-now-ga-new-actions-on-aws-fault-injection-simulator-and-more-june-5-2023/

Last Wednesday, I traveled to Cape Town to speak at the .Net Developer User Group. My colleague Francois Bouteruche also gave a talk but joined virtually. I enjoyed my time there—what an amazing community! Join the group in order to learn about upcoming events.

Now onto the AWS updates from last week. There was a lot of news related to AWS, and I have compiled a few announcements you need to know. Let’s get started!

Last Week’s Launches
Here are a few launches from last week that you might have missed:

Amazon Security Lake is now Generally Available – This service automatically centralizes security data from AWS environments, SaaS providers, on-premises environments, and cloud sources into a purpose-built data lake stored in your account, making it easier to analyze security data, gain a more comprehensive understanding of security across your entire organization, and improve the protection of your workloads, applications, and data. Read more in Channy’s post announcing the preview of Security Lake.

New AWS Direct Connect Location in Santiago, Chile – The AWS Direct Connect service lets you create a dedicated network connection to AWS. With this service, you can build hybrid networks by linking your AWS and on-premises networks to build applications that span environments without compromising performance. Last week we announced the opening of a new AWS Direct Connect location in Santiago, Chile. This new Santiago location offers dedicated 1 Gbps and 10 Gbps connections, with MACsec encryption available for 10 Gbps. For more information on over 115 Direct Connect locations worldwide, visit the locations section of the Direct Connect product detail pages.

New actions on AWS Fault Injection Simulator for Amazon EKS and Amazon ECS – Had it not been for Adrian Hornsby’s LinkedIn post I would have missed this announcement. We announced the expanded support of AWS Fault Injection Simulator (FIS) for Amazon Elastic Kubernetes Service (EKS) and Amazon Elastic Container Service (ECS). This expanded support adds additional AWS FIS actions for Amazon EKS and Amazon ECS. Learn more about Amazon ECS task actions here, and Amazon EKS pod actions here.

Other AWS News
A few more news items and blog posts you might have missed:

Autodesk Uses Sagemaker to Improve Observability – One of our customers, Autodesk, used AWS services including Amazon Sagemaker, Amazon Kinesis, and Amazon API Gateway to build a platform that enables development and deployment of near-real time personalization experiments by modeling and responding to user behavior data. All this delivered a dynamic, personalized experience for Autodesk’s customers. Read more about the story at AWS Customer Stories.

AWS DMS Serverless – We announced AWS DMS Serverless which lets you automatically provision and scale capacity for migration and data replication. Donnie wrote about this announcement here.

For AWS open-source news and updates, check out the latest newsletter curated by my colleague Ricardo Sueiras to bring you the most recent updates on open-source projects, posts, events, and more.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Upcoming AWS Events
We have the following upcoming events. These give you the opportunity to meet with other tech enthusiasts and learn:

AWS Silicon Innovation Day (June 21) – A one-day virtual event that will allow you to understand AWS Silicon and how you can use AWS’s unique silicon offerings to innovate. Learn more and register here.

AWS Global Summits – Sign up for the AWS Summit closest to where you live: London (June 7), Washington, DC (June 7–8), Toronto (June 14).

AWS Community Days – Join these community-led conferences where event logistics and content are planned, sourced, and delivered by community leaders: Chicago, Illinois (June 15), and Chile (July 1).

And with that, I end my very first Week in Review post, and this was such fun to write. Come back next Monday for another Week in Review!

– Veliswa x

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Let’s Architect! Designing serverless solutions

2023-05-10 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-serverless-solutions/

During his re:Invent 2022 keynote, Werner Vogels, AWS Vice President and Chief Technology Officer, emphasized the asynchronous nature of our world and the challenges associated with incorporating asynchronicity into our architectures. AWS serverless services can help users concentrate on the asynchronous aspects of their workloads, easing the execution of event-driven architectures and enabling the adoption of effective integration patterns for communication both within and beyond a bounded context.

In this edition of Let’s Architect!, we offer an in-depth exploration of the architecture of serverless AWS services, such as AWS Lambda. We also present a new workshop centered on design patterns employing serverless AWS services, which ultimately delivers valuable insights on implementing event-driven architectures within systems.

A closer look at AWS Lambda

This video is the perfect companion for those seeking to learn and master a Lambda architecture, empowering you to effectively leverage its capabilities in your workloads.

With the knowledge gained from this video, you will be well-equipped to design your functions’ code in a highly optimized manner, ensuring efficient performance and resource utilization. Furthermore, a comprehensive understanding of Lambda functions can help identify and apply the most suitable approach to cloud workloads, resulting in an agile and robust cloud infrastructure that meets a project’s unique requirements.

Take me to this video!

Discover how AWS Lambda functions work under the hood

Implementing an event-driven serverless story generation application with ChatGPT and DALL-E

This example of an event-driven serverless architecture showcases the power of leveraging AWS services and AI technologies to develop innovative solutions. Built upon a foundation of serverless services, including Amazon EventBridge, Amazon DynamoDB, Lambda, Amazon Simple Storage Service, and managed artificial intelligence (AI ) services like Amazon Polly, this architecture demonstrates the seamless capacity to create daily stories with a scheduled launch. By utilizing EventBridge scheduler, an Lambda function is initiated every night to generate new content. The integration of AI services, like ChatGPT and DALL-E, further elevates the solution, as their compatibility with the serverless model enables efficient and dynamic content creation. This case serves as a testament to the potential of combining event-driven serverless architectures, with cutting-edge AI technologies for inventive and impactful applications.

Take me to this Compute Blog post!

How to build an event-driven architecture with serverless AWS services integrating ChatGPT and DALL-E

AWS Workshop Studio: Serverless Patterns

The AWS Serverless Patterns workshop offers a comprehensive learning experience to enhance your understanding of architectural patterns applicable to serverless projects. Throughout the workshop, participants will delve into various patterns, such as synchronous and asynchronous implementations, tailored to meet the demands of modern serverless applications. This hands-on approach ensures a production-ready understanding, encompassing crucial topics like testing serverless workloads, establishing automation pipelines, and more. Take this workshop to elevate your serverless architecture knowledge!

Take me to the serverless workshop!

The high-level architecture of the workshops modules

Building Serverlesspresso: Creating event-driven architectures

Serverlesspresso is an event-driven, serverless workload that uses EventBridge and AWS Step Functions to coordinate events across microservices and support thousands of orders per day. This comprehensive session delves into design considerations, development processes, and valuable lessons learned from creating a production-ready solution. Discover practical patterns and extensibility options that contribute to a robust, scalable, and cost-effective application. Gain insights into combining EventBridge and Step Functions to address complex architectural challenges in larger applications.

Take me to this video!

How to leverage AWS Step Functions for orchestrating your workflows

See you next time!

Thanks for joining our conversation on serverless solutions! We’ll see you next time when we talk about AWS microservices.

Can’t get enough of the Let’s Architect! series? Visit the Let’s Architect! page of the AWS Architecture Blog!

Extending a serverless, event-driven architecture to existing container workloads

2023-05-04 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/extending-a-serverless-event-driven-architecture-to-existing-container-workloads/

This post is written by Dhiraj Mahapatro, Principal Specialist SA, and Sascha Moellering, Principal Specialist SA, and Emily Shea, WW Lead, Integration Services.

Many serverless services are a natural fit for event-driven architectures (EDA), as events invoke them and only run when there is an event to process. When building in the cloud, many services emit events by default and have built-in features for managing events. This combination allows customers to build event-driven architectures easier and faster than ever before.

The insurance claims processing sample application in this blog series uses event-driven architecture principles and serverless services like AWS Lambda, AWS Step Functions, Amazon API Gateway, Amazon EventBridge, and Amazon SQS.

When building an event-driven architecture, it’s likely that you have existing services to integrate with the new architecture, ideally without needing to make significant refactoring changes to those services. As services communicate via events, extending applications to new and existing microservices is a key benefit of building with EDA. You can write those microservices in different programming languages or running on different compute options.

This blog post walks through a scenario of integrating an existing, containerized service (a settlement service) to the serverless, event-driven insurance claims processing application described in this blog post.

Overview of sample event-driven architecture

The sample application uses a front-end to sign up a new user and allow the user to upload images of their car and driver’s license. Once signed up, they can file a claim and upload images of their damaged car. Previously, it did not yet integrate with a settlement service for completing the claims and settlement process.

In this scenario, the settlement service is a brownfield application that runs Spring Boot 3 on Amazon ECS with AWS Fargate. AWS Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building container applications without managing servers.

The Spring Boot application exposes a REST endpoint, which accepts a POST request. It applies settlement business logic and creates a settlement record in the database for a car insurance claim. Your goal is to make settlement work with the new EDA application that is designed for claims processing without re-architecting or rewriting. Customer, claims, fraud, document, and notification are the other domains that are shown as blue-colored boxes in the following diagram:

Project structure

The application uses AWS Cloud Development Kit (CDK) to build the stack. With CDK, you get the flexibility to create modular and reusable constructs imperatively using your language of choice. The sample application uses TypeScript for CDK.

The following project structure enables you to build different bounded contexts. Event-driven architecture relies on the choreography of events between domains. The object oriented programming (OOP) concept of CDK helps provision the infrastructure to separate the domain concerns while loosely coupling them via events.

You break the higher level CDK constructs down to these corresponding domains:

Application and infrastructure code are present in each domain. This project structure creates a seamless way to add new domains like settlement with its application and infrastructure code without affecting other areas of the business.

With the preceding structure, you can use the settlement-service.ts CDK construct inside claims-processing-stack.ts:

const settlementService = new SettlementService(this, "SettlementService", {
  bus,
});

The only information the SettlementService construct needs to work is the EventBridge custom event bus resource that is created in the claims-processing-stack.ts.

To run the sample application, follow the setup steps in the sample application’s README file.

Existing container workload

The settlement domain provides a REST service to the rest of the organization. A Docker containerized Spring Boot application runs on Amazon ECS with AWS Fargate. The following sequence diagram shows the synchronous request-response flow from an external REST client to the service:

External REST client makes POST /settlement call via an HTTP API present in front of an internal Application Load Balancer (ALB).
SettlementController.java delegates to SettlementService.java.
SettlementService applies business logic and calls SettlementRepository for data persistence.
SettlementRepository persists the item in the Settlement DynamoDB table.

A request to the HTTP API endpoint looks like:

curl --location <settlement-api-endpoint-from-cloudformation-output> \
--header 'Content-Type: application/json' \
--data '{
  "customerId": "06987bc1-1234-1234-1234-2637edab1e57",
  "claimId": "60ccfe05-1234-1234-1234-a4c1ee6fcc29",
  "color": "green",
  "damage": "bumper_dent"
}'

The response from the API call is:

You can learn more here about optimizing Spring Boot applications on AWS Fargate.

Extending container workload for events

To integrate the settlement service, you must update the service to receive and emit events asynchronously. The core logic of the settlement service remains the same. When you file a claim, upload damaged car images, and the application detects no document fraud, the settlement domain subscribes to Fraud.Not.Detected event and applies its business logic. The settlement service emits an event back upon applying the business logic.

The following sequence diagram shows a new interface in settlement to work with EDA. The settlement service subscribes to events that a producer emits. Here, the event producer is the fraud service that puts an event in an EventBridge custom event bus.

Producer emits Fraud.Not.Detected event to EventBridge custom event bus.
EventBridge evaluates the rules provided by the settlement domain and sends the event payload to the target SQS queue.
SubscriberService.java polls for new messages in the SQS queue.
On message, it transforms the message body to an input object that is accepted by SettlementService.
It then delegates the call to SettlementService, similar to how SettlementController works in the REST implementation.
SettlementService applies business logic. The flow is like the REST use case from 7 to 10.
On receiving the response from the SettlementService, the SubscriberService transforms the response to publish an event back to the event bus with the event type as Settlement.Finalized.

The rest of the architecture consumes this Settlement.Finalized event.

Using EventBridge schema registry and discovery

Schema enforces a contract between a producer and a consumer. A consumer expects the exact structure of the event payload every time an event arrives. EventBridge provides schema registry and discovery to maintain this contract. The consumer (the settlement service) can download the code bindings and use them in the source code.

Enable schema discovery in EventBridge before downloading the code bindings and using them in your repository. The code bindings provide a marshaller that unmarshals the incoming event from SQS queue to a plain old Java object (POJO) FraudNotDetected.java. You download the code bindings using the choice of your IDE. AWS Toolkit for IntelliJ makes it convenient to download and use them.

The final architecture for the settlement service with REST and event-driven architecture looks like:

Transition to become fully event-driven

With the new capability to handle events, the Spring Boot application now supports both the REST endpoint and the event-driven architecture by running the same business logic through different interfaces. In this example scenario, as the event-driven architecture matures and the rest of the organization adopts it, the need for the POST endpoint to save a settlement may diminish. In the future, you can deprecate the endpoint and fully rely on polling messages from the SQS queue.

You start with using an ALB and Fargate service CDK ECS pattern:

const loadBalancedFargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
  this,
  "settlement-service",
  {
    cluster: cluster,
    taskImageOptions: {
      image: ecs.ContainerImage.fromDockerImageAsset(asset),
      environment: {
        "DYNAMODB_TABLE_NAME": this.table.tableName
      },
      containerPort: 8080,
      logDriver: new ecs.AwsLogDriver({
        streamPrefix: "settlement-service",
        mode: ecs.AwsLogDriverMode.NON_BLOCKING,
        logRetention: RetentionDays.FIVE_DAYS,
      })
    },
    memoryLimitMiB: 2048,
    cpu: 1024,
    publicLoadBalancer: true,
    desiredCount: 2,
    listenerPort: 8080
  });

To adapt to EDA, you update the resources to retrofit the SQS queue to receive messages and EventBridge to put events. Add new environment variables to the ApplicationLoadBalancerFargateService resource:

environment: {
  "SQS_ENDPOINT_URL": queue.queueUrl,
  "EVENTBUS_NAME": props.bus.eventBusName,
  "DYNAMODB_TABLE_NAME": this.table.tableName
}

Grant the Fargate task permission to put events in the custom event bus and consume messages from the SQS queue:

props.bus.grantPutEventsTo(loadBalancedFargateService.taskDefinition.taskRole);
queue.grantConsumeMessages(loadBalancedFargateService.taskDefinition.taskRole);

When you transition the settlement service to become fully event-driven, you do not need the HTTP API endpoint and ALB anymore, as SQS is the source of events.

A better alternative is to use QueueProcessingFargateService ECS pattern for the Fargate service. The pattern provides auto scaling based on the number of visible messages in the SQS queue, besides CPU utilization. In the following example, you can also add two capacity provider strategies while setting up the Fargate service: FARGATE_SPOT and FARGATE. This means, for every one task that is run using FARGATE, there are two tasks that use FARGATE_SPOT. This can help optimize cost.

const queueProcessingFargateService = new ecs_patterns.QueueProcessingFargateService(this, 'Service', {
  cluster,
  memoryLimitMiB: 1024,
  cpu: 512,
  queue: queue,
  image: ecs.ContainerImage.fromDockerImageAsset(asset),
  desiredTaskCount: 2,
  minScalingCapacity: 1,
  maxScalingCapacity: 5,
  maxHealthyPercent: 200,
  minHealthyPercent: 66,
  environment: {
    "SQS_ENDPOINT_URL": queueUrl,
    "EVENTBUS_NAME": props?.bus.eventBusName,
    "DYNAMODB_TABLE_NAME": tableName
  },
  capacityProviderStrategies: [
    {
      capacityProvider: 'FARGATE_SPOT',
      weight: 2,
    },
    {
      capacityProvider: 'FARGATE',
      weight: 1,
    },
  ],
});

This pattern abstracts the automatic scaling behavior of the Fargate service based on the queue depth.

Running the application

To test the application, follow How to use the Application after the initial setup. Once complete, you see that the browser receives a Settlement.Finalized event:

{
  "version": "0",
  "id": "e2a9c866-cb5b-728c-ce18-3b17477fa5ff",
  "detail-type": "Settlement.Finalized",
  "source": "settlement.service",
  "account": "123456789",
  "time": "2023-04-09T23:20:44Z",
  "region": "us-east-2",
  "resources": [],
  "detail": {
    "settlementId": "377d788b-9922-402a-a56c-c8460e34e36d",
    "customerId": "67cac76c-40b1-4d63-a8b5-ad20f6e2e6b9",
    "claimId": "b1192ba0-de7e-450f-ac13-991613c48041",
    "settlementMessage": "Based on our analysis on the damage of your car per claim id b1192ba0-de7e-450f-ac13-991613c48041, your out-of-pocket expense will be $100.00."
  }
}

Cleaning up

The stack creates a custom VPC and other related resources. Be sure to clean up resources after usage to avoid the ongoing cost of running these services. To clean up the infrastructure, follow the clean-up steps shown in the sample application.

Conclusion

The blog explains a way to integrate existing container workload running on AWS Fargate with a new event-driven architecture. You use EventBridge to decouple different services from each other that are built using different compute technologies, languages, and frameworks. Using AWS CDK, you gain the modularity of building services decoupled from each other.

This blog shows an evolutionary architecture that allows you to modernize existing container workloads with minimal changes that still give you the additional benefits of building with serverless and EDA on AWS.

The major difference between the event-driven approach and the REST approach is that you unblock the producer once it emits an event. The event producer from the settlement domain that subscribes to that event is loosely coupled. The business functionality remains intact, and no significant refactoring or re-architecting effort is required. With these agility gains, you may get to the market faster

The sample application shows the implementation details and steps to set up, run, and clean up the application. The app uses ECS Fargate for a domain service, but you do not limit it to just Fargate. You can also bring container-based applications running on Amazon EKS similarly to event-driven architecture.

Learn more about event-driven architecture on Serverless Land.

Let’s Architect! Getting started with containers

2023-04-26 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-getting-started-with-containers/

Most of AWS customers building cloud-native applications or modernizing applications choose containers to run their microservices applications to accelerate innovation and time to market while lowering their total cost of ownership (TCO). Using containers in AWS comes with other benefits, such as increased portability, scalability, and flexibility.

The combination of containers technologies and AWS services also provides features such as load balancing, auto scaling, and service discovery, making it easier to deploy and manage applications at scale.

In this edition of Let’s Architect! we share useful resources to help you to get started with containers on AWS.

Container Build Lens

This whitepaper describes the Container Build Lens for the AWS Well-Architected Framework. It helps customers review and improve their cloud-based architectures and better understand the business impact of their design decisions. The document describes general design principles for containers, as well as specific best practices and implementation guidance using the Six Pillars of the Well-Architected Framework.

Take me to explore the Containers Build Lens!

Follow Containers Build Lens Best practices to architect your containers-based workloads.

EKS Workshop

The EKS Workshop is a useful resource to familiarize yourself with Amazon Elastic Kubernetes Service (Amazon EKS) by practicing on real use-cases. It is built to help users learn about Amazon EKS features and integrations with popular open-source projects. The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more. These are further broken down into standalone labs focusing on a particular feature, tool, or use case.

Once you’re done experimenting with EKS Workshop, start building your environments with Amazon EKS Blueprints, a collection of Infrastructure as Code (IaC) modules that helps you configure and deploy consistent, batteries-included Amazon EKS clusters across accounts and regions following AWS best practices. Amazon EKS Blueprints are available in both Terraform and CDK.

Take me to this workshop!

The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more.

Architecting for resiliency on AWS App Runner

Learn how to architect an highly available and resilient application using AWS App Runner. With App Runner, you can start with just the source code of your application or a container image. The complexity of running containerized applications is abstracted away, including the cloud resources needed for running your web application or API. App Runner manages load balancers, TLS certificates, auto scaling, logs, metrics, teachability and more, so you can focus on implementing your business logic in a highly scalable and elastic environment.

Take me to this blog post!

A high-level architecture for an available and resilient application with AWS App Runner

Securing Kubernetes: How to address Kubernetes attack vectors

As part of designing any modern system on AWS, it is necessary to think about the security implications and what can affect your security posture. This session introduces the fundamentals of the Kubernetes architecture and common attack vectors. It also includes security controls provided by Amazon EKS and suggestions on how to address them. With these strategies, you can learn how to reduce risk for your Kubernetes-based workloads.

Take me to this video!

Some common attack vectors that need addressing with Kubernetes

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about serverless.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Let’s Architect! Streamlining business with migration and modernization

2023-03-29 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-streamlining-business-with-migration-and-modernization/

Many customers migrate their systems to Amazon Web Services (AWS) to increase their competitive edge and drive business value. To maximize the benefits of a cloud migration, companies tend to move their applications in conjunction with modernization initiatives. These joined efforts help your applications gain more agility, scalability, and resilience. Modernizing the portfolio of workloads with AWS means that you can re-platform, refactor, or replace these workloads by using containers, serverless technologies, purpose-built data stores, and software automation. These functionalities allow you to benefit from the best of the AWS agility and total cost optimization (TCO) benefits.

In this edition of Let’s Architect! we share hands-on activities, customer stories, and tips and tricks to migrate and modernize your applications with AWS.

Migrating to the cloud: What is the cost of doing nothing?

Would you think that small companies always migrate faster than large enterprises? Actually, cloud migration speed doesn’t necessarily depend on the size of the business! Company size is not a clear indicator of migration and modernization success, but a shift of culture and mindset is essential for successful company evolution.

When it comes to migration, the cost of doing nothing is not just financial: Businesses can also expect a slower pace of innovation and a higher security burden. This video analyzes the financial benefits of migration and shares mental models for approaching an AWS cloud migration, and Marriott team members explain how they planned their migration and the lessons learned along the way.

Take me to this re:Invent 2022 video!

Benefits of an early migration start

Modernization pathways for a legacy .NET Framework monolithic application on AWS

Organizations aim to deliver the best technological solutions based on customer needs. At any stage in their cloud adoption journey, businesses often end up managing and building monolithic applications. Let’s explore a migration path for a monolithic .NET Framework application to a modern microservices-based stack on AWS, and discuss AWS tools to break the monolith into microservices and containerize applications.

Cost optimization is another key factor for modernizing your workloads and solutions include moving to Linux-based systems or using open-source database engines. This Migrate and Modernize enterprise workloads with AWS video walks you through the process of migrating and modernizing enterprise workloads with AWS.

Take me to this blog post with more detail!

A modernized microservices-based rearchitecture

Implementing a serverless-first strategy in an enterprise

Organizations of all sizes want to benefit from the agility, cost savings, and developer experience that serverless architectures can provide on AWS. For large enterprises, the return on investment (ROI) can be massive, but overcoming architecture inertia while ensuring security best practices and governance stay in place is a hurdle that many struggle with. In this lightning talk, learn how your organization can implement a serverless-first strategy to overcome these obstacles. Delta Air Lines shares the story of making serverless-first a reality as part of their AWS journey.

Take me to this video

Benefits of serverless

Application Migration with AWS

This workshop shows you how to migrate and modernize a fictional application to the AWS Cloud by:

Performing a database migration
Migrating and modernizing your web server using different migration strategies (for example, breaking down the monolith into containers)
Teaching you how to improve Operation excellence, Security, Performance efficiency, and Cost optimization of the deployed architecture by following these pillars of the AWS Well-Architected Framework.

Take me to this workshop!

Different migration strategies for web servers

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about distributed systems with containers.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Amazon Linux 2023, a Cloud-Optimized Linux Distribution with Long-Term Support

2023-03-15 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-linux-2023-a-cloud-optimized-linux-distribution-with-long-term-support/

I am excited to announce the general availability of Amazon Linux 2023 (AL2023). AWS has provided you with a cloud-optimized Linux distribution since 2010. This is the third generation of our Amazon Linux distributions.

Every generation of Amazon Linux distribution is secured, optimized for the cloud, and receives long-term AWS support. We built Amazon Linux 2023 on these principles, and we go even further. Deploying your workloads on Amazon Linux 2023 gives you three major benefits: a high-security standard, a predictable lifecycle, and a consistent update experience.

Let’s look at security first. Amazon Linux 2023 includes preconfigured security policies that make it easy for you to implement common industry guidelines. You can configure these policies at launch time or run time.

For example, you can configure the system crypto policy to enforce system-wide usage of a specific set of cipher suites, TLS versions, or acceptable parameters in certificates and key exchanges. Also, the Linux kernel has many hardening features enabled by default.

Amazon Linux 2023 makes it easier to plan and manage the operating system lifecycle. New Amazon Linux major versions will be available every two years. Major releases include new features and improvements in security and performance across the stack. The improvements might include major changes to the kernel, toolchain, GLib C, OpenSSL, and any other system libraries and utilities.

During those two years, a major release will receive an update every three months. These updates include security updates, bug fixes, and new features and packages. Each minor version is a cumulative list of updates that includes security and bug fixes in addition to new features and packages. These releases might include the latest language runtimes such as Python or Java. They might also include other popular software packages such as Ansible and Docker. In addition to these quarterly updates, security updates will be provided as soon as they are available.

Each major version, including 2023, will come with five years of long-term support. After the initial two-year period, each major version enters a three-year maintenance period. During the maintenance period, it will continue to receive security bug fixes and patches as soon as they are available. This support commitment gives you the stability you need to manage long project lifecycles.

The following diagram illustrates the lifecycle of Amazon Linux distributions:

Last—and this policy is by far my favorite—Amazon Linux provides you with deterministic updates through versioned repositories, a flexible and consistent update mechanism. The distribution locks to a specific version of the Amazon Linux package repository, giving you control over how and when you absorb updates. By default, and in contrast with Amazon Linux 2, a dnf update command will not update your installed packages (dnf is the successor to yum). This helps to ensure that you are using the same package versions across your fleet. All Amazon Elastic Compute Cloud (Amazon EC2) instances launched from an Amazon Machine Image (AMI) will have the same version of packages. Deterministic updates also promote usage of immutable infrastructure, where no infrastructure is updated after deployment. When an update is required, you update your infrastructure as code scripts and redeploy a new infrastructure. Of course, if you really want to update your distribution in place, you can point dnf to an updated package repository and update your machine as you do today. But did I tell you this is not a good practice for production workloads? I’ll share more technical details later in this blog post.

How to Get Started
Getting started with Amazon Linux 2023 is no different than with other Linux distributions. You can use the EC2 run-instances API, the AWS Command Line Interface (AWS CLI), or the AWS Management Console, and one of the four Amazon Linux 2023 AMIs that we provide. We support two machine architectures (x86_64 and Arm) and two sizes (standard and minimal). Minimal AMIs contain the most basic tools and utilities to start the OS. The standard version comes with the most commonly used applications and tools installed.

To retrieve the latest AMI ID for a specific Region, you can use AWS Systems Manager get-parameter API and query the /aws/service/ami-amazon-linux-latest/<alias> parameter.

Be sure to replace <alias> with one of the four aliases available:

For arm64 architecture (standard AMI): al2023-ami-kernel-default-arm64
For arm64 architecture (minimal AMI): al2023-ami-minimal-kernel-default-arm64
For x86_64 architecture (standard AMI): al2023-ami-kernel-default-x86_64
For x86_64 architecture (minimal AMI): al2023-ami-minimal-kernel-default-x86_64

For example, to search for the latest Arm64 full distribution AMI ID, I open a terminal and enter:

~ aws ssm get-parameters --region us-east-2 --names /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64
{
    "Parameters": [
        {
            "Name": "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64",
            "Type": "String",
            "Value": "ami-02f9b41a7af31dded",
            "Version": 1,
            "LastModifiedDate": "2023-02-24T22:54:56.940000+01:00",
            "ARN": "arn:aws:ssm:us-east-2::parameter/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64",
            "DataType": "text"
        }
    ],
    "InvalidParameters": []
}

To launch an instance, I use the run-instances API. Notice how I use Systems Manager resolution to dynamically lookup the AMI ID from the CLI.

➜ aws ec2 run-instances                                                                            \
       --image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64  \
       --key-name my_ssh_key_name                                                                   \
       --instance-type c6g.medium                                                                   \
       --region us-east-2 
{
    "Groups": [],
    "Instances": [
        {
          "AmiLaunchIndex": 0,
          "ImageId": "ami-02f9b41a7af31dded",
          "InstanceId": "i-0740fe8e23f903bd2",
          "InstanceType": "c6g.medium",
          "KeyName": "my_ssh_key_name",
          "LaunchTime": "2023-02-28T14:12:34+00:00",

...(redacted for brevity)
}

When the instance is launched, and if the associated security group allows SSH (TCP 22) connections, I can connect to the machine:

~ ssh [email protected]
Warning: Permanently added '3.145.19.213' (ED25519) to the list of known hosts.
   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\       Preview
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'
Last login: Tue Feb 28 14:14:44 2023 from 81.49.148.9
[ec2-user@ip-172-31-9-76 ~]$ uname -a
Linux ip-172-31-9-76.us-east-2.compute.internal 6.1.12-19.43.amzn2023.aarch64 #1 SMP Thu Feb 23 23:37:18 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

We also distribute Amazon Linux 2023 as Docker images. The Amazon Linux 2023 container image is built from the same software components that are included in the Amazon Linux 2023 AMI. The container image is available for use in any environment as a base image for Docker workloads. If you’re using Amazon Linux for applications in EC2, you can containerize your applications with the Amazon Linux container image.

These images are available from Amazon Elastic Container Registry (Amazon ECR) and from Docker Hub. Here is a quick demo to start a Docker container using Amazon Linux 2023 from Elastic Container Registry.

$ aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
Login Succeeded

~ docker run --rm -it public.ecr.aws/amazonlinux/amazonlinux:2023 /bin/bash
Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
b4265814d5cf: Pull complete 
Digest: sha256:bbd7a578cff9d2aeaaedf75eb66d99176311b8e3930c0430a22e0a2d6c47d823
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
bash-5.2# uname -a 
Linux 9d5b45e9f895 5.15.49-linuxkit #1 SMP PREEMPT Tue Sep 13 07:51:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
bash-5.2# exit

When pulling from Docker Hub, you can use this command to pull the image: docker pull amazonlinux:2023.

What Are the Main Differences Compared to Amazon Linux 2?
Amazon Linux 2023 has some differences compared to Amazon Linux 2. The documentation explains these differences in detail. The two differences I would like to focus on are dnf and the package management policies.

AL2023 comes with Fedora’s dnf, the successor to yum. But don’t worry, dnf provides similar commands as yum to search, install, or remove packages. Where you used to run the commands yum list or yum install httpd, you may now run dnf list or dnf install httpd. For convenience, we create a symlink for /usr/bin/yum, so you can run your scripts unmodified.

$ which yum
/usr/bin/yum
$ ls -al /usr/bin/yum
lrwxrwxrwx. 1 root root 5 Jun 19 18:06 /usr/bin/yum -> dnf-3

The biggest difference, in my opinion, is the deterministic updates through versioned repositories. By default, the software repository is locked to the AMI version. This means that a dnf update command will not return any new packages to install. Versioned repositories give you the assurance that all machines started from the same AMI ID are identical. Your infrastructure will not deviate from the baseline.

$ sudo dnf update 
Last metadata expiration check: 0:14:10 ago on Tue Feb 28 14:12:50 2023.
Dependencies resolved.
Nothing to do.
Complete!

Yes, but what if you want to update a machine? You have two options to update an existing machine. The cleanest one for your production environment is to create duplicate infrastructure based on new AMIs. As I mentioned earlier, we publish updates for every security fix and a consolidated update every three months for two years after the initial release. Each update is provided as a set of AMIs and their corresponding software repository.

For smaller infrastructure, such as test or development machines, you might choose to update the operating system or individual packages in place as well. This is a three-step process:

first, list the available updated software repositories;
second, point dnf to a specific software repository;
and third, update your packages.

To show you how it works, I purposely launched an EC2 instance with an “old” version of Amazon Linux 2023 from February 2023. I first run dnf check-release-update to list the available updated software repositories.

$ dnf check-release-update
WARNING:
  A newer release of "Amazon Linux" is available.

  Available Versions:

  Version 2023.0.20230308:
    Run the following command to upgrade to 2023.0.20230308:

      dnf upgrade --releasever=2023.0.20230308

    Release notes:
     https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes.html

Then, I might either update the full distribution using dnf upgrade --releasever=2023.0.20230308 or point dnf to the updated repository to select individual packages.

$ dnf check-update --releasever=2023.0.20230308

Amazon Linux 2023 repository                                                    28 MB/s |  11 MB     00:00
Amazon Linux 2023 Kernel Livepatch repository                                  1.2 kB/s | 243  B     00:00

amazon-linux-repo-s3.noarch                          2023.0.20230308-0.amzn2023                amazonlinux
binutils.aarch64                                     2.39-6.amzn2023.0.5                       amazonlinux
ca-certificates.noarch                               2023.2.60-1.0.amzn2023.0.1                amazonlinux
(redacted for brevity)
util-linux-core.aarch64 2.37.4-1.amzn2022.0.1 amazonlinux

Finally, I might run a dnf update <package_name> command to update a specific package.

This might look like overkill for a simple machine, but when managing enterprise infrastructure or large-scale fleets of instances, this facilitates the management of your fleet by ensuring that all instances run the same version of software packages. It also means that the AMI ID is now something that you can fully run through your CI/CD pipelines for deployment and that you have a way to roll AMI versions forward and backward according to your schedule.

Where is Fedora?
When looking for a base to serve as a starting point for Amazon Linux 2023, Fedora was the best choice. We found that Fedora’s core tenets (Freedom, Friends, Features, First) resonate well with our vision for Amazon Linux. However, Amazon Linux focuses on a long-term, stable OS for the cloud, which is a notable different release cycle and lifecycle than Fedora. Amazon Linux 2023 provides updated versions of open-source software, a larger variety of packages, and frequent releases.

Amazon Linux 2023 isn’t directly comparable to any specific Fedora release. The Amazon Linux 2023 GA version includes components from Fedora 34, 35, and 36. Some of the components are the same as the components in Fedora, and some are modified. Other components more closely resemble the components in CentOS Stream 9 or were developed independently. The Amazon Linux kernel, on its side, is sourced from the long-term support options that are on kernel.org, chosen independently from the kernel provided by Fedora.

Like every good citizen in the open-source community, we give back and contribute our changes to upstream distributions and sources for the benefit of the entire community. Amazon Linux 2023 itself is open source. The source code for all RPM packages that are used to build the binaries that we ship are available through the SRPM yum repository (sudo dnf install -y 'dnf-command(download)' && dnf download --source bash)

One More Thing: Amazon EBS Gp3 Volumes
Amazon Linux 2023 AMIs use gp3 volumes by default.

Gp3 is the latest generation general-purpose solid-state drive (SSD) volume for Amazon Elastic Block Store (Amazon EBS). Gp3 provides 20 percent lower storage costs compared to gp2. Gp3 volumes deliver a baseline performance of 3,000 IOPS and 125MB/s at any volume size. What I particularly like about gp3 volumes is that I can now provision performance independently of capacity. When using gp3 volumes, I can now increase IOPS and throughput without incurring charges for extra capacity that I don’t actually need.

With the availability of gp3-backed AL2023 AMIs, this is the first time a gp3-backed Amazon Linux AMI is available. Gp3-backed AMIs have been a common customer request since gp3 was launched in 2020. It is now available by default.

Price and Availability
Amazon Linux 2023 is provided at no additional charge. Standard Amazon EC2 and AWS charges apply for running EC2 instances and other services. This distribution includes full support for five years. When deploying on AWS, our support engineers will provide technical support according to the terms and conditions of your AWS Support plan. AMIs are available in all AWS Regions.

Amazon Linux is the most used Linux distribution on AWS, with hundreds of thousands of customers using Amazon Linux 2. Dozens of Independent Software Vendors (ISVs) and hardware partners are supporting Amazon Linux 2023 today. You can adopt this new version with the confidence that the partner tools you rely on are likely to be supported. We are excited about this release, which brings you an even higher level of security, a predictable release lifecycle, and a consistent update experience.

Now go build and deploy your workload on Amazon Linux 2023 today.

— seb

Developing portable AWS Lambda functions

2023-02-23 Pascal Vogel

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/developing-portable-aws-lambda-functions/

This blog post is written by Uri Segev, Principal Serverless Specialist Solutions Architect

When developing new applications or modernizing existing ones, you might face a dilemma: which compute technology to use? A serverless compute service such as AWS Lambda or maybe containers? Often, serverless can be the better approach thanks to automatic scaling, built-in high availability, and a pay-for-use billing model. However, you may hesitate to choose serverless for reasons such as:

Perceived higher cost or difficulty in estimating cost
It is a paradigm shift, which requires learning to bridge the knowledge gap
Misconceptions about Lambda capabilities and use cases
Concern that using Lambda will result in lock-in
Existing investments in non-serverless platforms and tooling

This blog post suggests best practices for developing portable Lambda functions that allow you to easily port your code to containers if you later choose to. By doing so, you can avoid lock-in and try out the serverless approach in a risk-free way.

Each section of this blog post describes what you need to consider when writing portable code and the steps needed to migrate this code from Lambda to containers, if you later choose to do so.

Best practices for portable Lambda functions

Separate business logic and Lambda handler

Lambda functions are event-driven in nature. When a specific event happens, it invokes the Lambda function by calling its handler method. The handler method receives an event object which contains information regarding the reason for the function invocation. Once the function execution completes, it returns from the handler method. Whatever is returned from the handler is the function’s return value.

To write portable code, we recommend using the handler method only as an interface between the Lambda runtime (event object) and the business logic. Using Hexagonal architecture terminology, the handler should be a driving adapter making calls into the port, which is the interface exposed by the business logic The handler should extract all required information from the event object and then call a separate method that implements the business logic.

When that method returns, the handler constructs the result in the format expected by the function invoker and returns it. We also recommend splitting the handler code and the business logic code into separate files. Should you choose to migrate to containers later, you simply migrate your business logic code files with no additional changes.

The following pseudocode shows a Lambda handler that extracts information from the event object and calls the business logic. Once the business logic is done, the handler places the response in the function’s return value:

import business_logic

# The Lambda handler extracts needed information from the event
# object and invokes the business logic
handler(event, context) {
  # Extract needed information from event object payload = event[‘payload’]

  # Invoke business logic
  result = do_some_logic(payload)
  
  # Construct result for API Gateway
  return {
    statusCode: 200,
	body: result
  }
}

The following pseudocode shows the business logic. It’s located in a separate file and is unaware that it is being invoked from a Lambda function. It is pure logic.

# This is the business logic. It knows nothing about who invokes it.
do_some_logic(data) {
result = "This is my result."
  return result
}

This approach also makes it easier to run unit tests on the business logic without the need to construct event objects and to invoke the Lambda handler.

If you migrate to containers later, you include the business logic files in your container with new interface code as described in the following section.

Event source integration

One benefit of Lambda functions is the event source integration. For instance, if you integrate Lambda with Amazon Simple Queue Service (Amazon SQS), the Lambda service will take care of polling the queue, invoking the Lambda function and deleting the messages from the queue when done. By using this integration, you need to write less boilerplate code. You can focus only on implementing business logic and not the integration with the event source.

The following pseudocode shows how the Lambda handler looks like for an SQS event source:

import business_logic

handler(event, context) {
  entries = []
  # Iterate over all the messages in the event object
  for message in event[‘Records’] {
    # Call the business logic to process a single message
    success = handle_message(message)

    # Start building the response
    if Not success {
      entries.append({
      'itemIdentifier': message['messageId']
      })
    }
  }

  # Notify Lambda about failed items.
  if (let(entries) > 0) {
    return {
      'batchItemFailures': entries
    }
  }
}

As you can see in the previous code, the Lambda function has almost no knowledge that it is being invoked from SQS. There are no SQS API calls. It only knows the structure of the event object, which is specific to SQS.

When moving to a container, the integration responsibility moves from the Lambda service to you, the developer. There are different event sources in AWS, and each of them will require a different approach for consuming events and invoking business logic. For example, if the event source is Amazon API Gateway, your application will need to create an HTTP server that listens on an HTTP port and waits for incoming requests in order to invoke the business logic.

If the event source is Amazon Kinesis Data Streams, your application will need to run a poller that reads records from the shards, keep track of processed records, handle the case of a change in the number of shards in the stream, retry on errors, and more. Regardless of the event source, if you follow the previous recommendations, you will not need to change anything in the business logic code.

The following pseudocode shows how the integration with SQS will look like in a container. Note that you will lose some features such as batching, filtering, and, of course, automatic scaling.

import aws_sdk
import business_logic

QUEUE_URL = os.environ['QUEUE_URL']
BATCH_SIZE = os.environ.get('BATCH_SIZE', 1)
sqs_client = aws_sdk.client('sqs')

main() {
  # Infinite loop to poll for messages from SQS
  while True {

    # Receive a batch of messages from the queue
    response = sqs_client.receive_message(
      QueueUrl = QUEUE_URL,
      MaxNumberOfMessages = BATCH_SIZE,
      WaitTimeSeconds = 20 )

    # Loop over the messages in the batch
    entries = []
    i = 1
    for message in response.get('Messages',[]) {
      # Process a single message
      success = handle_message(message)

      # Append the message handle to an array that is later
      # used to delete processed messages
      if success {
        entries.append(
          {
            'Id': f'index{i}',
            'ReceiptHandle': message['receiptHandle']
          }
        )
        i += 1
      }
    }

    # Delete all the processed messages
    if (len(entries) > 0) {
      sqs_client.delete_message_batch(
        QueueUrl = QUEUE_URL,
        Entries = entries
      )
    }
  }
}

Another point to consider here is Lambda destinations. If your function is invoked asynchronously and you configured a destination for your function, you will need to include that in the interface code. It will need to catch any business logic error and, based on that, invoke the right destination.

Package functions as containers

Lambda supports packaging functions as .zip files and container images. To develop portable code, we recommend using container images as your default packaging method. Even though you package the function as a container image, you can’t run it on other container platforms such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (EKS). However, by packaging it this way, the migration to containers later will be easier as you are already using the same tools and you already created a Dockerfile that will require minimal changes.

An example Dockerfile for Lambda looks like this:

FROM public.ecr.aws/lambda/python:3.9
COPY *.py requirements.txt ./
RUN python3.9 -m pip install -r requirements.txt -t .
CMD ["app.lambda_handler"]

If you move to containers later, you will need to change the Dockerfile to use a different base image and adapt the CMD line that defines how to start the application. This is in addition to the code changes described in the previous section.

The corresponding Dockerfile for the container will look like this:

FROM python:3.9
COPY *.py requirements.txt ./
RUN python3.9 -m pip install -r requirements.txt -t .
CMD ["python", "./app.py"]

The deployment pipeline also needs to change as we deploy to a different target. However, building the artifacts remains the same.

Single invocation per instance

Lambda functions run in their own isolated runtime environment. Each environment handles a single request at a time which works great for Lambda. However, if you migrate your application to containers, you will likely invoke the business logic from multiple threads in a single process at the same time.

This section discusses aspects of moving from a single invocation to multiple concurrent invocations within the same process.

Static variables

Static variables are those that are instantiated once and then reused across multiple invocations. Examples of such variables are database connections or configuration information.

For function optimization, and specifically for reducing cold starts and the duration of warm function invocations, we recommend initializing all static variables outside the function handler and storing them in global variables so that further invocations will reuse them.

We recommend using an initialization function that you write as part of the business logic module and that you invoke from outside the handler. This function saves information in global variables that the business logic code reuses across invocations.

The following pseudocode shows the Lambda function:

import business_logic

# Call the initialization code
initialize()

handler(event, context) {
  ...
  # Call the business logic
  ...
}

And the business logic code will look like this:

# Global variables used to store static data
var config

initialize() {
  config = read_Config()
}

do_some_logic(data) {
  # Do something with config object
  ...
}

The same also applies to containers. You will usually initialize static variables when the process starts and not for every single request. When moving to containers, all you need to do is call the initialization function before starting the main application loop.

import business_logic

# Call the initialization code
initialize()

main() {
  while True {
    ...
    # Call the business logic
    ...
  }
}

As you can see, there are no changes in the business logic code.

Database connections

As Lambda functions share nothing between the runtime environments, unlike containers they can’t rely on connection pools when connecting to a relational database. For this reason, we created Amazon RDS Proxy, which acts as a centralized connection pool used by many functions.

To write portable Lambda functions, we recommend using a connection pool object with a single connection. Your business logic code will always ask for a connection from the pool when making a database request. You will still need to use RDS Proxy.

If you later move to containers, you can increase the number of connections in the pool to a larger number with no further changes and the application will scale without overwhelming the database.

File system

Lambda functions come with a writable /tmp folder in the size of 512 MB to 10 GB. As each function instance runs in an isolated runtime environment, developers usually use fixed file names for files stored in that folder. If you run the same business logic code in a container in multiple threads, the different threads will overwrite the files created by others.

We recommended using unique file names in each invocation. Append a UUID or another random number to the file name. Delete the files once you are done with them to avoid running out of space.

If you move your code to containers later, there is nothing to do.

Portable web applications

If you develop a web application, there is another way to achieve portability. You can use the AWS Lambda Web Adapter project to host a web app inside a Lambda function. This way you can develop a web application with familiar frameworks (e.g., Express.js, Next.js, Flask, Spring Boot, Laravel, or anything that uses HTTP 1.1/1.0), and run it on Lambda. If you package your web application as a container, the same Docker image can run on Lambda (using the web adapter) and containers.

Porting from containers to Lambda

This blog post demonstrates how to develop portable Lambda functions you can easily port to containers. Taking these recommendations into consideration can also help develop portable code in general, which allows you to port containers to Lambda functions.

Some things to consider:

Separate the business logic from the interface code in the container. The interface code should interact with the event sources and invoke the business logic.
As Lambda functions only have a /tmp writable folder, replicate this in your containers (even though you could write to different locations).

Conclusion

This blog post suggests best practices for developing Lambda functions that allow you to gain the benefits of a serverless approach without risking lock-in.

By following these best practices for separating business logic from Lambda handlers, packaging functions as containers, handling Lambda’s single invocation per instance, and more, you can develop portable Lambda functions. As a consequence, you will be able to port your code from Lambda to containers with minimal effort if you choose to move to containers later.

Refer to these best practices and code samples to ease the adoption of a serverless approach when developing your next application.

For more serverless learning resources, visit Serverless Land.

How to run AWS CloudHSM workloads in container environments

2023-01-25 Derek Tumulak

Post Syndicated from Derek Tumulak original https://aws.amazon.com/blogs/security/how-to-run-aws-cloudhsm-workloads-on-docker-containers/

January 25, 2023: We updated this post to reflect the fact that CloudHSM SDK3 does not support serverless environments and we strongly recommend deploying SDK5.

AWS CloudHSM provides hardware security modules (HSMs) in the AWS Cloud. With CloudHSM, you can generate and use your own encryption keys in the AWS Cloud, and manage your keys by using FIPS 140-2 Level 3 validated HSMs. Your HSMs are part of a CloudHSM cluster. CloudHSM automatically manages synchronization, high availability, and failover within a cluster.

CloudHSM is part of the AWS Cryptography suite of services, which also includes AWS Key Management Service (AWS KMS), AWS Secrets Manager, and AWS Private Certificate Authority (AWS Private CA). AWS KMS, Secrets Manager, and AWS Private CA are fully managed services that are convenient to use and integrate. You’ll generally use CloudHSM only if your workload requires single-tenant HSMs under your own control, or if you need cryptographic algorithms or interfaces that aren’t available in the fully managed alternatives.

CloudHSM offers several options for you to connect your application to your HSMs, including PKCS#11, Java Cryptography Extensions (JCE), OpenSSL Dynamic Engine, or Microsoft Cryptography API: Next Generation (CNG). Regardless of which library you choose, you’ll use the CloudHSM client to connect to HSMs in your cluster.

In this blog post, I’ll show you how to use Docker to develop, deploy, and run applications by using the CloudHSM SDK, and how to manage and orchestrate workloads by using tools and services like Amazon Elastic Container Service (Amazon ECS), Kubernetes, Amazon Elastic Kubernetes Service (Amazon EKS), and Jenkins.

Solution overview

This solution demonstrates how to create a Docker container that uses the CloudHSM JCE SDK to generate a key and use it to encrypt and decrypt data.

Note: In this example, you must manually enter the crypto user (CU) credentials as environment variables when you run the container. For production workloads, you’ll need to consider how to secure and automate the handling and distribution of these credentials. You should work with your security or compliance officer to ensure that you’re using an appropriate method of securing HSM login credentials. For more information on securing credentials, see AWS Secrets Manager.

Figure 1 shows the solution architecture. The Java application, running in a Docker container, integrates with JCE and communicates with CloudHSM instances in a CloudHSM cluster through HSM elastic network interfaces (ENIs). The Docker container runs in an EC2 instance, and access to the HSM ENIs is controlled with a security group.

Figure 1: Architecture diagram

Prerequisites

To implement this solution, you need to have working knowledge of the following items:

CloudHSM
Docker 20.10.17 – used at the time of this post
Java 8 or Java 11 – supported at the time of this post
Maven 3.05 – used at the time of this post

Here’s what you’ll need to follow along with my example:

An active CloudHSM cluster with at least one active HSM instance. You can follow the CloudHSM getting started guide to create, initialize, and activate a CloudHSM cluster.

Note: For a production cluster, you should have at least two active HSM instances spread across Availability Zones in the Region.
An Amazon Linux 2 EC2 instance in the same virtual private cloud (VPC) in which you created your CloudHSM cluster. The Amazon Elastic Compute Cloud (Amazon EC2) instance must have the CloudHSM cluster security group attached—this security group is automatically created during the cluster initialization and is used to control network access to the HSMs. To learn about attaching security groups to allow EC2 instances to connect to your HSMs, see Create a cluster in the AWS CloudHSM User Guide.
A CloudHSM crypto user (CU) account. You can create a CU by following the steps in the topic Managing HSM users in AWS CloudHSM in the AWS CloudHSM User Guide.

Solution details

In this section, I’ll walk you through how to download, configure, compile, and run a solution in Docker.

To set up Docker and run the application that encrypts and decrypts data with a key in AWS CloudHSM

On your Amazon Linux EC2 instance, install Docker by running the following command.
# sudo yum -y install docker
Start the docker service.
# sudo service docker start
Create a new directory and move to it. In my example, I use a directory named cloudhsm_container. You’ll use the new directory to configure the Docker image.
# mkdir cloudhsm_container # cd cloudhsm_container
Copy the CloudHSM cluster’s trust anchor certificate (customerCA.crt) to the directory that you just created. You can find the trust anchor certificate on a working CloudHSM client instance under the path /opt/cloudhsm/etc/customerCA.crt. The certificate is created during initialization of the CloudHSM cluster and is required to connect to the CloudHSM cluster. This enables our application to validate that the certificate presented by the CloudHSM cluster was signed by our trust anchor certificate.
In your new directory (cloudhsm_container), create a new file with the name run_sample.sh that includes the following contents. The script runs the Java class that is used to generate an Advanced Encryption Standard (AES) key to encrypt and decrypt your data.
```
#! /bin/bash

# start application
echo -e "\n* Entering AES GCM encrypt/decrypt sample in Docker ... \n"

java -ea -jar target/assembly/aesgcm-runner.jar -method environment

echo -e "\n* Exiting AES GCM encrypt/decrypt sample in Docker ... \n"
```
In the new directory, create another new file and name it Dockerfile (with no extension). This file will specify that the Docker image is built with the following components:
- The CloudHSM client package.
- The CloudHSM Java JCE package.
- OpenJDK 1.8 (Java 8). This is needed to compile and run the Java classes and JAR files.
- Maven, a build automation tool that is needed to assist with building the Java classes and JAR files.
- The AWS CloudHSM Java JCE samples that will be downloaded and built as part of the solution.

Cut and paste the following contents into Dockerfile.

Note: You will need to customize your Dockerfile, as follows:

Make sure to specify the SDK version to replace the one specified in the pom.xml file in the sample code. As of the writing of this post, the most current version is 5.7.0. To find the SDK version, follow the steps in the topic Check your client SDK version. For more information, see the Building section in the README file for the Cloud HSM JCE examples.

Make sure to update the HSM_IP line with the IP of an HSM in your CloudHSM cluster. You can get your HSM IPs from the CloudHSM console, or by running the describe-clusters AWS CLI command.

	# Use the amazon linux image
	FROM amazonlinux:2

	# Pass HSM IP address as a build argument
	ARG HSM_IP

	# Install CloudHSM client
	RUN yum install -y https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-jce-latest.el7.x86_64.rpm

	# Install Java, Maven, wget, unzip and ncurses-compat-libs
	RUN yum install -y java maven wget unzip ncurses-compat-libs
        
	# Create a work dir
	WORKDIR /app
        
	# Download sample code
	RUN wget https://github.com/aws-samples/aws-cloudhsm-jce-examples/archive/refs/heads/sdk5.zip
        
	# unzip sample code
	RUN unzip sdk5.zip
       
	# Change to the create directory
	WORKDIR aws-cloudhsm-jce-examples-sdk5

# Build JAR files using the installed CloudHSM JCE Provider version
RUN export CLOUDHSM_CLIENT_VERSION=`rpm -qi cloudhsm-jce | awk -F': ' '/Version/ {print $2}'` \
        && mvn validate -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION \
        && mvn clean package -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION
        
  # Configure cloudhsm-client
  COPY customerCA.crt /opt/cloudhsm/etc/
  RUN /opt/cloudhsm/bin/configure-jce -a $HSM_IP
       
  # Copy the run_sample.sh script
  COPY run_sample.sh .
        
  # Run the script
  CMD ["bash","run_sample.sh"]

Now you’re ready to build the Docker image. Run the following command, with the name jce_sample. This command will let you use the Dockerfile that you created in step 6 to create the image.
# sudo docker build --build-arg HSM_IP=”<your HSM IP address>” -t jce_sample .
To run a Docker container from the Docker image that you just created, run the following command. Make sure to replace the user and password with your actual CU username and password. (If you need help setting up your CU credentials, see prerequisite 3. For more information on how to provide CU credentials to the AWS CloudHSM Java JCE Library, see Providing credentials to the JCE provider in the CloudHSM User Guide).
# sudo docker run --env HSM_USER=<user> --env HSM_PASSWORD=<password> jce_sample

If successful, the output should look like this:
```
	* Entering AES GCM encrypt/decrypt sample in Docker ... 

	737F92D1B7346267D329C16E
	Successful decryption

	* Exiting AES GCM encrypt/decrypt sample in Docker ...
```

Conclusion

This solution provides an example of how to run CloudHSM client workloads in Docker containers. You can use the solution as a reference to implement your cryptographic application in a way that benefits from the high availability and load balancing built in to CloudHSM without compromising the flexibility that Docker provides for developing, deploying, and running applications.

If you have comments about this post, submit them in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Monitoring Kubernetes with Zabbix

2023-01-24 Michaela DeForest

Post Syndicated from Michaela DeForest original https://blog.zabbix.com/monitoring-kubernetes-with-zabbix/25055/

There are many options available for monitoring Kubernetes and cloud-native applications. In this multi-part blog series, we’ll explore how to use Zabbix to monitor a Kubernetes cluster and understand the metrics generated within Zabbix. We’ll also learn how to exploit Prometheus endpoints exposed by applications to monitor application-specific metrics.

Want to see Kubernetes monitoring in action? Watch the step-by-step Zabbix Kubernetes monitoring configuration and deployment guide.

Why Choose Zabbix to Monitor Kubernetes?

Before choosing Zabbix as a Kubernetes monitoring tool, we asked ourselves, “why would we choose to use Zabbix rather than Prometheus, Grafana, and alertmanager?” After all, they have become the standard monitoring tools in the cloud ecosystem. We decided that our minimum criteria for Zabbix would be that it was just as effective as Prometheus for monitoring both Kubernetes and cloud-native applications.

Through our discovery process, we concluded that Zabbix meets (and exceeds) this minimum requirement. Zabbix provides similar metrics and triggers as Prometheus, alert manager, and Grafana for Kubernetes, as they both use the same backend tools to do this. However, Zabbix can do this in one product while still maintaining flexibility and allowing you to monitor pretty much anything you can write code to collect. Regarding application monitoring, Zabbix can transform Prometheus metrics fed to it by Prometheus exporters and endpoints. In addition, because Zabbix can make calls to any HTTP endpoint, it can monitor applications that do not have a dedicated Prometheus endpoint, unlike Prometheus.

The Zabbix Helm Chart

Zabbix monitors Kubernetes by collecting metrics exposed via the Kubernetes API and kube-state-metrics. The components necessary to monitor a cluster are installed within the cluster using this helm chart provided by Zabbix. The helm chart includes the Zabbix agent installed as a daemon set and is used to monitor local resources and applications on each node. A Zabbix proxy is also installed to collect monitoring data and transfer it to the external Zabbix server.

Only the Zabbix proxy needs access to the Zabbix server, while the agents can send data to the proxy installed in the same namespace as each agent. A cluster role allows Zabbix to access resources in the cluster via the Kubernetes API. While the cluster role could be modified to restrict privileges given to Zabbix, this will result in some items becoming unsupported. We recommend keeping this the same if you want to get the most out of Kubernetes monitoring with Zabbix.

The Zabbix helm chart installs the kube-state-metrics project as a dependency. You may already be familiar with this project under the Kubernetes organization, which generates Prometheus format metrics based on the current state of the Kubernetes resources. In addition, if you have experience using Prometheus to monitor a cluster, you may already have this installed. If that is the case, you can point to this deployment rather than installing another one.

In this tutorial, we will install kube-state-metrics via the Zabbix helm chart.

For more information on skipping this step, refer to the values file in the Zabbix Kubernetes helm chart.

Installing the Zabbix Helm Chart

Now that we’ve explained how the Zabbix helm chart works, let’s go ahead and install it. In this example, we will assume that you have a running Zabbix 6.0 (or higher) instance that is reachable from the cluster you wish to monitor. I am running a 6.0 instance in a different cluster than the one we want to monitor. The server is reachable via the DNS name mdeforest.zabbix.atsgroup.io with a non-standard port of 31103.

We will start by installing the latest Zabbix helm chart. I recommend visiting zabbix.com/integrations/kubernetes to get any sources that may be referred to in this tutorial. There you will find a link to the Zabbix helm chart and templates. For the most part, we will follow the steps outlined in the readme.

Using a terminal window, I am going to make sure the active cluster is set to the cluster that I want to monitor:

kubectl config use-context <cluster context name>

I’m then going to add the Zabbix chart repo to my local helm repository:

helm repo add zabbix-chart-6.0 https://cdn.zabbix.com/zabbix/integrations/kubernetes-helm/6.0/

If you’re running Zabbix 6.2 or newer, change the references to 6.0 in this command to 6.2.

Depending on your circumstances, you will need to set a few values for the installation. In most cases, you only need to set a few environment variables for the Zabbix agent and the proxy. The complete list of values and environment variables is available in the helm chart repo, alongside the agent and proxy images on Docker Hub.

In this case, I’m setting the passive server environment variable for the agent to allow any IP to connect. For the proxy, I am setting the server host accessible from the proxy alongside the non-standard port. I’ve also set here some variables related to cache size. These variables may depend on your cluster size, so you may need to play around with them to find the correct values.

Now that I have the values file ready, I’m ready to install the chart. So, we’ll use the following command. Of course, the chart path might vary depending on what version of the chart you’re using.

helm install -f </path/to/values/file> [-n <namespace>] zabbix zabbix-chart-6.0/zabbix-helm-chart

You can also optionally add a namespace. You must wait until everything is running, so I’ll check just that with the following:

watch kubectl get pods

Now that everything is installed, we’re ready to set up hosts in Zabbix that will be associated with the cluster. The last step before we have all the information we need is to obtain the token created via the service account installed with the helm chart. We’ll get this by running the next command, which is the name of the service account that was created:

kubectl get secret -o jsonpath={.data.token} zabbix service-account | base64 -d

This will get the secret created for the service account and grab just the token from that, which is passed to the base64 utility to decode it. Be sure to copy that value somewhere because you’ll need it for later.

You’ll also need the Kubernetes API endpoint. In most cases, you’ll use the proxy installed rather than the server directly or a proxy outside the cluster. If this is the case, you can use the service DNS for the API. We should be able to reach it by pointing to https://kubernetes.default.svc.cluster.local:443/api.

If this is not the case, you can use the output from the command:

kubectl cluster-info

Now, let’s head over to the Zabbix UI. All the templates we need are shipped in Zabbix 6. If for some reason, you can’t find them, they are available for download and import by visiting the integrations page that I pointed out earlier on the Zabbix site.

Adding the Proxy

We will add our proxy by heading to Administration -> Proxies:

Click Create Proxy. Because this is an active proxy by default, we only need to specify the proxy name. If you didn’t make any changes to the helm chart, this should default to zabbix-proxy. If you’d like to name this differently, you can change the environment variable zbx_hostname for the proxy in the helm chart. We’re going to leave it as the default for now. You’re going to enter this name and then click “Add.” After a few minutes, you’ll start to see that it says that the proxy has been seen.
Create a Host Group to put hosts related to Kubernetes. For this example, let’s create one, which we’ll call Kubernetes.
Head to the host page under configuration and click Create Host. The first host will collect metrics related to monitoring Kubernetes nodes, and we’ll discover nodes and create new hosts using Zabbix low-level discovery.
Give this host the name Kubernetes Nodes. We’ll also assign this host to the Kubernetes host group we created and attach the template Kubernetes nodes by HTTP.
Change the line “Monitored by proxy” to the proxy created earlier, called zabbix-proxy.
Click the Macros tab and select “Inherited and host macros.” You should be able to see all the macros that may be set to influence what is monitored in your cluster. In this case, we need to change the first two macros. The first, {KUBE.API.ENDPOINT.URL}, should be set to the Kubernetes API endpoint. In our case, we can set it to what I mentioned earlier: default.svc.cluster.local:443/api. Next, the token should be set to the previously retrieved value from the command line.
lick Add. After a few minutes, you should start seeing data on the latest data page and new hosts on the host page representing each node.

Creating an Additional Host

Now let’s create another host that will represent the metrics available via the Kubernetes API and the kube-state-metrics endpoint.

Click Create Host again, name this host Kubernetes Cluster State, and add it to the Kubernetes group again.
Let’s also attach the Kubernetes Cluster State template by HTTP. Again, we’re going to choose the proxy that we created earlier.
In the Macro section, change the kube.api.url to the same thing we used before, but this time leave off the /api at the end. Simply: default.svc.cluster.local:443. Be sure to set the token as we did before.
Assuming nothing else was changed in the installation of the helm chart, we can now add that host.

After a few minutes, you should receive metrics related to the cluster state, including hosts representing the kubelet on each node.

What’s Next?

Now you’re all set to start monitoring your Kubernetes cluster in Zabbix! Give it a try, and let us know your thoughts in the comments.

In the next blog post, we’ll look at what you can do with your newly monitored cluster and how to get the most out of it.

If you’d like help with any of this, ATS has advanced monitoring, orchestration, and automation skills to make this process a snap. Set up a 15-minute with our team to go through any questions you have.

About the Author

Michaela DeForest is a Platform Engineer for The ATS Group. She is a Zabbix Certified Specialist on Zabbix 6.0 with additional areas of expertise, including Terraform, Amazon Web Services (AWS), Ansible, and Kubernetes, to name a few. As ATS’s resident authority in DevOps, Michaela is critical in delivering cutting-edge solutions that help businesses improve efficiency, reduce errors, and achieve a faster ROI.

About ATS Group: The ATS Group provides a fully inclusive set of technology services and tools designed to innovate and transform IT. Their systems integration, business resiliency, cloud enablement, infrastructure intelligence, and managed services help businesses of all sizes “get IT done.” With over 20 years in business, ATS has become the trusted advisor to nearly 500 customers across multiple industries. They have built their reputation around honesty, integrity, and technical expertise unrivaled by the competition.

How to investigate and take action on security issues in Amazon EKS clusters with Amazon Detective – Part 2

2022-12-05 Marshall Jones

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/how-to-investigate-and-take-action-on-security-issues-in-amazon-eks-clusters-with-amazon-detective-part-2/

In part 1 of this of this two-part series, How to detect security issues in Amazon EKS cluster using Amazon GuardDuty, we walked through a real-world observed security issue in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster and saw how Amazon GuardDuty detected each phase by following MITRE ATT&CK tactics.

In this blog post, we’ll walk you through investigative techniques to use with Amazon Detective, paired with the GuardDuty EKS and malware findings from the security issue. After we have identified impacted resources through our investigation, we’ll provide example remediation tactics and preventative controls to address and help prevent security issues in EKS clusters.

Amazon Detective can help you investigate security issues and related resources in your account. Detective provides EKS coverage that you can enable within your accounts. When this coverage is enabled, Detective can help investigate and remediate potentially unauthorized EKS activity that results from misconfiguration of the control plane nodes or application. Although GuardDuty is not a prerequisite to enable Detective, it is recommended that you enable GuardDuty to enhance the visualization capabilities in Detective with GuardDuty findings.

Prerequisites

You must have the following services enabled in your AWS account to generate and investigate findings associated with EKS security events in a similar manner as outlined in this blog. If you do not have GuardDuty enabled, you can still investigate with Detective, but in a limited capacity.

Amazon GuardDuty, along with these features of GuardDuty:
- Kubernetes Protection
- Malware Protection
Amazon Detective, along with this feature of Detective:
- EKS audit logs

Investigate with Amazon Detective

In the five phases we walked through in part 1, we discussed GuardDuty findings and MITRE ATT&CK tactics that can help you detect and understand each phase of the unauthorized activity, from the initial misconfiguration to the impact on our application when the EKS cluster is used for crypto mining.

The next recommended step is to investigate the EKS cluster and any associated resources. Amazon Detective can help you to investigate whether there was any other related unauthorized activity in the environment. We will walk through Detective capabilities for visualizing and gathering important information to effectively respond to the security issue. If you’re interested in creating detailed incident response playbooks for your security team to follow in your own environment, refer to these sample AWS incident response playbooks.

Depending on your scenario, there are various resources you can use to start your investigation, such as Security Hub findings, GuardDuty findings, related Kubernetes subjects, or an AWS account’s AWS CloudTrail activity. For our walkthrough, we’ll start our investigation from the GuardDuty finding and use the EKS cluster resource to pivot to the Detective console, as shown in Figure 7. Although we initially focus on the EKS cluster, you could start from any entities that are supported in the Detective behavior graph structure in the Amazon Detective User Guide. For example, we could start directly with the Kubernetes subject system:anonymous and find activity associated with the anonymous user.

Figure 7: Example Detective popup from GuardDuty finding for EKS cluster

We’ll now go over the information that you would need to gather from Detective in order to investigate the example security issue.

To investigate EKS cluster findings with Detective

In the GuardDuty console, navigate to an individual finding and hover over Investigate with Detective. Choose one of the specific resources to start. In the image below, we selected the EKS cluster resource to investigate with Detective. You will need to gather some preliminary information about the IAM roles associated with the EKS cluster.
- Questions: When was the cluster created? What IAM role created the cluster? What IAM role is assigned to the cluster?
- Why it matters: If you are an incident responder, these details can potentially help you identify the owner of the cluster and help you determine what IAM principals are involved.
- What next: Start looking into each IAM principal’s activity, as seen in CloudTrail, to investigate whether the IAM entity itself is potentially compromised or what other resources may have been impacted.
Figure 8: Detective summary page for EKS cluster metadata details
Next, on the EKS cluster overview page, you can see the container details associated with the cluster.
- Question: What are some of the other container details for the cluster? Does anything look out of the ordinary? Is it using a public image? Is it missing a network policy?
- Why it matters: Based on the architecture related to this cluster, you might be able to use this information to determine whether there are unauthorized containers. The contents of unauthorized containers will depend on your organization but typically consist of public images or unauthorized RBAC, pod security policies, or network policy configurations. It’s important to keep in mind that when you look at data in Detective, the scope time is very important. When you pivot from a GuardDuty finding, the scope time will be set to the first time the GuardDuty finding was seen to the last time the finding was seen. The container details reflect the containers that were running during the selected scope time. Changing the scope time might change the containers that are listed in the table shown in Figure 9.
- What next: Information found on this page can help to highlight unauthorized resources or configurations that will need to be remediated. You will also need to look at how these resources were initially created and if there are missing guardrails that should have been created during the provisioning of the cluster.
Figure 9: Detective summary page for EKS container metadata details
Finally, you will see associated security findings with this specific EKS cluster, similar to Figure 10, at the bottom of the EKS cluster overview page in Detective.
- Question: Are there any other security findings associated with this cluster that I previously was not aware of?
- Why it matters: In our example scenario, we walked through the findings that were initially detected and the events that unfolded from those findings. After further investigation, you might see other findings that were not part of the original investigation. This can occur if your security team is only investigating specific findings or severity values. The finding for PrivilegeEscalation:Kubernetes/PrivilegedContainer informs you that a privileged container was launched on your Kubernetes cluster by using an image that has never before been used to launch privileged containers in your cluster. A privileged container has root level access to the host. The other finding, Persistence:Kubernetes/ContainerWithSensitiveMount, informs you that a container was launched with a configuration that included a sensitive host path with write access in the volumeMounts section. This makes the sensitive host path accessible and writable from inside the container. Any finding associated to the suspicious or compromised cluster is valuable because it provides additional insight into what the unauthorized entity was trying to accomplish after the initial detection.
- What next: With Detective, you might want to continue your investigation by selecting each of these findings and reviewing all details related to the finding. Depending on the findings, you could bring in additional team members to help investigate further. For this example, we will move on to the next step.
Figure 10: Example Detective summary of security findings associated with the EKS cluster
Shift from the EKS cluster overview section to the Kubernetes API activity section, similar to Figure 11 below. This will give you the opportunity to dig into the API activity associated with this cluster.
1. Question: What other Kubernetes API activity was attempted from the cluster? Which API calls were successful? Which API calls failed? What was the unauthorized user trying to do?
2. Why it matters: It’s important to determine which actions were successfully invoked by the unauthorized user so that appropriate remediation actions can be taken. You can look at trends of successful and failed API calls, and can even search by Subject, IP address, or Kubernetes API call.
3. What next: You might want to look at all cluster role binding from days before the first GuardDuty finding was seen to determine if there was any other suspicious activity you should be investigating regarding the cluster.
Figure 11: Example Detective summary page for Kubernetes API activity on the EKS cluster
Next, you will want to look at the Newly observed Kubernetes API calls section, similar to Figure 12 below.
- Question: What are some of the more recent Kubernetes API calls? What are they trying to access right now and are they successful? Do I need to start taking action for other resources outside of EKS?
- Why it matters: This data shows Kubernetes subjects who were observed issuing API calls to this cluster for the first time during our scope time. Detective provides you this information by keeping a baseline of the activity associated with supported AWS resources. This can help you more quickly determine whether activity might be suspicious and worth looking into. In our example, we used the search functionality to look at API calls associated with the built-in Kubernetes secrets management. A common way to start your search is to see if an unauthorized user has successfully accessed any secrets, which can help you determine what information you might want to search in the overall API call volume section discussed in step 4.
- What next: If the unauthorized user has successfully accessed any secret, those secrets should be marked as compromised, and they should be rotated immediately.
Figure 12: Example Detective summary for newly observed Kubernetes API calls from the EKS cluster
You can also consider the following question when you look at the Newly observed Kubernetes API calls section.
- Question: Has the IP address associated with the finding been communicating with any other resources in our environment, and if so, what are the details of that communication?
- Why it matters: To answer this question, you can use Detective’s search functionality and the ability to use wild cards to search for IP addresses with the same first three octets. Also note that you can use CIDR notation to search, as well. Based on the results in the example in Figure 13, you can see that there are a number of related IP addresses associated with the environment. With this information, you now can look at the traffic associated with these different IPs and what resources they were communicating with.
Figure 13: Example Detective results page from a query against IP addresses associated with the EKS cluster
You can select one of the IP addresses in the search results to get more information related to it, similar to Figure 14 below.
1. Question: What was the first time an IP address was observed in the environment? When was the last time it was observed?
2. Why it matters: You can use this information to start isolating where unauthorized activity is coming from and what actions are being taken. You can also start creating a time series of unauthorized activity and scope.
3. What next: You can repeat some of the previous investigation steps for each IP address, like looking at the different tabs to review New behavior, Resource interaction, and Kubernetes activity.
Figure 14: Example Detective results page for specific IP address and associated metadata details

In summary, we began our investigation with a GuardDuty finding about an anonymous API request that was successful in using system:anonymous on one of our EKS clusters. We then used Detective to investigate and visualize activity associated with that EKS cluster, such as volume of successful or unsuccessful API requests, where and when those actions were attempted and other security findings associated with the resource. Once we have completed the investigation, we can confirm scope and impact of the security event and start moving towards taking action.

Remediation techniques for Amazon EKS

In this section, we will focus on how to remediate the security issue in our example. Your actions will vary based on your organization and the resources affected. It’s important to note that these actions will impact the EKS cluster and associated workloads, and should accordingly be performed by or coordinated with the cluster operator.

Before you take action on the EKS cluster, you will need to preserve forensic artifacts and evidence for the impacted EKS resources. The order of operations for these actions matters, because you want to get all the data from forensic artifacts in order to determine the overall impact to the resources affected. If you quarantine resources before you capture forensic artifacts, there is a risk that running processes will be interrupted or that the malware attempts to destroy resources that are valuable to a forensics investigation, to cover its tracks.

To preserve forensic evidence

Enable termination protection on the impacted worker node and change the shutdown behavior to Stop.
Label the offending pod or node with a label indicating that it is part of an active investigation.
Cordon the worker node.
Capture both volatile (temporary memory) and non-volatile (Amazon EBS snapshots) artifacts on the worker node.

Now that you have the forensic evidence, you can start to quarantine your EKS resources to restrict unauthorized network communication. The main objective is to prevent the affected EKS pods from communicating with internal resources or exfiltrating data externally.

To quarantine EKS resources

Isolate the pod by creating a network policy that denies ingress and egress traffic to the pod.
Attach a security group to the host and remove inbound and outbound rules. Take this action if you believe the underlying host has been compromised.
Depending on existing inbound and outbound rules on the security group, the connections will either be tracked or untracked. Applying an isolation security group will drop untracked connections. For tracked connections, new connections with the host will not be allowed from the isolation security group, but existing tracked connections will not be interrupted.

Important: This action will affect all containers running on the host.
Attach a deny rule for the EKS resources in a network access control list (network ACL). Because network ACLs are stateless firewalls, all connections will be interrupted, whether they are tracked or untracked connections.

Important: This action will affect all subnets using the network ACL and all resources within those subnets.

At this point, the affected EKS resources are quarantined, but the cluster is still configured to allow anonymous, unauthenticated access. You will need to remove all unauthorized permissions that were created or added.

To remove unauthorized permissions

Update the RBAC configuration to remove system:anonymous access.
Revoke temporary security credentials that are assigned to the pod or worker node, if necessary. You can also remove the IAM role associated with the EKS resources.

Note: Removing IAM policies or attaching IAM policies to restrict permissions will affect the resources that are using the IAM role.
Remove any unauthorized ClusterRoleBinding created by the system:anonymous user.
Redeploy the compromised pod or workload resource.

The actions taken so far primarily target the EKS resource, but based on our Detective investigation, there are other actions you might need to take. Because secrets were involved that could be used outside of the EKS cluster, those secrets will need to be rotated wherever they are referenced. Detective will also suggest additional areas where you can investigate and remediate additional unauthorized activity in your AWS account.

It is important that your team go through game days or run-throughs for investigating and responding to different scenarios in order to make sure the team is prepared. You can run through the EKS security workshop to get your security team more familiar with remediation for EKS.

For more information about responding to EKS cluster related security issues, refer to GuardDuty EKS remediation in the GuardDuty User Guide and the EKS Best Practices Guide.

Preventative controls for EKS

This section covers several preventative controls that you can use to protect EKS clusters.

How can I prevent external access to the EKS cluster?

To help prevent external access to your EKS clusters, limit the exposure of your API server. You can achieve that in two ways:

Set the API server endpoint access to Private. This will effectively forbid anyone outside of the VPC to send Kubernetes API requests to your EKS cluster.
Set an IP address allow list for the EKS cluster public access endpoint.

How can I prevent giving admin access to the EKS cluster?

To help prevent an EKS cluster user from granting any type of access to anonymous or unauthenticated users, you can set up a ValidatingAdmissionWebhook. This is a special type of Kubernetes admission controller that can be configured in the Kubernetes API. (To learn how to build serverless admission webhooks, see the blog post Building serverless admission webhooks for Kubernetes with AWS SAM.)

The ValidatingAdmissionWebhook will deny a Kubernetes API request that matches all of the following checks:

The request is creating or modifying a ClusterRoleBinding or RoleBinding.
The subjects section contains either of the following:
- The user system:anonymous
- The group system:unauthenticated

How can I prevent malicious images from being deployed?

Now that you have set controls to prevent external access to the EKS cluster and prevent granting access to anonymous users, you can focus on preventing the deployment of potentially malicious images.

Malicious container images can have different origins, including:

Images stored in public or unauthorized registries
Images replacing the ones that are stored in authorized registries
Authorized images that contain software with existing or newly discovered vulnerabilities

You can address these sources of malicious images by doing the following:

Use admission controllers to verify that images meet your organization’s requirements, including for the image origin. You can also refer to this this blog post to implement a solution with a webhook and admission controllers.
Enable tag immutability in your registry, a control that prevents an actor from maliciously replacing container images without changing the image’s tags. Additionally, you can enable an AWS Config rule to check tag immutability
Configure another ValidatingAdmissionWebhook that will only accept images if they meet all of the following criteria.
1. Images that come from approved registries.
2. Images that pass the vulnerability scan during deployment time.
3. Images that are signed by a trusted party. Amazon Elastic Container Registry (Amazon ECR) is working on a product enhancement to store image signatures. Currently, you can use an open-source cosign tool to verify and store image signatures.
  
  Note: These criteria can vary based on your use case and internal security and compliance standards.

The above controls will help prevent the deployment of a vulnerable, unauthorized, or potentially malicious container image.

How can I prevent lateral movement inside the cluster?

To prevent lateral movement inside the cluster, it is recommended to use network policies, as follows:

Enforce Kubernetes network policies to enforce ingress and egress controls within the cluster. You can implement these policies by following the steps in the Securing your cluster with network policies EKS workshop.

It’s important to note that you could use security groups for the same purpose, but pod security groups should only be used if the cluster is compromised and when you want to control the traffic between a pod and a resource that resides in the VPC, not inter-pod traffic.

In this section, we’ve reviewed different preventative controls that could have helped mitigate our example security incident. With the first preventative control, we could have prevented external actors from connecting to the API server. The second control could have prevented granting access to anonymous users. The third control could have prevented the deployment of an unauthorized or vulnerable container image. Finally, the fourth control could have helped limit the impact of the deployed vulnerable images to only the pods where the images were deployed, making it harder to laterally move to other pods in the cluster.

Conclusion

In this post, we walked you through how to investigate an EKS cluster related security issue with Amazon Detective. We also provided some recommended remediation and preventative controls to put in place for the EKS cluster specific security issues. When pairing GuardDuty’s ability for continuous threat detection and monitoring with Detective’s organization and visualization capabilities, you enable your security team to conduct faster and more effective investigation. By providing the security team the ability quickly view an organized set of data associated with security events within your AWS account, you reduce the overall Mean Time to Respond (MTTR).

Now that you understand the investigative capabilities with Detective, it’s time to try things out! It is important that you provide a mechanism for your security team to practice detection, investigation, and remediation techniques using security incident response simulations. By periodically running simulations, your security team will be prepared to quickly respond to possible security events. You can find more detailed incident response playbooks that can assist you in preparing for events in your environment, see these sample AWS incident response playbooks.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a thread on Amazon GuardDuty re:Post.

Want more AWS Security news? Follow us on Twitter.

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

2022-11-30 Purna Sanyal

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-1-offerup-modernized-devops-platform-with-amazon-eks-and-flagger-to-accelerate-time-to-market/

In this two part series, we discuss the challenges faced by OfferUp, a Digital Native customer, to meet business growth and time-to-market. Their journey involved modernizing their existing DevOps platform, from the traditional monolith virtual machine (VM) based architecture to modern containerized architecture and running cloud-native applications for secured progressive delivery to accelerate time to market. This series will provide strategies, architecture patterns, and technical steps you can adopt to become more agile and innovative like OfferUp has.

OfferUp engineers were using the homegrown DevOps platform to build and release new services on the marketplace platform. In this first post, we discuss the key challenges encountered by OfferUp engineers with the existing DevOps platform, as well as how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger, automating production releases with progressive delivery techniques for faster time-to-market with new products and services. Amazon EKS is a managed container service to run and scale Kubernetes applications in the cloud or on-premises.

Previous DevOps architecture

OfferUp is a leading online and mobile customer to customer (C2C) marketplace where users can both buy and sell goods on the platform. Users can browse and purchase products from a broad range of categories, including furniture, clothing, sports equipment, toys, and many more. As a mobile-first company, OfferUp puts a great deal of emphasis on in-person communication between buyers and sellers.

OfferUp built a home grown, self-managed DevOps platform. This platform used a set of manual processes and third-party applications that allows both developers and operations engineers to build and deploy code to a production environment. The DevOps pipeline included topic areas such as source code control, continuous integration/continuous delivery (CI/CD), microservices, as well as development and test Methodologies. The following diagram depicts the previous architecture of OfferUp’s DevOps platform, which was self-managed on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1: Previous DevOps architecture of OfferUp

OfferUp used GitHub for code repositories. Once the source code was committed in the code repository, Jenkins pulled the source code from code repositories on a scheduled or on-demand basis and built Amazon Machine Images (AMI). The built image was deployed in production by a custom built deployment tool, Vanaheim, which supports one-box canary deployment and full roll-out deployment strategies. The DevOps engineers used to manually create a deployment job in the Vanaheim portal and then manually monitor the test success rate and service metrics to detect any impact from the deployment. Once the success rate was reached, a full production roll out was performed from the Vanaheim portal.

Key challenges with previous DevOps pipeline

In 2020, OfferUp experienced significant transaction volume growth on its Marketplace platform with the increase of its user base. With OfferUp’s acquisition of LetGo in 2020, there was a need to build a scalable DevOps platform to support future integration and organic growth. The previous DevOps platform, designed and deployed over seven years ago, had reached the limits of its scalability, and could no longer keep up with the platform’s growth. The previous architecture was expensive to run and had a complex infrastructure that made it difficult to upgrade and add new features.

The following key factors drove the push for modernization:

Manual verification was required to check if the code was correctly deployed in one of the servers in production, and if the deployment was right in one server, then it was rolled out to other production servers. Full Rollout to production wasn’t automated due to frequent failures requiring manual rollbacks.
The previous platform required a longer deployment time (1–2 hours) due to the authoritative batch process, which sometimes caused delays in releasing and testing of new features.
The self-managed nature of the Jenkins and Vanaheim clusters was consuming far too much engineering time. Most of the institutional knowledge of this legacy platform was lost over the years and it didn’t align with OfferUp’s philosophy of small DevOps engineering teams. Innovation had stalled partly due to the difficulty of simultaneously upgrading the DevOps platform and releasing new features.

DevOps platform automation with Flagger and Gloo Ingress Controller on Amazon EKS

A key requirement for the next-generation system was that the new architecture would reduce the operational burden on engineering teams, deployment lifecycle, and total cost of ownership. OfferUp evaluated multiple managed container orchestration platforms for its DevOps Platform. It finally selected Amazon EKS for high availability, reducing the average time to deploy a change to the stack from hours to just a few minutes and reducing the complexity in managing and upgrading the Kubernetes cluster. On the Amazon EKS platform, OfferUp uses Flagger, a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements several deployment strategies (Canary releases, A/B testing, and Blue/Green mirroring) using the Gloo Edge ingress controller for traffic routing. Datadog is used as an observability service for monitoring the health of the deployments and effectively managing the canary to progressive delivery. For release analysis, Flagger runs a query on Datadog logs and uses Slack for alerting and notifications. The cloud native technology components of the architecture are described as follows:

Kubernetes and Amazon EKS – Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Kubernetes is a graduate project in the CNCF. Amazon EKS is a fully-managed, certified Kubernetes conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on AWS. Amazon EKS integrates with core AWS services, such as Amazon CloudWatch, Auto Scaling Groups, and AWS Identity and Access Management (IAM) to provide a seamless experience for monitoring, scaling, and load balancing your containerized applications.

Helm – Helm manage Kubernetes applications. Helm Charts define, install, and upgrade even the most complex Kubernetes application. Charts are easy to create, version, share, and publish. If Kubernetes were an operating system, then Helm would be the package manager. Helm is a graduate project in the CNCF and is maintained by the Helm community.

Flagger – Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as HTTP requests success rate, requests average duration, and pods health. Based on the set thresholds, a canary is either promoted or aborted and its analysis is pushed to a Slack channel. Flagger became a CNCF project – part of the Flux family of GitOps tools.

Gloo Edge – Gloo Edge is a feature-rich, Kubernetes-native ingress controller. Gloo Edge is exceptional in its function-level routing; its support for legacy apps, microservices, and serverless; its discovery capabilities; and its tight integration with leading open-source projects. Gloo Edge is uniquely designed to support hybrid applications, in which multiple technologies, architectures, protocols, and clouds can coexist.

Observability platform – Datadog’s integrations with Kubernetes, Docker, and AWS will let you track the full range of Amazon EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications with features like auto discovery to track services across containers, sophisticated graphing, and alerting options.

Modernized DevOps architecture

In the new architecture, OfferUp uses Github as a version control tool and Github actions as their CI/CD tool. On every Pull request, tests are run, artifacts are built and stored in the JFrog Artifactory, and docker Images are stored in the Amazon Elastic Container Registry (Amazon ECR). Separate deployment pipelines are triggered based on the environment (dev, staging, and production) of choice. Flagger detects any changes in the version of the application and gradually shifts production traffic to the canary. It measures the requests success rate and average response duration metrics from Datadog to decide full rollout in production. For an application deployment, a canary promotion can be defined using Flagger’s custom resource. Flagger rolls back the deployment when the success rate falls below the defined desired success rate metrics.

Figure 2: Modernized DevOps architecture of OfferUp

With the modernized DevOps platform, OfferUp moved from monolithic to microservice architecture where front-end applications and GraphQL runs on the Amazon EKS cluster. The production cluster runs 110 services and 650+ pods on 60 nodes. The cluster scales up to 100 nodes with Amazon Auto Scaling group based on the traffic pattern. On the networking front, the cluster has a private endpoint and uses both VPC CNI plugin, and the CoreDNS add-on. There are four Amazon EKS clusters, one each for the production, test, utility, and the staging environments. OfferUp has a plan to explore Karpenter open-source autoscaling project, and it will move new applications to the Amazon EKS cluster, allowing the total node counts to scale up to 200.

Benefits of modernized architecture

The new architecture helped OfferUp make automated decisions to deploy new releases and improve the time to market while reducing unplanned production downtime

Faster deployments and Quicker rollbacks – The new architecture reduces the Service Deployment time from one hour down to five minutes, and automates rollback time to five minutes from the manual rollback time of one hour.
Automate deployment of new releases – The lack of canary deployment processes in the previous architecture required OfferUp engineers to manually intervene to validate the deployment status, which led to administrative overhead and production outages. The canary deployments take care of the traffic shifting by automatically measuring the requests’ success rate and latency metrics from Datadog and subsequently release the service to production. Deployments are automatically rolled back when the success rate falls below the defined success rate metric thresholds.
Simplified Configuration – Configuration has been simplified drastically and integrated within the CI/CD pipeline in the new architecture, thereby reducing configuration complexity, eliminating manual processes, and saving Developers time.
More time to Focus on Innovation – With fully automated progressive delivery, the developers no longer need to spend time testing and releasing source code in production. Similarly, migrating from a Self-managed DevOps platform to the Managed Amazon EKS services lowered the DevOps platform’s infrastructure management burden on the engineering team. This helps developers spend more time focusing on building and testing new features and innovations.
Cost reduction – Moving from self-managed Amazon EC2-based architecture to the Amazon EKS cluster reduced the cost of operations through shared nodes and improved pod density. The previous architecture was using 200 nodes of Amazon EC2 instances. The same workload was moved to a 50 nodes Amazon EKS cluster. Furthermore, custom applications (Vanaheim and Jenkins) were retired, further reducing the costs.

Conclusion

In this post, you see how OfferUp embarked on the journey to modernize its DevOps platform to support its growth and developers’ velocity. The key factors that drove the modernization decisions were the ability to scale the platform to support the automated testing of features in production, the faster release of new features, cost reduction, and to facilitate future innovation. The modernized DevOps platform on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design opens up a lot of headroom for growth.

We encourage you to look into modernizing your existing CI/CD pipeline on Amazon EKS with the Flagger progressive delivery mechanism. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities, such as new Kubernetes version deployments.

In the next part of the series, you’ll discover how to implement Flagger and Gloo Edge Ingress Controller on Amazon EKS to automate the release process for applications running on Kubernetes.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

About the authors:

New – Amazon ECS Service Connect Enabling Easy Communication Between Microservices

2022-11-28 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ecs-service-connect-enabling-easy-communication-between-microservices/

Microservices architectures are a well-known software development approach to make applications composed of small independent services that communicate over well-defined application programming interfaces (APIs). Customers faced challenges when they started breaking down their monolith applications into microservices, as it required specialized networking knowledge to communicate internally with other microservices.

Amazon Elastic Container Services (Amazon ECS) customers have several solutions for service-to-service, but each one comes with some challenges and complications: 1) Elastic Load Balancing (ELB) needs to carefully plan for configuring infrastructure for high availability and incur additional infrastructure cost. 2) Using Amazon ECS Service Discovery often requires developers to write custom application code for collecting traffic metrics and for making network calls resilient. 3) Service mesh solutions such as AWS App Mesh run outside of Amazon ECS despite having advanced traffic monitoring and routing features between services.

Today, we are announcing the general availability of Amazon ECS Service Connect, a new capability that simplifies building and operating resilient distributed applications. ECS Service Connect provides an easy network setup and seamless service communication deployed across multiple ECS clusters and virtual private clouds (VPCs). You can add a layer of resilience to your ECS service communication and get traffic insights with no changes to your application code.

With ECS Service Connect, you can refer and connect to your services by logical names using a namespace provided by AWS Cloud Map and automatically distribute traffic between ECS tasks without deploying and configuring load balancers. You can set some safe defaults for traffic resilience, such as health checking, automatic retries for 503 errors, and connection draining, for each of your ECS services. Additionally, the Amazon ECS console provides easy-to-use dashboards with real-time network traffic metrics for operational convenience and simplified debugging.

Getting Started with Amazon ECS Service Connect
To get started with the ECS Service Connect, you can specify a namespace as part of creating an ECS cluster or create one in the Cloud Map. A namespace represents a way to structure your services and can span across multiple ECS clusters residing in different VPCs. All ECS services that belong to a specific namespace can communicate with existing services in the namespaces, provided existing network-level connectivity.

You can also see a list of Cloud Map namespaces in Namespaces in the left navigation pane of the Amazon ECS console. When you select a namespace, it shows a list of services with the same namespace from two different ECS clusters with database services (db-mysql, db-redis) and backend services (webui, appserver).

When you create an ECS cluster, you can select one of the namespaces in the Default namespaces of the Networking setting. ECS Service Connect is enabled for all new ECS services running in both AWS Fargate and Amazon EC2 instances. To enable all existing services, you would need to redeploy with either a new version of ECS-optimized Amazon Machine Image (AMI), or with a new Fargate Agent that supports ECS Service Connect.

Or, you can simply create a cluster via AWS Command Line Interface (AWS CLI) with serviceConnect parameter and a default Cloud Map namespace name for service discovery purposes.

$ aws ecs create-cluster
     --cluster "svc-cluster-2"
     --serviceConnect {
       "defaultNamespace": "svc-namespace"
}

This command will create an ECS cluster with the namespace on AWS’s behalf. If you would like to use an already existing Cloud Map namespace, you can simply pass the name of the existing namespace here.

Next, let’s create a service with a task definition and expose your web user-interface server using ECS Service Connect.

$ aws ecs create-service
--cluster "svc-cluster-2"
--service-name "webui"
--task-definition "webui-svc-cluster"
--serviceConnect {
  "enabled": true,
  "namespace": "svc-namespace",
  "services":
   [
      {
         "portName": "webui-port",
         "discoveryName": "webui-svc",
         "clientAliases": [
           {
              "port": 80, // *Required *//
              "dnsName": "webui-svc-domain" // * Optional *//
            }
        }
     ]
   ]
}

In this command, portName represents a reference to the container port, and clientAliases assigns the port number and DNS name, overriding the discovery name that is used in the endpoint. Each service has an endpoint URL that contains the protocol, a DNS name, and the port. You can select the protocol and port name in the task definition or the ECS service configuration. For example, an endpoint could be http://webui:80, grpc://appserver:8080, or http://db-redis:8888.

In the ECS console, you can see this configuration of ECS Service Connect for the webui service in the svc-cluster-2 cluster.

As you can see, you can run the same workloads across different clusters with the same clientAlias and namespace name for high availability. ECS Service Connect will intelligently load balance the traffic to the ECS tasks. To connect to services running in different ECS clusters, you need to specify the same namespace name for all your ECS services that need to talk to each other. ECS Service Connect will make your services discoverable to all other services in the same namespace.

Improving Service Resilience with Observability Data
You can collect traffic metrics with ECS Service Connect observability capabilities. By default, for each ECS service, you can see the number of healthy and unhealthy endpoints, along with inbound and outbound traffic volume.

ECS Service Connect supports HTTP/1, HTTP/2, gRPC, and TCP protocols. So, you can collect the number of requests, number of HTTP errors, and average call latency. For gRPC and TCP, you can see the total number of active connections. All of these metrics are pushed to Amazon CloudWatch or other AWS analytics services via custom log routing

In the Advanced menu, you can publish ECS Service Connect Agent logs for help in debugging in case of issues.

These metrics are only visible in the original interface of the CloudWatch console. When you use the CloudWatch console, switch to the original interface to see the additional metric dimensions of “discovery name” and “target discovery name” under the ECS grouping.

The default settings provide you with a starting point for building resilient applications, and you can fine-tune parameters to limit the impact of failures, latency spikes, and network fluctuations on your application behavior using AWS Management Console or dedicated ECS APIs.

Now Available
Amazon ECS Service Connect is available in all commercial Regions, except China, where Amazon ECS is available. ECS Service Connect is fully supported in AWS CloudFormation, AWS CDK, AWS Copilot, and AWS Proton for infrastructure provisioning, code deployments, and monitoring of your services. To learn more, see the Amazon ECS Service Connect Developer Guide.

My colleagues, Hemanth AVS, Senior Container Specialist SA, and Satya Vajrapu, Senior DevOps Consultant, prepared a hands-on workshop to demonstrate an example of the ECS Service Connect. Join CON303 Networking, service mesh, and service discovery with Amazon ECS when you attend AWS re:Invent 2022.

Give it a try, and please send feedback to AWS re:Post for Amazon ECS or through your usual AWS support contacts.

– Channy