Tag Archives: Kubernetes

One small step closer to containerising service binaries

Post Syndicated from Grab Tech original https://engineering.grab.com/reducing-your-go-binary-size

Grab’s engineering teams currently own and manage more than 250+ microservices. Depending on the business problems that each team tackles, our development ecosystem ranges from Golang, Java, and everything in between.

Although there are centralised systems that help automate most of the build and deployment tasks, there are still some teams working on different technologies that manage their own build, test and deployment systems at different maturity levels. Managing a varied build and deploy ecosystems brings their own challenges.

Build challenges

  • Broken external dependencies.
  • Non-reproducible builds due to changes in AMI, configuration keys and other build parameters.
  • Missing security permissions between different repositories.

Deployment challenges

  • Varied deployment environments necessitating a bigger learning curve.
  • Managing the underlying infrastructure as code.
  • Higher downtime when bringing the systems up after a scale down event.

Grab’s appetite for customer obsession and quality drives the engineering teams to innovate and deliver value rapidly. The time that the team spends in fixing build issues or deployment-related tasks has a direct impact on the time they spend on delivering business value.

Introduction to containerisation

Using the Container architecture helps the team deploy and run multiple applications, isolated from each other, on the same virtual machine or server and with much less overhead.

At Grab, both the platform and the core engineering teams wanted to move to the containerisation architecture to achieve the following goals:

  • Support to build and push container images during the CI process.
  • Create a standard virtual machine image capable of running container workloads. The AMI is maintained by a central team and comes with Grab infrastructure components such as (DataDog, Filebeat, Vault, etc.).
  • A deployment experience which allows existing services to migrate to container workload safely by initially running both types of workloads concurrently.

The core engineering teams wanted to adopt container workloads to achieve the following benefits:

  • Provide a containerised version of the service that can be run locally and on different cloud providers without any dependency on Grab’s internal (runtime) tooling.
  • Allow reuse of common Grab tools in different projects by running the zero dependency version of the tools on demand whenever needed.
  • Allow a more flexible staging/dev/shadow deployment of new features.

Adoption of containerisation

Engineering teams at Grab use the containerisation model to build and deploy services at scale. Our containerisation effort helps the development teams move faster by:

  • Providing a consistent environment across development, testing and production
  • Deploying software efficiently
  • Reducing infrastructure cost
  • Abstracting OS dependency
  • Increasing scalability between cloud vendors

When we started using containers we realised that building smaller containers had some benefits over bigger containers. For example, smaller containers:

  • Include only the needed libraries and therefore are more secure.
  • Build and deploy faster as they can be pulled to the running container cluster quickly.
  • Utilise disk space and memory efficiently.

During the course of containerising our applications, we noted that some service binaries appeared to be bigger (~110 MB) than they should be. For a statically-linked Golang binary, that’s pretty big! So how do we figure out what’s bloating the size of our binary?

Go binary size visualisation tool

In the course of poking around for tools that would help us analyse the symbols in a Golang binary, we found go-binsize-viz based on this article. We particularly liked this tool, because it utilises the existing Golang toolchain (specifically, Go tool nm) to analyse imports, and provides a straightforward mechanism for traversing through the symbols present via treemap. We will briefly outline the steps that we did to analyse a Golang binary here.

  1. First, build your service using the following command (important for consistency between builds):

    $ go build -a -o service_name ./path/to/main.go
  2. Next, copy the binary over to the cloned directory of go-binsize-viz repository.
  3. Run the following script that covers the steps in the go-binsize-viz README.

    #!/usr/bin/env bash
    # This script needs more input parsing, but it serves the needs for now.
    mkdir dist
    # step 1
    go tool nm -size $1 | c++filt > dist/$1.symtab
    # step 2
    python3 tab2pydic.py dist/$1.symtab > dist/$1-map.py
    # step 3
    # must be data.js
    python3 simplify.py dist/$1-map.py > dist/$1-data.js
    rm data.js
    ln -s dist/$1-data.js data.js

    Running this script creates a dist folder where each intermediate step is deposited, and a data.js symlink in the top-level directory which points to the consumable .js file by treemap.html.

    # top-level directory
    $ ll
    -rw-r--r--   1 stan.halka  staff   1.1K Aug 20 09:57 README.md
    -rw-r--r--   1 stan.halka  staff   6.7K Aug 20 09:57 app3.js
    -rw-r--r--   1 stan.halka  staff   1.6K Aug 20 09:57 cockroach_sizes.html
    lrwxr-xr-x   1 stan.halka  staff        65B Aug 25 16:49 data.js -> dist/v2.0.709356.segments-paxgroups-macos-master-go1.13-data.js
    drwxr-xr-x   8 stan.halka  staff   256B Aug 25 16:49 dist
    # dist folder
    $ ll dist
    total 71728
    drwxr-xr-x   8 stan.halka  staff   256B Aug 25 16:49 .
    drwxr-xr-x  21 stan.halka  staff   672B Aug 25 16:49 ..
    -rw-r--r--   1 stan.halka  staff   4.2M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13-data.js
    -rw-r--r--   1 stan.halka  staff   3.4M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13-map.py
    -rw-r--r--   1 stan.halka  staff    11M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13.symtab

    As you can probably tell from the file names, these steps were explored on the segments-paxgroups service, which is a microservice used for segment information at Grab. You can ignore the versioning metadata, branch name, and Golang information embedded in the name.

  4. Finally, run a local python3 server to visualise the binary components.

    $ python3 -m http.server
    Serving HTTP on port 8000 ( ...

    So now that we have a methodology to consistently generate a service binary, and a way to explore the symbols present, let’s dive in!

  5. Open your browser and visit http://localhost:8000/treemap_v3.html:

    Of the 103MB binary produced, 81MB are recognisable, with 66MB recognised as Golang (UNKNOWN is present, and also during parsing there were a fair number of warnings. Note that we haven’t spent enough time with the tool to understand why we aren’t able to recognise and index all the symbols present).


    The next step is to figure out where the symbols are coming from. There’s a bunch of Grab-internal stuff that for the sake of this blog isn’t necessary to go into, and it was reasonably easy to come to the right answer based on the intuitiveness of the go-binsize-viz tool.

    This visualisation shows us the source of how 11 MB of symbols are sneaking into the segments-paxgroups binary.


    Every message format for any service that reads from, or writes to, streams at Grab is included in every service binary! Not cloud native!

How did this happen?

The short answer is that Golang doesn’t import only the symbols that it requires, but rather all the symbols defined within an imported directory and transitive symbols as well. So, when we think we’re importing just one directory, if our code structure doesn’t follow principles of encapsulation or isolation, we end up importing 11 MB of symbols that we don’t need! In our case, this occurred because a generic Message interface was included in the same directory with all the auto-generated code you see in the pretty picture above.

The Streams team did an awesome job of restructuring the code, which when built again, led to this outcome:

$$ ll | grep paxgroups
-rwxr-xr-x   1 stan.halka  staff   110M Aug 21 14:53 v2.0.709356.segments-paxgroups-macos-master-go1.12
-rwxr-xr-x   1 stan.halka  staff   103M Aug 25 16:34 v2.0.709356.segments-paxgroups-macos-master-go1.13
-rwxr-xr-x   1 stan.halka  staff        80M Aug 21 14:53 v2.0.709356.segments-paxgroups-macos-tinkered-go1.12
-rwxr-xr-x   1 stan.halka  staff        78M Aug 25 16:34 v2.0.709356.segments-paxgroups-macos-tinkered-go1.13

Not a bad reduction in service binary size!

Lessons learnt

The go-binsize-viz utility offers a treemap representation for imported symbols, and is very useful in determining what symbols are contributing to the overall size.

Code architecture matters: Keep binaries as small as possible!

To reduce your binary size, follow these best practices:

  • Structure your code so that the interfaces and common classes/utilities are imported from different locations than auto-generated classes.
  • Avoid huge, flat directory structures.
  • If it’s a platform offering and has too many interwoven dependencies, try to decouple the actual platform offering from the company specific instantiations. This fosters creating isolated, minimalistic code.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Running cost optimized Spark workloads on Kubernetes using EC2 Spot Instances

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/

This post is written by Kinnar Sen, Senior Solutions Architect, EC2 Spot 

Apache Spark is an open-source, distributed processing system used for big data workloads. It provides API operations to perform multiple tasks such as streaming, extract transform load (ETL), query, machine learning (ML), and graph processing. Spark supports four different types of cluster managers (Spark standalone, Apache Mesos, Hadoop YARN, and Kubernetes), which are responsible for scheduling and allocation of resources in the cluster. Spark can run with native Kubernetes support since 2018 (Spark 2.3). AWS customers that have already chosen Kubernetes as their container orchestration tool can also choose to run Spark applications in Kubernetes, increasing the effectiveness of their operations and compute resources.

In this post, I illustrate the deployment of scalable, resilient, and cost optimized Spark application using Kubernetes via Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EC2 Spot Instances. Learn how to save money on big data workloads by implementing this solution.


Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. Capacity pools are a group of EC2 instances that belong to particular instance family, size, and Availability Zone (AZ). If EC2 needs capacity back for On-Demand Instance usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance. This can be automated via the application and/or infrastructure deployments. Spot Instances are ideal for stateless, fault tolerant, loosely coupled and flexible workloads that can handle interruptions.

Amazon Elastic Kubernetes Service

Amazon EKS is a fully managed Kubernetes service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane. It provides a highly available and scalable managed control plane. It also provides managed worker nodes, which let you create, update, or terminate shut down worker nodes for your cluster with a single command. It is a great choice for deploying flexible and fault tolerant containerized applications. Amazon EKS supports creating and managing Amazon EC2 Spot Instances using Amazon EKS-managed node groups following Spot best practices. This enables you to take advantage of the steep savings and scale that Spot Instances provide for interruptible workloads running in your Kubernetes cluster. Using EKS-managed node groups with Spot Instances requires less operational effort compared to using self-managed nodes. In addition to launching Spot Instances in managed node groups, it is possible to specify multiple instance types in EKS managed node groups. You can find more in this blog.

Apache Spark and Kubernetes

When a spark application is submitted to the Kubernetes cluster the following happens:

  • A Spark driver is created.
  • The driver and the run within pods.
  • The Spark driver then requests for executors, which are scheduled to run within pods. The executors are managed by the driver.
  • The application is launched and once it completes, the executor pods are cleaned up. The driver pod persists the logs and remains in a completed state until the pod is cleared by garbage collection or manually removed. The driver in a completed stage does not consume any memory or compute resources.

Spark Deployment on Kubernetes Cluster

When a spark application runs on clusters managed by Kubernetes, the native Kubernetes scheduler is used. It is possible to schedule the driver/executor pods on a subset of available nodes. The applications can be launched either by a vanilla ‘spark submit’, a workflow orchestrator like Apache Airflow or the spark operator. I use vanilla ‘spark submit’ in this blog. is also able to schedule Spark applications on EKS clusters as described in this launch blog, but Amazon EMR on EKS is out of scope for this post.

Cost optimization

For any organization running big data workloads there are three key requirements: scalability, performance, and low cost. As the size of data increases, there is demand for more compute capacity and the total cost of ownership increases. It is critical to optimize the cost of big data applications. Big Data frameworks (in this case, Spark) are distributed to manage and process high volumes of data. These frameworks are designed for failure, can run on machines with different configurations, and are inherently resilient and flexible.

If Spark deploys on Kubernetes, the executor pods can be scheduled on EC2 Spot Instances and driver pods on On-Demand Instances. This reduces the overall cost of deployment – Spot Instances can save up to 90% over On-Demand Instance prices. This also enables faster results by scaling out executors running on Spot Instances. Spot Instances, by design, can be interrupted when EC2 needs the capacity back. If a driver pod is running on a Spot Instance, which is interrupted then the application fails and the application must be re-submitted. To avoid this situation, the driver pod can be scheduled on On-Demand Instances only. This adds a layer of resiliency to the Spark application running on Kubernetes. To cost optimize the deployment, all the executor pods are scheduled on Spot Instances as that’s where the bulk of compute happens. Spark’s inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions.

There are a couple of key points to note here.

  • The idea is to start with minimum number of nodes for both On-Demand and Spot Instances (one each) and then auto-scale usingCluster Autoscaler and EC2 Auto Scaling  Cluster Autoscaler for AWS provides integration with Auto Scaling groups. If there are not sufficient resources, the driver and executor pods go into pending state. The Cluster Autoscaler detects pods in pending state and scales worker nodes within the identified Auto Scaling group in the cluster using EC2 Auto Scaling.
  • The scaling for On-Demand and Spot nodes is exclusive of one another. So, if multiple applications are launched the driver and executor pods can be scheduled in different node groups independently per the resource requirements. This helps reduce job failures due to lack of resources for the driver, thus adding to the overall resiliency of the system.
  • Using EKS Managed node groups
    • This requires significantly less operational effort compared to using self-managed nodegroup and enables:
      • Auto enforcement of Spot best practices like Capacity Optimized allocation strategy, Capacity Rebalancing and use multiple instances types.
      • Proactive replacement of Spot nodes using rebalance notifications.
      • Managed draining of Spot nodes via re-balance recommendations.
    • The nodes are auto-labeled so that the pods can be scheduled with NodeAffinity.
      • eks.amazonaws.com/capacityType: SPOT
      • eks.amazonaws.com/capacityType: ON_DEMAND

Now that you understand the products and best practices of used in this tutorial, let’s get started.

Tutorial: running Spark in EKS managed node groups with Spot Instances

In this tutorial, I review steps, which help you launch cost optimized and resilient Spark jobs inside Kubernetes clusters running on EKS. I launch a word-count application counting the words from an Amazon Customer Review dataset and write the output to an Amazon S3 folder. To run the Spark workload on Kubernetes, make sure you have eksctl and kubectl installed on your computer or on an AWS Cloud9 environment. You can run this by using an AWS IAM user or role that has the AdministratorAccess policy attached to it, or check the minimum required permissions for using eksctl. The spot node groups in the Amazon EKS cluster can be launched both in a managed or a self-managed way, in this post I use the former. The config files for this tutorial can be found here. The job is finally launched in cluster mode.

Create Amazon S3 Access Policy

First, I must create an Amazon S3 access policy to allow the Spark application to read/write from Amazon S3. Amazon S3 Access is provisioned by attaching the policy by ARN to the node groups. This associates Amazon S3 access to the NodeInstanceRole and, hence, the node groups then have access to Amazon S3. Download the Amazon S3 policy file from here and modify the <<output folder>> to an Amazon S3 bucket you created. Run the following to create the policy. Note the ARN.

aws iam create-policy --policy-name spark-s3-policy --policy-document file://spark-s3.json

Cluster and node groups deployment

Create an EKS cluster using the following command:

eksctl create cluster –name= sparkonk8 --node-private-networking  --without-nodegroup --asg-access –region=<<AWS Region>>

The cluster takes approximately 15 minutes to launch.

Create the nodegroup using the nodeGroup config file. Replace the <<Policy ARN>> string using the ARN string from the previous step.

eksctl create nodegroup -f managedNodeGroups.yml

Scheduling driver/executor pods

The driver and executor pods can be assigned to nodes using affinity. PodTemplates can be used to configure the detail, which is not supported by Spark launch configuration by default. This feature is available from Spark 3.0.0, requiredDuringScheduling node affinity is used to schedule the driver and executor jobs. Sample podTemplates have been uploaded here.

Launching a Spark application

Create a service account. The spark driver pod uses the service account to create and watch executor pods using Kubernetes API server.

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole='edit'  --serviceaccount=default:spark --namespace=default

Download the Cluster Autoscaler and edit it to add the cluster-name. 

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Install the Cluster AutoScaler using the following command:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Get the details of Kubernetes master to get the head URL.

kubectl cluster-info 

command output

Use the following instructions to build the docker image.

Download the application file (script.py) from here and upload into the Amazon S3 bucket created.

Download the pod template files from here. Submit the application.

bin/spark-submit \
--master k8s://<<MASTER URL>> \
--deploy-mode cluster \
--name 'Job Name' \
--conf spark.eventLog.dir=s3a:// <<S3 BUCKET>>/logs \
--conf spark.eventLog.enabled=true \
--conf spark.history.fs.inProgressOptimization.enabled=true \
--conf spark.history.fs.update.interval=5s \
--conf spark.kubernetes.container.image=<<ECR Spark Docker Image>> \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.podTemplateFile='../driver_pod_template.yml' \
--conf spark.kubernetes.executor.podTemplateFile='../executor_pod_template.yml' \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.33 \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=30 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.driver.memory=8g \
--conf spark.kubernetes.driver.request.cores=2 \
--conf spark.kubernetes.driver.limit.cores=4 \
--conf spark.executor.memory=8g \
--conf spark.kubernetes.executor.request.cores=2 \
--conf spark.kubernetes.executor.limit.cores=4 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.fast.upload=true \
s3a://<<S3 BUCKET>>/script.py \
s3a://<<S3 BUCKET>>/output 

A couple of key points to note here

  • podTemplateFile is used here, which enables scheduling of the driver pods to On-Demand Instances and executor pods to Spot Instances.
  • Spark provides a mechanism to allocate dynamically resources dynamically based on workloads. In the latest release of Spark (3.0.0), dynamicAllocation can be used with Kubernetes cluster manager. The executors that do not store, active, shuffled files can be removed to free up the resources. DynamicAllocation works well in tandem with Cluster Autoscaler for resource allocation and optimizes resource for jobs. We are using dynamicAllocation here to enable optimized resource sharing.
  • The application file and output are both in Amazon S3.

Output Files in S3

  • Spark Event logs are redirected to Amazon S3. Spark on Kubernetes creates local temporary files for logs and removes them once the application completes. The logs are redirected to Amazon S3 and Spark History Server can be used to analyze the logs. Note, you can create more instrumentation using tools like Prometheus and Grafana to monitor and manage the cluster.

Spark History Server + Dynamic Allocation


EC2 Spot Interruptions

The following diagram and log screenshot details from Spark History server showcases the behavior of a Spark application in case of an EC2 Spot interruption.

Four Spark applications launched in parallel in a cluster and one of the Spot nodes was interrupted. A couple of executor pods were terminated shut down in three of the four applications, but due to the resilient nature of Spark new executors were launched and the applications finished almost around the same time.
The Spark Driver identified the shut down executors, which handled the shuffle files and relaunched the tasks running on those executors.
Spark jobs

The Spark Driver identified the shut down executors, which handled the shuffle files and relaunched the tasks running on those executors.

Dynamic Allocation

Dynamic Allocation works with the caveat that it is an experimental feature.

dynamic allocation

Cost Optimization

Cost Optimization is achieved in several different ways from this tutorial.

  • Use of 100% Spot Instances for the Spark executors
  • Use of dynamicAllocation along with cluster autoscaler does make optimized use of resources and hence save cost
  • With the deployment of one driver and executor nodes to begin with and then scaling up on demand reduces the waste of a continuously running cluster

Cluster Autoscaling

Cluster Autoscaling is triggered as it is designed when there are pending (Spark executor) pods.

The Cluster Autoscaler logs can be fetched by:

kubectl logs -f deployment/cluster-autoscaler -n kube-system —tail=10  

Cluster Autoscaler Logs 


If you are trying out the tutorial, run the following steps to make sure that you don’t encounter unwanted costs.

Delete the EKS cluster and the nodegroups with the following command:

eksctl delete cluster --name sparkonk8

Delete the Amazon S3 Access Policy with the following command:

aws iam delete-policy --policy-arn <<POLICY ARN>>

Delete the Amazon S3 Output Bucket with the following command:

aws s3 rb --force s3://<<S3_BUCKET>>


In this blog, I demonstrated how you can run Spark workloads on a Kubernetes Cluster using Spot Instances, achieving scalability, resilience, and cost optimization. To cost optimize your Spark based big data workloads, consider running spark application using Kubernetes and EC2 Spot Instances.




Evolving Container Security With Linux User Namespaces

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/evolving-container-security-with-linux-user-namespaces-afbe3308c082

By Fabio Kung, Sargun Dhillon, Andrew Spyker, Kyle, Rob Gulewich, Nabil Schear, Andrew Leung, Daniel Muino, and Manas Alekar

As previously discussed on the Netflix Tech Blog, Titus is the Netflix container orchestration system. It runs a wide variety of workloads from various parts of the company — everything from the frontend API for netflix.com, to machine learning training workloads, to video encoders. In Titus, the hosts that workloads run on are abstracted from our users. The Titus platform maintains large pools of homogenous node capacity to run user workloads, and the Titus scheduler places workloads. This abstraction allows the compute team to influence the reliability, efficiency, and operability of the fleet via the scheduler. The hosts that run workloads are called Titus “agents.” In this post, we describe how Titus agents leverage user namespaces to improve the overall security of the Titus agent fleet.

Titus’s Multi-Tenant Clusters

The Titus agent fleet appears to users as a homogenous pool of capacity. Titus internally employs a cellular bulkhead architecture for scalability, so the fleet is composed of multiple cells. Many bulkhead architectures partition their cells on tenants, where a tenant is defined as a team and their collection of applications. We do not take this approach, and instead, we partition our cells to balance load. We do this for reliability, scalability, and efficiency reasons.

Titus is a multi-tenant system, allowing multiple teams and users to run workloads on the system, and ensuring they can all co-exist while still providing guarantees about security and performance. Much of this comes down to isolation, which comes in multiple forms. These forms include performance isolation (ensuring workloads do not degrade one another’s performance), capacity isolation (ensuring that a given tenant can acquire resources when they ask for them), fault isolation (ensuring that the failure of a part of the system doesn’t cause the whole system to fail), and security isolation (ensuring that the compromise of one tenant’s workload does not affect the security of other tenants). This post focuses on our approaches to security isolation.

Secure Multi-tenancy

One of Titus’s biggest concerns with multi-tenancy is security isolation. We want to allow different kinds of containers from different tenants to run on the same instance. Security isolation in containers has been a contentious topic. Despite the risks, we’ve chosen to leverage containers as part of our security boundary. To offset the risks brought about by the container security boundary, we employ some additional protections.

The building blocks of multi-tenancy are Linux namespaces, the very technology that makes LXC, Docker, and other kinds of containers possible. For example, the PID namespace makes it so that a process can only see PIDs in its own namespace, and therefore cannot send kill signals to random processes on the host. In addition to the default Docker namespaces (mount, network, UTS, IPC, and PID), we employ user namespaces for added layers of isolation. Unfortunately, these default namespace boundaries are not sufficient to prevent container escape, as seen in CVEs like CVE-2015–2925. These vulnerabilities arise due to the complexity of interactions between namespaces, a large number of historical decisions during kernel development, and leaky abstractions like the proc filesystem in Linux. Composing these security isolation primitives correctly is difficult, so we’ve looked to other layers for additional protection.

Running many different workloads multi-tenant on a host necessitates the prevention lateral movement, a technique in which the attacker compromises a single piece of software running in a container on the system, and uses that to compromise other containers on the same system. To mitigate this, we run containers as unprivileged users — making it so that users cannot use “root.” This is important because, in Linux, UID 0 (or root’s privileges), do not come from the mere fact that the user is root, but from capabilities. These capabilities are tied to the current process’s credentials. Capabilities can be added via privilege escalation (e.g., sudo, file capabilities) or removed (e.g., setuid, or switching namespaces). Various capabilities control what the root user can do. For example, the CAP_SYS_BOOT capability controls the ability of a given user to reboot the machine. There are also more common capabilities that are granted to users like CAP_NET_RAW, which allows a process the ability to open raw sockets. A user can automatically have capabilities added when they execute specific files via file capabilities. For example, on a stock Ubuntu system, the ping command needs CAP_NET_RAW:

One of the most powerful capabilities in Linux is CAP_SYS_ADMIN, which is effectively equivalent to having superuser access. It gives the user the ability to do everything from mounting arbitrary filesystems, to accessing tracepoints that can expose vital information about the Linux kernel. Other powerful capabilities include CAP_CHOWN and CAP_DAC_OVERRIDE, which grant the capability to manipulate file permissions.

In the kernel, you’ll often see capability checks spread throughout the code, which looks something like this:

Notice this function doesn’t check if the user is root, but if the task has the CAP_SYS_ADMIN capability before allowing it to execute.

Docker takes the approach of using an allow-list to define which capabilities a container receives. These can be extended or attenuated by the user. Even the default capabilities that are defined in the Docker profile can be abused in certain situations. When we looked into running workloads as unprivileged users without many of these capabilities, we found that it was a non-starter. Various pieces of software used elevated capabilities for FUSE, low-level packet monitoring, and performance tracing amongst other use cases. Programs will usually start with capabilities, perform any activities that require those capabilities, and then “drop” them when the process no longer needs them.

User Namespaces

Fortunately, Linux has a solution — User Namespaces. Let’s go back to that kernel code example earlier. The pcrlock function called the capable function to determine whether or not the task was capable. This function is defined as:

This checks if the task has this capability relative to the init_user_ns. The init_user_ns is the namespace that processes are initialially spawned in, as it’s the only user namespace that exists at kernel startup time. User namespaces are a mechanism to split up the init_user_ns UID space. The interface to set up the mappings is via a “uid_map” and “gid_map” that’s exposed via /proc. The mapping looks something like this:

This allows UIDs in user-namespaced containers to be mapped to host UIDs. A variety of translations occur, but from the container’s perspective, everything is from the perspective of the UID ranges (otherwise known as extents) that are mapped. This is powerful in a few ways:

  1. It allows you to make certain UIDs off-limits to the container — if a UID is not mapped in the user namespace to a real UID, and you try to examine a file on disk with it, it will show up as overflowuid / overflowgid, a UID and GID specified in /proc/sys to indicate that it cannot be mapped into the current working space. Also, the container cannot setuid to a UID that can access files owned by that “outside uid.”
  2. From the user namespace’s perspective, the container’s root user appears to be UID 0, and the container can use the entire range of UIDs that are mapped into that namespace.
  3. Kernel subsystems can then proceed to call ns_capable with the specific user namespace that is tied to the resource. Many capability checks are now done to a user namespace that is relative to the resource being manipulated. This, in turn, allows processes to exercise certain privileges without having any privileges in the init user namespace. Even if the mapping is the same across many different namespaces, capability checks are still done relative to a specific user namespace.

One critical aspect of understanding how permissions work is that every namespace belongs to a specific user namespace. For example, let’s look at the UTS namespace, which is responsible for controlling the hostname:

The namespace has a relationship with a particular user namespace. The ability for a user to manipulate the hostname is based on whether or not the process has the appropriate capability in that user namespace.

Let’s Get Into It

We can examine how the interaction of namespaces and users work ourselves. To set the hostname in the UTS namespace, you need to have CAP_SYS_ADMIN in its user namespace. We can see this in action here, where an unprivileged process doesn’t have permission to set the hostname:

The reason for this is that the process does not have CAP_SYS_ADMIN. According to /proc/self/status, the effective capability set of this process is empty:

Now, let’s try to set up a user namespace, and see what happens:

Immediately, you’ll notice the command prompt says the current user is root, and that the id command agrees. Can we set the hostname now?

We still cannot set the hostname. This is because the process is still in the initial UTS namespace. Let’s see if we can unshare the UTS namespace, and set the hostname:

This is now successful, and the process is in an isolated UTS namespace with the hostname “foo.” This is because the process now has all of the capabilities that a traditional root user would have, except they are relative to the new user namespace we created:

If we inspect this process from the outside, we can see that the process still runs as the unprivileged user, and the hostname in the original outside namespace hasn’t changed:

From here, we can do all sorts of things, like mount filesystems, create other new namespaces, and in fact, we can create an entire container environment. Notice how no privilege escalation mechanism was used to perform any of these actions. This approach is what some people refer to as “rootless containers.”

Road to Implementation

We began work to enable user namespaces in early 2017. At the time we had a naive model that was simpler. This simplicity was possible because we were running without user namespaces:

This approach mirrored the process layout and boundaries of contemporary container orchestration systems. We had a shared metrics daemon on the machine that reached in and polled metrics from the container. User access was done by exposing an SSH daemon, and automatically doing nsenter on the user’s behalf to drop them into the container. To expose files to the container we would use bind mounts. The same mechanism was used to expose configuration, such as secrets.

This had the benefit that much of our software could be installed in the host namespace, and only manage files in the that namespace. The container runtime management system (Titus) was then responsible for configuring Docker to expose the right files to the container via bind mounts. In addition to that, we could use our standard metrics daemons on the host.

Although this model was easy to reason about and write software for, it had several shortcomings that we addressed by shifting everything to running inside of the container’s unprivileged user namespace. The first shortcoming was that all of the host daemons now needed to be aware of the UID translation, and perform the proper setuid or chown calls to transition across the container boundary. Second, each of these transitions represented a security risk. If the SSH daemon only partially transitioned into the container namespace by changing into the container’s pid namespace, it would leave its /proc accessible. This could then be used by a malicious attacker to escape.

With user namespaces, we can improve our security posture and reduce the complexity of the system by running those daemons in the container’s unprivileged user namespace, which removes the need to cross the namespace boundaries. In turn, this removes the need to correctly implement a cross-namespace transition mechanism thus, reducing the risk of introducing container escapes.

We did this by moving aspects of the container runtime environment into the container. For example, we run an SSH daemon per container and a metrics daemon per container. These run inside of the namespaces of the container, and they have the same capabilities and lifecycle as the workloads in the container. We call this model “System Services” — one can think of it as a primordial version of pods. By the end of 2018, we had moved all of our containers to run in unprivileged user namespaces successfully.

Why is this useful?

This may seem like another level of indirection that just introduces complexity, but instead, it allows us to leverage an extremely useful concept — “unprivileged containers.” In unprivileged containers, the root user starts from a baseline in which they don’t automatically have access to the entire system. This means that DAC, MAC, and seccomp policies are now an extra layer of defense against accessing privileged aspects of the system — not the only layer. As new privileges are added, we do not have to add them to an exclusion list. This allows our users to write software where they can control low-level system details in their own containers, rather than forcing all of the complexity up into the container runtime.

Use Case: FUSE

Netflix internally uses a purpose built FUSE filesystem called MezzFS. The purpose of this filesystem is to provide access to our content for a variety of encoding tools. Most of these encoding tools are designed to interact with the POSIX filesystem API. Our Media Cloud Engineering team wanted to leverage containers for a new platform they were building, called Archer. Archer, in turn, uses MezzFS, which needs FUSE, and at the time, FUSE required that the user have CAP_SYS_ADMIN in the initial user namespace. To accommodate the use case from our internal partner, we had to run them in a dedicated cluster where they could run privileged containers.

In 2017, we worked with our partner, Kinvolk, to have patches added to the Linux kernel that allowed users to safely use FUSE from non-init user namespaces. They were able to successfully upstream these patches, and we’ve been using them in production. From our user’s perspective, we were able to seamlessly move them into an unprivileged environment that was more secure. This simplified operations, as this workload was no longer considered exceptional, and could run alongside every other workload in the general node pool. In turn, this allowed the media encoding team access to a massive amount of compute capacity from the shared clusters, and better reliability due to the homogeneous nature of the deployment.

Use Case: Unintended Privileges

Many CVEs related to granting containers unintended privileges have been released in the past few years:

CVE-2020–15257: Privilege escalation in containerd

CVE-2019–5736: Privilege escalation via overwriting host runc binary

CVE-2018–10892: Access to /proc/acpi, allowing an attacker to modify hardware configuration

There will certainly be more vulnerabilities in the future, as is to be expected in any complex, quickly evolving system. We already use the default settings offered by Docker, such as AppArmor, and seccomp, but by adding user namespaces, we can achieve a superior defense-in-depth security model. These CVEs did not affect our infrastructure because we were using user namespaces for all of our containers. The attenuation of capabilities in the init user namespace performed as intended and stopped these attacks.

The Future

There are still many bits of the Kernel that are receiving support for user namespaces or enhancements making user namespaces easier to use. Much of the work left to do is focused on filesystems and container orchestration systems themselves. Some of these changes are slated for upcoming kernel releases. Work is being done to add unprivileged mounts to overlayfs allowing for nested container builds in a user namespace with layers. Future work is going on to make the Linux kernel VFS layer natively understand ID translation. This will make user namespaces with different ID mappings able to access the same underlying filesystem by shifting UIDs through a bind mount. Our partners at Kinvolk are also working on bringing user namespaces to Kubernetes.

Today, a variety of container runtimes support user namespaces. Docker can set up machine-wide UID mappings with separate user namespaces per container, as outlined in their docs. Any OCI compliant runtime such as Containerd / runc, Podman, and systemd-nspawn support user namespaces. Various container orchestration engines also support user namespaces via their underlying container runtimes, such as Nomad and Docker Swarm.

As part of our move to Kubernetes, Netflix has been working with Kinvolk on getting user namespaces to work under Kubernetes. You can follow this work via the KEP discussion here, and Kinvolk has more information about running user namespaces under Kubernetes on their blog. We look forward to evolving container security together with the Kubernetes community.

Evolving Container Security With Linux User Namespaces was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware

Post Syndicated from Brian Bassett original https://blog.cloudflare.com/getting-to-the-core/

Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware

Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware

Maintaining a server fleet the size of Cloudflare’s is an operational challenge, to say the least. Anything we can do to lower complexity and improve efficiency has effects for our SRE (Site Reliability Engineer) and Data Center teams that can be felt throughout a server’s 4+ year lifespan.

At the Cloudflare Core, we process logs to analyze attacks and compute analytics. In 2020, our Core servers were in need of a refresh, so we decided to redesign the hardware to be more in line with our Gen X edge servers. We designed two major server variants for the core. The first is Core Compute 2020, an AMD-based server for analytics and general-purpose compute paired with solid-state storage drives. The second is Core Storage 2020, an Intel-based server with twelve spinning disks to run database workloads.

Core Compute 2020

Earlier this year, we blogged about our 10th generation edge servers or Gen X and the improvements they delivered to our edge in both performance and security. The new Core Compute 2020 server leverages many of our learnings from the edge server. The Core Compute servers run a variety of workloads including Kubernetes, Kafka, and various smaller services.

Configuration Changes (Kubernetes)

Previous Generation Compute Core Compute 2020
CPU 2 x Intel Xeon Gold 6262 1 x AMD EPYC 7642
Total Core / Thread Count 48C / 96T 48C / 96T
Base / Turbo Frequency 1.9 / 3.6 GHz 2.3 / 3.3 GHz
Memory 8 x 32GB DDR4-2666 8 x 32GB DDR4-2933
Storage 6 x 480GB SATA SSD 2 x 3.84TB NVMe SSD
Network Mellanox CX4 Lx 2 x 25GbE Mellanox CX4 Lx 2 x 25GbE

Configuration Changes (Kafka)

Previous Generation (Kafka) Core Compute 2020
CPU 2 x Intel Xeon Silver 4116 1 x AMD EPYC 7642
Total Core / Thread Count 24C / 48T 48C / 96T
Base / Turbo Frequency 2.1 / 3.0 GHz 2.3 / 3.3 GHz
Memory 6 x 32GB DDR4-2400 8 x 32GB DDR4-2933
Storage 12 x 1.92TB SATA SSD 10 x 3.84TB NVMe SSD
Network Mellanox CX4 Lx 2 x 25GbE Mellanox CX4 Lx 2 x 25GbE

Both previous generation servers were Intel-based platforms, with the Kubernetes server based on Xeon 6262 processors, and the Kafka server based on Xeon 4116 processors. One goal with these refreshed versions was to converge the configurations in order to simplify spare parts and firmware management across the fleet.

As the above tables show, the configurations have been converged with the only difference being the number of NVMe drives installed depending on the workload running on the host. In both cases we moved from a dual-socket configuration to a single-socket configuration, and the number of cores and threads per server either increased or stayed the same. In all cases, the base frequency of those cores was significantly improved. We also moved from SATA SSDs to NVMe SSDs.

Core Compute 2020 Synthetic Benchmarking

The heaviest user of the SSDs was determined to be Kafka. The majority of the time Kafka is sequentially writing 2MB blocks to the disk. We created a simple FIO script with 75% sequential write and 25% sequential read, scaling the block size from a standard page table entry size of 4096KB to Kafka’s write size of 2MB. The results aligned with what we expected from an NVMe-based drive.

Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware
Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware
Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware
Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware

Core Compute 2020 Production Benchmarking

Cloudflare runs many of our Core Compute services in Kubernetes containers, some of which are multi-core. By transitioning to a single socket, problems associated with dual sockets were eliminated, and we are guaranteed to have all cores allocated for any given container on the same socket.

Another heavy workload that is constantly running on Compute hosts is the Cloudflare CSAM Scanning Tool. Our Systems Engineering team isolated a Compute 2020 compute host and the previous generation compute host, had them run just this workload, and measured the time to compare the fuzzy hashes for images to the NCMEC hash lists and verify that they are a “miss”.

Because the CSAM Scanning Tool is very compute intensive we specifically isolated it to take a look at its performance with the new hardware. We’ve spent a great deal of effort on software optimization and improved algorithms for this tool but investing in faster, better hardware is also important.

In these heatmaps, the X axis represents time, and the Y axis represents “buckets” of time taken to verify that it is not a match to one of the NCMEC hash lists. For a given time slice in the heatmap, the red point is the bucket with the most times measured, the yellow point the second most, and the green points the least. The red points on the Compute 2020 graph are all in the 5 to 8 millisecond bucket, while the red points on the previous Gen heatmap are all in the 8 to 13 millisecond bucket, which shows that on average, the Compute 2020 host is verifying hashes significantly faster.

Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware

Core Storage 2020

Another major workload we identified was ClickHouse, which performs analytics over large datasets. The last time we upgraded our servers running ClickHouse was back in 2018.

Configuration Changes

Previous Generation Core Storage 2020
CPU 2 x Intel Xeon E5-2630 v4 1 x Intel Xeon Gold 6210U
Total Core / Thread Count 20C / 40T 20C / 40T
Base / Turbo Frequency 2.2 / 3.1 GHz 2.5 / 3.9 GHz
Memory 8 x 32GB DDR4-2400 8 x 32GB DDR4-2933
Storage 12 x 10TB 7200 RPM 3.5” SATA 12 x 10TB 7200 RPM 3.5” SATA
Network Mellanox CX4 Lx 2 x 25GbE Mellanox CX4 Lx 2 x 25GbE

CPU Changes

For ClickHouse, we use a 1U chassis with 12 x 10TB 3.5” hard drives. At the time we were designing Core Storage 2020 our server vendor did not yet have an AMD version of this chassis, so we remained on Intel. However, we moved Core Storage 2020 to a single 20 core / 40 thread Xeon processor, rather than the previous generation’s dual-socket 10 core / 20 thread processors. By moving to the single-socket Xeon 6210U processor, we were able to keep the same core count, but gained 17% higher base frequency and 26% higher max turbo frequency. Meanwhile, the total CPU thermal design profile (TDP), which is an approximation of the maximum power the CPU can draw, went down from 165W to 150W.

On a dual-socket server, remote memory accesses, which are memory accesses by a process on socket 0 to memory attached to socket 1, incur a latency penalty, as seen in this table:

Previous Generation Core Storage 2020
Memory latency, socket 0 to socket 0 81.3 ns 86.9 ns
Memory latency, socket 0 to socket 1 142.6 ns N/A

An additional advantage of having a CPU with all 20 cores on the same socket is the elimination of these remote memory accesses, which take 76% longer than local memory accesses.

Memory Changes

The memory in the Core Storage 2020 host is rated for operation at 2933 MHz; however, in the 8 x 32GB configuration we need on these hosts, the Intel Xeon 6210U processor clocks them at 2666 MH. Compared to the previous generation, this gives us a 13% boost in memory speed. While we would get a slightly higher clock speed with a balanced, 6 DIMMs configuration, we determined that we are willing to sacrifice the slightly higher clock speed in order to have the additional RAM capacity provided by the 8 x 32GB configuration.

Storage Changes

Data capacity stayed the same, with 12 x 10TB SATA drives in RAID 0 configuration for best  throughput. Unlike the previous generation, the drives in the Core Storage 2020 host are helium filled. Helium produces less drag than air, resulting in potentially lower latency.

Core Storage 2020 Synthetic benchmarking

We performed synthetic four corners benchmarking: IOPS measurements of random reads and writes using 4k block size, and bandwidth measurements of sequential reads and writes using 128k block size. We used the fio tool to see what improvements we would get in a lab environment. The results show a 10% latency improvement and 11% IOPS improvement in random read performance. Random write testing shows 38% lower latency and 60% higher IOPS. Write throughput is improved by 23%, and read throughput is improved by a whopping 90%.

Previous Generation Core Storage 2020 % Improvement
4k Random Reads (IOPS) 3,384 3,758 11.0%
4k Random Read Mean Latency (ms, lower is better) 75.4 67.8 10.1% lower
4k Random Writes (IOPS) 4,009 6,397 59.6%
4k Random Write Mean Latency (ms, lower is better) 63.5 39.7 37.5% lower
128k Sequential Reads (MB/s) 1,155 2,195 90.0%
128k Sequential Writes (MB/s) 1,265 1,558 23.2%

CPU frequencies

The higher base and turbo frequencies of the Core Storage 2020 host’s Xeon 6210U processor allowed that processor to achieve higher average frequencies while running our production ClickHouse workload. A recent snapshot of two production hosts showed the Core Storage 2020 host being able to sustain an average of 31% higher CPU frequency while running ClickHouse.

Previous generation (average core frequency) Core Storage 2020 (average core frequency) % improvement
Mean Core Frequency 2441 MHz 3199 MHz 31%

Core Storage 2020 Production benchmarking

Our ClickHouse database hosts are continually performing merge operations to optimize the database data structures. Each individual merge operation takes just a few seconds on average, but since they’re constantly running, they can consume significant resources on the host. We sampled the average merge time every five minutes over seven days, and then sampled the data to find the average, minimum, and maximum merge times reported by a Compute 2020 host and by a previous generation host. Results are summarized below.

ClickHouse merge operation performance improvement
(time in seconds, lower is better)

Time Previous generation Core Storage 2020 % improvement
Mean time to merge 1.83 1.15 37% lower
Maximum merge time 3.51 2.35 33% lower
Minimum merge time 0.68 0.32 53% lower

Our lab-measured CPU frequency and storage performance improvements on Core Storage 2020 have translated into significantly reduced times to perform this database operation.


With our Core 2020 servers, we were able to realize significant performance improvements, both in synthetic benchmarking outside production and in the production workloads we tested. This will allow Cloudflare to run the same workloads on fewer servers, saving CapEx costs and data center rack space. The similarity of the configuration of the Kubernetes and Kafka hosts should help with fleet management and spare parts management. For our next redesign, we will try to further converge the designs on which we run the major Core workloads to further improve efficiency.

Special thanks to Will Buckner and Chris Snook for their help in the development of these servers, and to Tim Bart for validating CSAM Scanning Tool’s performance on Compute.

Automated Origin CA for Kubernetes

Post Syndicated from Terin Stock original https://blog.cloudflare.com/automated-origin-ca-for-kubernetes/

Automated Origin CA for Kubernetes

Automated Origin CA for Kubernetes

In 2016, we launched the Cloudflare Origin CA, a certificate authority optimized for making it easy to secure the connection between Cloudflare and an origin server. Running our own CA has allowed us to support fast issuance and renewal, simple and effective revocation, and wildcard certificates for our users.

Out of the box, managing TLS certificates and keys within Kubernetes can be challenging and error prone. The secret resources have to be constructed correctly, as components expect secrets with specific fields. Some forms of domain verification require manually rotating secrets to pass. Once you’re successful, don’t forget to renew before the certificate expires!

cert-manager is a project to fill this operational gap, providing Kubernetes resources that manage the lifecycle of a certificate. Today we’re releasing origin-ca-issuer, an extension to cert-manager integrating with Cloudflare Origin CA to easily create and renew certificates for your account’s domains.

Origin CA Integration

Creating an Issuer

After installing cert-manager and origin-ca-issuer, you can create an OriginIssuer resource. This resource creates a binding between cert-manager and the Cloudflare API for an account. Different issuers may be connected to different Cloudflare accounts in the same Kubernetes cluster.

apiVersion: cert-manager.k8s.cloudflare.com/v1
kind: OriginIssuer
  name: prod-issuer
  namespace: default
  signatureType: OriginECC
      name: service-key
      key: key

This creates a new OriginIssuer named “prod-issuer” that issues certificates using ECDSA signatures, and the secret “service-key” in the same namespace is used to authenticate to the Cloudflare API.

Signing an Origin CA Certificate

After creating an OriginIssuer, we can now create a Certificate with cert-manager. This defines the domains, including wildcards, that the certificate should be issued for, how long the certificate should be valid, and when cert-manager should renew the certificate.

apiVersion: cert-manager.io/v1
kind: Certificate
  name: example-com
  namespace: default
  # The secret name where cert-manager
  # should store the signed certificate.
  secretName: example-com-tls
    - example.com
  # Duration of the certificate.
  duration: 168h
  # Renew a day before the certificate expiration.
  renewBefore: 24h
  # Reference the Origin CA Issuer you created above,
  # which must be in the same namespace.
    group: cert-manager.k8s.cloudflare.com
    kind: OriginIssuer
    name: prod-issuer

Once created, cert-manager begins managing the lifecycle of this certificate, including creating the key material, crafting a certificate signature request (CSR), and constructing a certificate request that will be processed by the origin-ca-issuer.

When signed by the Cloudflare API, the certificate will be made available, along with the private key, in the Kubernetes secret specified within the secretName field. You’ll be able to use this certificate on servers proxied behind Cloudflare.

Extra: Ingress Support

If you’re using an Ingress controller, you can use cert-manager’s Ingress support to automatically manage Certificate resources based on your Ingress resource.

apiVersion: networking/v1
kind: Ingress
    cert-manager.io/issuer: prod-issuer
    cert-manager.io/issuer-kind: OriginIssuer
    cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
  name: example
  namespace: default
    - host: example.com
          - backend:
              serviceName: examplesvc
              servicePort: 80
            path: /
    # specifying a host in the TLS section will tell cert-manager 
    # what DNS SANs should be on the created certificate.
    - hosts:
        - example.com
      # cert-manager will create this secret
      secretName: example-tls

Building an External cert-manager Issuer

An external cert-manager issuer is a specialized Kubernetes controller. There’s no direct communication between cert-manager and external issuers at all; this means that you can use any existing tools and best practices for developing controllers to develop an external issuer.

We’ve decided to use the excellent controller-runtime project to build origin-ca-issuer, running two reconciliation controllers.

Automated Origin CA for Kubernetes

OriginIssuer Controller

The OriginIssuer controller watches for creation and modification of OriginIssuer custom resources. The controllers create a Cloudflare API client using the details and credentials referenced. This client API instance will later be used to sign certificates through the API. The controller will periodically retry to create an API client; once it is successful, it updates the OriginIssuer’s status to be ready.

CertificateRequest Controller

The CertificateRequest controller watches for the creation and modification of cert-manager’s CertificateRequest resources. These resources are created automatically by cert-manager as needed during a certificate’s lifecycle.

The controller looks for Certificate Requests that reference a known OriginIssuer, this reference is copied by cert-manager from the origin Certificate resource, and ignores all resources that do not match. The controller then verifies the OriginIssuer is in the ready state, before transforming the certificate request into an API request using the previously created clients.

On a successful response, the signed certificate is added to the certificate request, and which cert-manager will use to create or update the secret resource. On an unsuccessful request, the controller will periodically retry.

Learn More

Up-to-date documentation and complete installation instructions can be found in our GitHub repository. Feedback and contributions are greatly appreciated. If you’re interested in Kubernetes at Cloudflare, including building controllers like these, we’re hiring.

Register for the Modern Applications Online Event

Post Syndicated from Rachel Richardson original https://aws.amazon.com/blogs/compute/register-for-the-modern-applications-online-event/

Earlier this year we hosted the first serverless themed virtual event, the Serverless-First Function. We enjoyed the opportunity to virtually connect with our customers so much that we want to do it again. This time, we’re expanding the scope to feature serverless, containers, and front-end development content. The Modern Applications Online Event is scheduled for November 4-5, 2020.

This free, two-day event addresses how to build and operate modern applications at scale across your organization, enabling you to become more agile and respond to change faster. The event covers topics including serverless application development, containers best practices, front-end web development and more. If you missed the containers or serverless virtual events earlier this year, this is great opportunity to watch the content and interact directly with expert moderators. The full agenda is listed below.

Register now

Organizational Level Operations

Wednesday, November 4, 2020, 9:00 AM – 1:00 PM PT

Move fast and ship things: Using serverless to increase speed and agility within your organization
In this session, Adrian Cockcroft demonstrates how you can use serverless to build modern applications faster than ever. Cockcroft uses real-life examples and customer stories to debunk common misconceptions about serverless.

Eliminating busywork at the organizational level: Tips for using serverless to its fullest potential 
In this session, David Yanacek discusses key ways to unlock the full benefits of serverless, including building services around APIs and using service-oriented architectures built on serverless systems to remove the roadblocks to continuous innovation.

Faster Mobile and Web App Development with AWS Amplify
In this session, Brice Pellé, introduces AWS Amplify, a set of tools and services that enables mobile and front-end web developers to build full stack serverless applications faster on AWS. Learn how to accelerate development with AWS Amplify’s use-case centric open-source libraries and CLI, and its fully managed web hosting service with built-in CI/CD.

Built Serverless-First: How Workgrid Software transformed from a Liberty Mutual project to its own global startup
Connected through a central IT team, Liberty Mutual has embraced serverless since AWS Lambda’s inception in 2014. In this session, Gillian McCann discusses Workgrid’s serverless journey—from internal microservices project within Liberty Mutual to independent business entity, all built serverless-first. Introduction by AWS Principal Serverless SA, Sam Dengler.

Market insights: A conversation with Forrester analyst Jeffrey Hammond & Director of Product for Lambda Ajay Nair
In this session, guest speaker Jeffrey Hammond and Director of Product for AWS Lambda, Ajay Nair, discuss the state of serverless, Lambda-based architectural approaches, Functions-as-a-Service platforms, and more. You’ll learn about the high-level and enduring enterprise patterns and advancements that analysts see driving the market today and determining the market in the future.

AWS Fargate Platform Version 1.4
In this session we will go through a brief introduction of AWS Fargate, what it is, its relation to EKS and ECS and the problems it addresses for customers. We will later introduce the concept of Fargate “platform versions” and we will then dive deeper into the new features that the new platform version 1.4 enables.

Persistent Storage on Containers
Containerizing applications that require data persistence or shared storage is often challenging since containers are ephemeral in nature, are scaled in and out dynamically, and typically clear any saved state when terminated. In this session you will learn about Amazon Elastic File System (EFS), a fully managed, elastic, highly-available, scalable, secure, high-performance, cloud native, shared file system that enables data to be persisted separately from compute for your containerized applications.

Security Best Practices on Amazon ECR
In this session, we will cover best practices with securing your container images using ECR. Learn how user access controls, image assurance, and image scanning contribute to securing your images.

Application Level Design

Thursday, November 5, 2020, 9:00 AM – 1:00 PM PT

Building a Live Streaming Platform with Amplify Video
In this session, learn how to build a live-streaming platform using Amplify Video and the platform powering it, AWS Elemental Live. Amplify video is an open source plugin for the Amplify CLI that makes it easy to incorporate video streaming into your mobile and web applications powered by AWS Amplify.

Building Serverless Web Applications
In this session, follow along as Ben Smith shows you how to build and deploy a completely serverless web application from scratch. The application will span from a mobile friendly front end to complex business logic on the back end.

Automating serverless application development workflows
In this talk, Eric Johnson breaks down how to think about CI/CD when building serverless applications with AWS Lambda and Amazon API Gateway. This session will cover using technologies like AWS SAM to build CI/CD pipelines for serverless application back ends.

Observability for your serverless applications
In this session, Julian Wood walks you through how to add monitoring, logging, and distributed tracing to your serverless applications. Join us to learn how to track platform and business metrics, visualize the performance and operations of your application, and understand which services should be optimized to improve your customer’s experience.

Happy Building with AWS Copilot
The hard part’s done. You and your team have spent weeks pouring over pull requests, building micro-services and containerizing them. Congrats! But what do you do now? How do you get those services on AWS? Copilot is a new command line tool that makes building, developing and operating containerized apps on AWS a breeze. In this session, we’ll talk about how Copilot helps you and your team set up modern applications that follow AWS best practices

CDK for Kubernetes
The CDK for Kubernetes (cdk8s) is a new open-source software development framework for defining Kubernetes applications and resources using familiar programming languages. Applications running on Kubernetes are composed of dozens of resources maintained through carefully maintained YAML files. As applications evolve and teams grow, these YAML files become harder and harder to manage. It’s also really hard to reuse and create abstractions through config files — copying & pasting from previous projects is not the solution! In this webinar, the creators of cdk8s show you how to define your first cdk8s application, define reusable components called “constructs” and generally say goodbye (and thank you very much) to writing in YAML.

Machine Learning on Amazon EKS
Amazon EKS has quickly emerged as a leading choice for machine learning workloads. In this session, we’ll walk through some of the recent ML related enhancements the Kubernetes team at AWS has released. We will then dive deep with walkthroughs of how to optimize your machine learning workloads on Amazon EKS, including demos of the latest features we’ve contributed to popular open source projects like Spark and Kubeflow

Deep Dive on Amazon ECS Capacity Providers
In this talk, we’ll dive into the different ways that ECS Capacity Providers can enable teams to focus more on their core business, and less on the infrastructure behind the scenes. We’ll look at the benefits and discuss scenarios where Capacity Providers can help solve the problems that customers face when using container orchestration. Lastly, we’ll review what features have been released with Capacity Providers, as well as look ahead at what’s to come.

Register now

Optimally scaling Kafka consumer applications

Post Syndicated from Grab Tech original https://engineering.grab.com/optimally-scaling-kafka-consumer-applications

Earlier this year, we took you on a journey on how we built and deployed our event sourcing and stream processing framework at Grab. We’re happy to share that we’re able to reliably maintain our uptime and continue to service close to 400 billion events a week. We haven’t stopped there though. To ensure that we can scale our framework as the Grab business continuously grows, we have spent efforts optimizing our infrastructure.

In this article, we will dive deeper into our Kubernetes infrastructure setup for our stream processing framework. We will cover why and how we focus on optimal scalability and availability of our infrastructure.

Quick Architecture Recap

Coban Platform Architecture

The Coban platform provides lightweight Golang plugin architecture-based data processing pipelines running in Kubernetes. These are essentially Kafka consumer pods that consume data, process it, and then materialize the results into various sinks (RDMS, other Kafka topics).

Anatomy of a Processing Pod

Anatomy of a Processing Pod

Each stream processing pod (the smallest unit of a pipeline’s deployment) has three top level components:

  • Trigger: An interface that connects directly to the source of the data and converts it into an event channel.
  • Runtime: This is the app’s entry point and the orchestrator of the pod. It manages the worker pools, triggers, event channels, and lifecycle events.
  • Pipeline plugin: This is provided by the user, and conforms to a contract that the platform team publishes. It contains the domain logic for the pipeline and houses the pipeline orchestration defined by a user based on our Stream Processing Framework.

Optimal Scaling

We initially architected our Kubernetes setup around horizontal pod autoscaling (HPA), which scales the number of pods per deployment based on CPU and memory usage. HPA keeps CPU and memory per pod specified in the deployment manifest and scales horizontally as the load changes.

These were the areas of application wastage we observed on our platform:

  • As Grab’s traffic is uneven, we’d always have to provision for peak traffic. As users would not (or could not) always account for ramps, they would be fairly liberal with setting limit values (CPU and memory), leading to resource wastage.
  • Pods often had uneven traffic distribution despite fairly even partition load distribution in Kafka. The Stream Processing Framework(SPF) is essentially Kafka consumers consuming from Kafka topics, hence the number of pods scaling in and out resulted in unequal partition load per pod.

Vertically Scaling with Fixed Number of Pods

We initially kept the number of pods for a pipeline equal to the number of partitions in the topic the pipeline consumes from. This ensured even distribution of partitions to each pod providing balanced consumption. In order to abstract this from the end user, we automated the application deployment process to directly call the Kafka API to fetch the number of partitions during runtime.

After achieving a fixed number of pods for the pipeline, we wanted to move away from HPA. We wanted our pods to scale up and down as the load increases or decreases without any manual intervention. Vertical pod autoscaling (VPA) solves this problem as it relieves us from any manual operation for setting up resources for our deployment.

We just deploy the application and let VPA handle the resources required for its operation. It’s known to not be very susceptible to quick load changes as it trains its model to monitor the deployment’s load trend over a period of time before recommending an optimal resource. This process ensures the optimal resource allocation for our pipelines considering the historic trends on throughput.

We saw a ~45% reduction in our total resource usage vs resource requested after moving to VPA with a fixed number of pods from HPA.

Anatomy of a Processing Pod

Managing Availability

We broadly classify our workloads as latency sensitive (critical) and latency tolerant (non-critical). As a result, we could optimize scheduling and cost efficiency using priority classes and overprovisioning on heterogeneous node types on AWS.

Kubernetes Priority Classes

The main cost of running EKS in AWS is attributed to the EC2 machines that form the worker nodes for the Kubernetes cluster. Running On-Demand brings all the guarantees of instance availability but it is definitely very expensive. Hence, our first action to drive cost optimisation was to include Spot instances in our worker node group.

With the uncertainty of losing a spot instance, we started assigning priority to our various applications. We then let the user choose the priority of their pipeline depending on their use case. Different priorities would result in different node affinity to different kinds of instance groups (On-Demand/Spot). For example, Critical pipelines (latency sensitive) run on On-Demand worker node groups and Non-critical pipelines (latency tolerant) on Spot instance worker node groups.

We use priority class as a method of preemption, as well as a node affinity that chooses a certain priority pipeline for the node group to deploy to.


With spot instances running we realised a need to make our cluster quickly respond to failures. We wanted to achieve quick rescheduling of evicted pods, hence we added overprovisioning to our cluster. This means we keep some noop pods occupying free space running in our worker node groups for the quick scheduling of evicted or deploying pods.

The overprovisioned pods are the lowest priority pods, thus can be preempted by any pod waiting in the queue for scheduling. We used cluster proportional autoscaler to decide the right number of these overprovisioned pods, which scales up and down proportionally to cluster size (i.e number of nodes and CPU in worker node group). This relieves us from tuning the number of these noop pods as the cluster scales up or down over the period keeping the free space proportional to current cluster capacity.

Lastly, overprovisioning also helped improve the deployment time because there is no  dependency on the time required for Auto Scaling Groups (ASG) to add a new node to the cluster every time we want to deploy a new application.

Future Improvements

Evolution is an ongoing process. In the next few months, we plan to work on custom resources for combining VPA and fixed deployment size. Our current architecture setup works fine for now, but we would like to create a more tuneable in-house CRD(Custom Resource Definition) for VPA that incorporates rightsizing our Kubernetes deployment horizontally.

Authored By Shubham Badkur on behalf of the Coban team at Grab – Ryan Ooi, Karan Kamath, Hui Yang, Yuguang Xiao, Jump Char, Jason Cusick, Shrinand Thakkar, Dean Barlan, Shivam Dixit, Andy Nguyen, and Ravi Tandon.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Testing cloud apps with GitHub Actions and cloud-native open source tools

Post Syndicated from Sarah Khalife original https://github.blog/2020-10-09-devops-cloud-testing/

See this post in action during GitHub Demo Days on October 16.

What makes a project successful? For developers building cloud-native applications, successful projects thrive on transparent, consistent, and rigorous collaboration. That collaboration is one of the reasons that many open source projects, like Docker containers and Kubernetes, grow to become standards for how we build, deliver, and operate software. Our Open Source Guides and Introduction to innersourcing are great first steps to setting up and encouraging these best practices in your own projects.

However, a common challenge that application developers face is manually testing against inconsistent environments. Accurately testing Kubernetes applications can differ from one developer’s environment to another, and implementing a rigorous and consistent environment for end-to-end testing isn’t easy. It can also be very time consuming to spin up and down Kubernetes clusters. The inconsistencies between environments and the time required to spin up new Kubernetes clusters can negatively impact the speed and quality of cloud-native applications.

Building a transparent CI process

On GitHub, integration and testing becomes a little easier by combining GitHub Actions with open source tools. You can treat Actions as the native continuous integration and continuous delivery (CI/CD) tool for your project, and customize your Actions workflow to include automation and validation as next steps.

Since Actions can be triggered based on nearly any GitHub event, it’s also possible to build in accountability for updating tests and fixing bugs. For example, when a developer creates a pull request, Actions status checks can automatically block the merge if the test fails.

Here are a few more examples:

Branch protection rules in the repository help enforce certain workflows, such as requiring more than one pull request review or requiring certain status checks to pass before allowing a pull request to merge.

GitHub Actions are natively configured to act as status checks when they’re set up to trigger `on: [pull_request]`.

Continuous integration (CI) is extremely valuable as it allows you to run tests before each pull request is merged into production code. In turn, this will reduce the number of bugs that are pushed into production and increases confidence that newly introduced changes will not break existing functionality.

But transparency remains key: Requiring CI status checks on protected branches provides a clearly-defined, transparent way to let code reviewers know if the commits meet the conditions set for the repository—right in the pull request view.

Using community-powered workflows

Now that we’ve thought through the simple CI policies, automated workflows are next. Think of an Actions workflow as a set of “plug and play” open sourced, automated steps contributed by the community. You can use them as they are, or customize and make them your own. Once you’ve found the right one, open sourced Actions can be plugged into your workflow with the`- uses: repo/action-name` field.

You might ask, “So how do I find available Actions that suit my needs?”

The GitHub Marketplace!

As you’re building automation and CI pipelines, take advantage of Marketplace to find pre-built Actions provided by the community. Examples of pre-built Actions span from a Docker publish and the kubectl CLI installation to container scans and cloud deployments. When it comes to cloud-native Actions, the list keeps growing as container-based development continues to expand.

Testing with kind

Testing is a critical part of any CI/CD pipeline, but running tests in Kubernetes can absorb the extra time that automation saves. Enter kind. kind stands for “Kubernetes in Docker.” It’s an open source project from the Kubernetes special interest group (SIGs) community, and a tool for running local Kubernetes clusters using Docker container “nodes.” Creating a kind cluster is a simple way to run Kubernetes cluster and application testing—without having to spin up a complete Kubernetes environment.

As the number of Kubernetes users pushing critical applications to production grows, so does the need for a repeatable, reliable, and rigorous testing process. This can be accomplished by combining the creation of a homogenous Kubernetes testing environment with kind, the community-powered Marketplace, and the native and transparent Actions CI process.

Bringing it all together with kind and Actions

Come see kind and Actions at work during our next GitHub Demo Day live stream on October 16, 2020 at 11am PT. I’ll walk you through how to easily set up automated and consistent tests per pull request, including how to use kind with Actions to automatically run end-to-end tests across a common Kubernetes environment.

Secondary DNS – Deep Dive

Post Syndicated from Alex Fattouche original https://blog.cloudflare.com/secondary-dns-deep-dive/

How Does Secondary DNS Work?

Secondary DNS - Deep Dive

If you already understand how Secondary DNS works, please feel free to skip this section. It does not provide any Cloudflare-specific information.

Secondary DNS has many use cases across the Internet; however, traditionally, it was used as a synchronized backup for when the primary DNS server was unable to respond to queries. A more modern approach involves focusing on redundancy across many different nameservers, which in many cases broadcast the same anycasted IP address.

Secondary DNS involves the unidirectional transfer of DNS zones from the primary to the Secondary DNS server(s). One primary can have any number of Secondary DNS servers that it must communicate with in order to keep track of any zone updates. A zone update is considered a change in the contents of a  zone, which ultimately leads to a Start of Authority (SOA) serial number increase. The zone’s SOA serial is one of the key elements of Secondary DNS; it is how primary and secondary servers synchronize zones. Below is an example of what an SOA record might look like during a dig query.

example.com	3600	IN	SOA	ashley.ns.cloudflare.com. dns.cloudflare.com. 
2034097105  // Serial
10000 // Refresh
2400 // Retry
604800 // Expire
3600 // Minimum TTL

Each of the numbers is used in the following way:

  1. Serial – Used to keep track of the status of the zone, must be incremented at every change.
  2. Refresh – The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change.
  3. Retry – The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change, after previously failing to contact the primary.
  4. Expire – The maximum number of seconds that a Secondary DNS server can serve stale information, in the event the primary cannot be contacted.
  5. Minimum TTL – Per RFC 2308, the number of seconds that a DNS negative response should be cached for.

Using the above information, the Secondary DNS server stores an SOA record for each of the zones it is tracking. When the serial increases, it knows that the zone must have changed, and that a zone transfer must be initiated.  

Serial Tracking

Serial increases can be detected in the following ways:

  1. The fastest way for the Secondary DNS server to keep track of a serial change is to have the primary server NOTIFY them any time a zone has changed using the DNS protocol as specified in RFC 1996, Secondary DNS servers will instantly be able to initiate a zone transfer.
  2. Another way is for the Secondary DNS server to simply poll the primary every “Refresh” seconds. This isn’t as fast as the NOTIFY approach, but it is a good fallback in case the notifies have failed.

One of the issues with the basic NOTIFY protocol is that anyone on the Internet could potentially notify the Secondary DNS server of a zone update. If an initial SOA query is not performed by the Secondary DNS server before initiating a zone transfer, this is an easy way to perform an amplification attack. There is two common ways to prevent anyone on the Internet from being able to NOTIFY Secondary DNS servers:

  1. Using transaction signatures (TSIG) as per RFC 2845. These are to be placed as the last record in the extra records section of the DNS message. Usually the number of extra records (or ARCOUNT) should be no more than two in this case.
  2. Using IP based access control lists (ACL). This increases security but also prevents flexibility in server location and IP address allocation.

Generally NOTIFY messages are sent over UDP, however TCP can be used in the event the primary server has reason to believe that TCP is necessary (i.e. firewall issues).

Zone Transfers

In addition to serial tracking, it is important to ensure that a standard protocol is used between primary and Secondary DNS server(s), to efficiently transfer the zone. DNS zone transfer protocols do not attempt to solve the confidentiality, authentication and integrity triad (CIA); however, the use of TSIG on top of the basic zone transfer protocols can provide integrity and authentication. As a result of DNS being a public protocol, confidentiality during the zone transfer process is generally not a concern.

Authoritative Zone Transfer (AXFR)

AXFR is the original zone transfer protocol that was specified in RFC 1034 and RFC 1035 and later further explained in RFC 5936. AXFR is done over a TCP connection because a reliable protocol is needed to ensure packets are not lost during the transfer. Using this protocol, the primary DNS server will transfer all of the zone contents to the Secondary DNS server, in one connection, regardless of the serial number. AXFR is recommended to be used for the first zone transfer, when none of the records are propagated, and IXFR is recommended after that.

Incremental Zone Transfer (IXFR)

IXFR is the more sophisticated zone transfer protocol that was specified in RFC 1995. Unlike the AXFR protocol, during an IXFR, the primary server will only send the secondary server the records that have changed since its current version of the zone (based on the serial number). This means that when a Secondary DNS server wants to initiate an IXFR, it sends its current serial number to the primary DNS server. The primary DNS server will then format its response based on previous versions of changes made to the zone. IXFR messages must obey the following pattern:

  1. Current latest SOA
  2. Secondary server current SOA
  3. DNS record deletions
  4. Secondary server current SOA + changes
  5. DNS record additions
  6. Current latest SOA

Steps 2,3,4,5,6 can be repeated any number of times, as each of those represents one change set of deletions and additions, ultimately leading to a new serial.

IXFR can be done over UDP or TCP, but again TCP is generally recommended to avoid packet loss.

How Does Secondary DNS Work at Cloudflare?

The DNS team loves microservice architecture! When we initially implemented Secondary DNS at Cloudflare, it was done using Mesos Marathon. This allowed us to separate each of our services into several different marathon apps, individually scaling apps as needed. All of these services live in our core data centers. The following services were created:

  1. Zone Transferer – responsible for attempting IXFR, followed by AXFR if IXFR fails.
  2. Zone Transfer Scheduler – responsible for periodically checking zone SOA serials for changes.
  3. Rest API – responsible for registering new zones and primary nameservers.

In addition to the marathon apps, we also had an app external to the cluster:

  1. Notify Listener – responsible for listening for notifies from primary servers and telling the Zone Transferer to initiate an AXFR/IXFR.

Each of these microservices communicates with the others through Kafka.

Secondary DNS - Deep Dive
Figure 1: Secondary DNS Microservice Architecture‌‌

Once the zone transferer completes the AXFR/IXFR, it then passes the zone through to our zone builder, and finally gets pushed out to our edge at each of our 200 locations.

Although this current architecture worked great in the beginning, it left us open to many vulnerabilities and scalability issues down the road. As our Secondary DNS product became more popular, it was important that we proactively scaled and reduced the technical debt as much as possible. As with many companies in the industry, Cloudflare has recently migrated all of our core data center services to Kubernetes, moving away from individually managed apps and Marathon clusters.

What this meant for Secondary DNS is that all of our Marathon-based services, as well as our NOTIFY Listener, had to be migrated to Kubernetes. Although this long migration ended up paying off, many difficult challenges arose along the way that required us to come up with unique solutions in order to have a seamless, zero downtime migration.

Challenges When Migrating to Kubernetes

Although the entire DNS team agreed that kubernetes was the way forward for Secondary DNS, it also introduced several challenges. These challenges arose from a need to properly scale up across many distributed locations while also protecting each of our individual data centers. Since our core does not rely on anycast to automatically distribute requests, as we introduce more customers, it opens us up to denial-of-service attacks.

The two main issues we ran into during the migration were:

  1. How do we create a distributed and reliable system that makes use of kubernetes principles while also making sure our customers know which IPs we will be communicating from?
  2. When opening up a public-facing UDP socket to the Internet, how do we protect ourselves while also preventing unnecessary spam towards primary nameservers?.

Issue 1:

As was previously mentioned, one form of protection in the Secondary DNS protocol is to only allow certain IPs to initiate zone transfers. There is a fine line between primary servers allow listing too many IPs and them having to frequently update their IP ACLs. We considered several solutions:

  1. Open source k8s controllers
  2. Altering Network Address Translation(NAT) entries
  3. Do not use k8s for zone transfers
  4. Allowlist all Cloudflare IPs and dynamically update
  5. Proxy egress traffic

Ultimately we decided to proxy our egress traffic from k8s, to the DNS primary servers, using static proxy addresses. Shadowsocks-libev was chosen as the SOCKS5 implementation because it is fast, secure and known to scale. In addition, it can handle both UDP/TCP and IPv4/IPv6.

Secondary DNS - Deep Dive
Figure 2: Shadowsocks proxy Setup

The partnership of k8s and Shadowsocks combined with a large enough IP range brings many benefits:

  1. Horizontal scaling
  2. Efficient load balancing
  3. Primary server ACLs only need to be updated once
  4. It allows us to make use of kubernetes for both the Zone Transferer and the Local ShadowSocks Proxy.
  5. Shadowsocks proxy can be reused by many different Cloudflare services.

Issue 2:

The Notify Listener requires listening on static IPs for NOTIFY Messages coming from primary DNS servers. This is mostly a solved problem through the use of k8s services of type loadbalancer, however exposing this service directly to the Internet makes us uneasy because of its susceptibility to attacks. Fortunately DDoS protection is one of Cloudflare’s strengths, which lead us to the likely solution of dogfooding one of our own products, Spectrum.

Spectrum provides the following features to our service:

  1. Reverse proxy TCP/UDP traffic
  2. Filter out Malicious traffic
  3. Optimal routing from edge to core data centers
  4. Dual Stack technology
Secondary DNS - Deep Dive
Figure 3: Spectrum interaction with Notify Listener

Figure 3 shows two interesting attributes of the system:

  1. Spectrum <-> k8s IPv4 only:
  2. This is because our custom k8s load balancer currently only supports IPv4; however, Spectrum has no issue terminating the IPv6 connection and establishing a new IPv4 connection.
  3. Spectrum <-> k8s routing decisions based of L4 protocol:
  4. This is because k8s only supports one of TCP/UDP/SCTP per service of type load balancer. Once again, spectrum has no issues proxying this correctly.

One of the problems with using a L4 proxy in between services is that source IP addresses get changed to the source IP address of the proxy (Spectrum in this case). Not knowing the source IP address means we have no idea who sent the NOTIFY message, opening us up to attack vectors. Fortunately, Spectrum’s proxy protocol feature is capable of adding custom headers to TCP/UDP packets which contain source IP/Port information.

As we are using miekg/dns for our Notify Listener, adding proxy headers to the DNS NOTIFY messages would cause failures in validation at the DNS server level. Alternatively, we were able to implement custom read and write decorators that do the following:

  1. Reader: Extract source address information on inbound NOTIFY messages. Place extracted information into new DNS records located in the additional section of the message.
  2. Writer: Remove additional records from the DNS message on outbound NOTIFY replies. Generate a new reply using proxy protocol headers.

There is no way to spoof these records, because the server only permits two extra records, one of which is the optional TSIG. Any other records will be overwritten.

Secondary DNS - Deep Dive
Figure 4: Proxying Records Between Notifier and Spectrum‌‌

This custom decorator approach abstracts the proxying away from the Notify Listener through the use of the DNS protocol.  

Although knowing the source IP will block a significant amount of bad traffic, since NOTIFY messages can use both UDP and TCP, it is prone to IP spoofing. To ensure that the primary servers do not get spammed, we have made the following additions to the Zone Transferer:

  1. Always ensure that the SOA has actually been updated before initiating a zone transfer.
  2. Only allow at most one working transfer and one scheduled transfer per zone.

Additional Technical Challenges

Zone Transferer Scheduling

As shown in figure 1, there are several ways of sending Kafka messages to the Zone Transferer in order to initiate a zone transfer. There is no benefit in having a large backlog of zone transfers for the same zone. Once a zone has been transferred, assuming no more changes, it does not need to be transferred again. This means that we should only have at most one transfer ongoing, and one scheduled transfer at the same time, for any zone.

If we want to limit our number of scheduled messages to one per zone, this involves ignoring Kafka messages that get sent to the Zone Transferer. This is not as simple as ignoring specific messages in any random order. One of the benefits of Kafka is that it holds on to messages until the user actually decides to acknowledge them, by committing that messages offset. Since Kafka is just a queue of messages, it has no concept of order other than first in first out (FIFO). If a user is capable of reading from the Kafka topic concurrently, it is entirely possible that a message in the middle of the queue be committed before a message at the end of the queue.

Most of the time this isn’t an issue, because we know that one of the concurrent readers has read the message from the end of the queue and is processing it. There is one Kubernetes-related catch to this issue, though: pods are ephemeral. The kube master doesn’t care what your concurrent reader is doing, it will kill the pod and it’s up to your application to handle it.

Consider the following problem:

Secondary DNS - Deep Dive
Figure 5: Kafka Partition‌‌
  1. Read offset 1. Start transferring zone 1.
  2. Read offset 2. Start transferring zone 2.
  3. Zone 2 transfer finishes. Commit offset 2, essentially also marking offset 1.
  4. Restart pod.
  5. Read offset 3 Start transferring zone 3.

If these events happen, zone 1 will never be transferred. It is important that zones stay up to date with the primary servers, otherwise stale data will be served from the Secondary DNS server. The solution to this problem involves the use of a list to track which messages have been read and completely processed. In this case, when a zone transfer has finished, it does not necessarily mean that the kafka message should be immediately committed. The solution is as follows:

  1. Keep a list of Kafka messages, sorted based on offset.
  2. If finished transfer, remove from list:
  3. If the message is the oldest in the list, commit the messages offset.
Secondary DNS - Deep Dive
Figure 6: Kafka Algorithm to Solve Message Loss

This solution is essentially soft committing Kafka messages, until we can confidently say that all other messages have been acknowledged. It’s important to note that this only truly works in a distributed manner if the Kafka messages are keyed by zone id, this will ensure the same zone will always be processed by the same Kafka consumer.

Life of a Secondary DNS Request

Although Cloudflare has a large global network, as shown above, the zone transferring process does not take place at each of the edge datacenter locations (which would surely overwhelm many primary servers), but rather in our core data centers. In this case, how do we propagate to our edge in seconds? After transferring the zone, there are a couple more steps that need to be taken before the change can be seen at the edge.

  1. Zone Builder – This interacts with the Zone Transferer to build the zone according to what Cloudflare edge understands. This then writes to Quicksilver, our super fast, distributed KV store.
  2. Authoritative Server – This reads from Quicksilver and serves the built zone.
Secondary DNS - Deep Dive
Figure 7: End to End Secondary DNS‌‌

What About Performance?

At the time of writing this post, according to dnsperf.com, Cloudflare leads in global performance for both Authoritative and Resolver DNS. Here, Secondary DNS falls under the authoritative DNS category here. Let’s break down the performance of each of the different parts of the Secondary DNS pipeline, from the primary server updating its records, to them being present at the Cloudflare edge.

  1. Primary Server to Notify Listener – Our most accurate measurement is only precise to the second, but we know UDP/TCP communication is likely much faster than that.
  2. NOTIFY to Zone Transferer – This is negligible
  3. Zone Transferer to Primary Server – 99% of the time we see ~800ms as the average latency for a zone transfer.
Secondary DNS - Deep Dive
Figure 8: Zone XFR latency

4. Zone Transferer to Zone Builder – 99% of the time we see ~10ms to build a zone.

Secondary DNS - Deep Dive
Figure 9: Zone Build time

5. Zone Builder to Quicksilver edge: 95% of the time we see less than 1s propagation.

Secondary DNS - Deep Dive
Figure 10: Quicksilver propagation time

End to End latency: less than 5 seconds on average. Although we have several external probes running around the world to test propagation latencies, they lack precision due to their sleep intervals, location, provider and number of zones that need to run. The actual propagation latency is likely much lower than what is shown in figure 10. Each of the different colored dots is a separate data center location around the world.

Secondary DNS - Deep Dive
Figure 11: End to End Latency

An additional test was performed manually to get a real world estimate, the test had the following attributes:

Primary server: NS1
Number of records changed: 1
Start test timer event: Change record on NS1
Stop test timer event: Observe record change at Cloudflare edge using dig
Recorded timer value: 6 seconds


Cloudflare serves 15.8 trillion DNS queries per month, operating within 100ms of 99% of the Internet-connected population. The goal of Cloudflare operated Secondary DNS is to allow our customers with custom DNS solutions, be it on-premise or some other DNS provider, to be able to take advantage of Cloudflare’s DNS performance and more recently, through Secondary Override, our proxying and security capabilities too. Secondary DNS is currently available on the Enterprise plan, if you’d like to take advantage of it, please let your account team know. For additional documentation on Secondary DNS, please refer to our support article.

TensorFlow Serving on Kubernetes with Amazon EC2 Spot Instances

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/tensorflow-serving-on-kubernetes-spot-instances/

This post is contributed by Kinnar Sen – Sr. Specialist Solutions Architect, EC2 Spot

TensorFlow (TF) is a popular choice for machine learning research and application development. It’s a machine learning (ML) platform, which is used to build (train) and deploy (serve) machine learning models. TF Serving is a part of TF framework and is used for deploying ML models in production environments. TF Serving can be containerized using Docker and deployed in a cluster with Kubernetes. It is easy to run production grade workloads on Kubernetes using Amazon Elastic Kubernetes Service (Amazon EKS), a managed service for creating and managing Kubernetes clusters. To cost optimize the TF serving workloads, you can use Amazon EC2 Spot Instances. Spot Instances are spare EC2 capacity available at up to a 90% discount compared to On-Demand Instance prices.

In this post I will illustrate deployment of TensorFlow Serving using Kubernetes via Amazon EKS and Spot Instances to build a scalable, resilient, and cost optimized machine learning inference service.


About TensorFlow Serving (TF Serving)

TensorFlow Serving is the recommended way to serve TensorFlow models. A flexible and a high-performance system for serving models TF Serving enables users to quickly deploy models to production environments. It provides out-of-box integration with TF models and can be extended to serve other kinds of models and data. TF Serving deploys a model server with gRPC/REST endpoints and can be used to serve multiple models (or versions). There are two ways that the requests can be served, batching individual requests or one-by-one. Batching is often used to unlock the high throughput of hardware accelerators (if used for inference) for offline high volume inference jobs.

Amazon EC2 Spot Instances

Spot Instances are spare Amazon EC2 capacity that enables customers to save up to 90% over On-Demand Instance prices. The price of Spot Instances is determined by long-term trends in supply and demand of spare capacity pools. Capacity pools can be defined as a group of EC2 instances belonging to particular instance family, size, and Availability Zone (AZ). If EC2 needs capacity back for On-Demand usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance. This can be automated via the application and/or infrastructure deployments. Spot Instances are ideal for stateless, fault tolerant, loosely coupled and flexible workloads that can handle interruptions.

TensorFlow Serving (TF Serving) and Kubernetes

Each pod in a Kubernetes cluster runs a TF Docker image with TF Serving-based server and a model. The model contains the architecture of TensorFlow Graph, model weights and assets. This is a deployment setup with configurable number of replicas. The replicas are exposed externally by a service and an External Load Balancer that helps distribute the requests to the service endpoints. To keep up with the demands of service, Kubernetes can help scale the number of replicated pods using Kubernetes Replication Controller.


There are a couple of goals that we want to achieve through this solution.

  • Cost optimization – By using EC2 Spot Instances
  • High throughput – By using Application Load Balancer (ALB) created by Ingress Controller
  • Resilience – Ensuring high availability by replenishing nodes and gracefully handling the Spot interruptions
  • Elasticity – By using Horizontal Pod Autoscaler, Cluster Autoscaler, and EC2 Auto Scaling

This can be achieved by using the following components.

Component Role Details Deployment Method
Cluster Autoscaler Scales EC2 instances automatically according to pods running in the cluster Open source A deployment on On-Demand Instances
EC2 Auto Scaling group Provisions and maintains EC2 instance capacity AWS AWS CloudFormation via eksctl
AWS Node Termination Handler Detects EC2 Spot interruptions and automatically drains nodes Open source A DaemonSet on Spot and On-Demand Instances
AWS ALB Ingress Controller Provisions and maintains Application Load Balancer Open source A deployment on On-Demand Instances

You can find more details about each component in this AWS blog. Let’s go through the steps that allow the deployment to be elastic.

  1. HTTP requests flows in through the ALB and Ingress object.
  2. Horizontal Pod Autoscaler (HPA) monitors the metrics (CPU / RAM) and once the threshold is breached a Replica (pod) is launched.
  3. If there are sufficient cluster resources, the pod starts running, else it goes into pending state.
  4. If one or more pods are in pending state, the Cluster Autoscaler (CA) triggers a scale up request to Auto Scaling group.
    1. If HPA tries to schedule pods more than the current size of what the cluster can support, CA can add capacity to support that.
  5. Auto Scaling group provision a new node and the application scales up
  6. A scale down happens in the reverse fashion when requests start tapering down.

AWS ALB Ingress controller and ALB

We will be using an ALB along with an Ingress resource instead of the default External Load Balancer created by the TF Serving deployment. The open source AWS ALB Ingress controller triggers the creation of an ALB and the necessary supporting AWS resources whenever a Kubernetes user declares an Ingress resource in the cluster. The Ingress resource uses the ALB to route HTTP(S) traffic to different endpoints within the cluster. ALB is ideal for advanced load balancing of HTTP and HTTPS traffic. ALB provides advanced request routing targeted at delivery of modern application architectures, including microservices and container-based applications. This allows the deployment to maintain a high throughput and improve load balancing.

Spot Instance interruptions

To gracefully handle interruptions, we will use the AWS node termination handler. This handler runs a DaemonSet (one pod per node) on each host to perform monitoring and react accordingly. When it receives the Spot Instance 2-minute interruption notification, it uses the Kubernetes API to cordon the node. This is done by tainting it to ensure that no new pods are scheduled there, then it drains it, removing any existing pods from the ALB.

One of the best practices for using Spot is diversification where instances are chosen from across different instance types, sizes, and Availability Zone. The capacity-optimized allocation strategy for EC2 Auto Scaling provisions Spot Instances from the most-available Spot Instance pools by analyzing capacity metrics, thus lowering the chance of interruptions.


Set up the cluster

We are using eksctl to create an Amazon EKS cluster with the name k8-tf-serving in combination with a managed node group. The managed node group has two On-Demand t3.medium nodes and it will bootstrap with the labels lifecycle=OnDemand and intent=control-apps. Be sure to replace <YOUR REGION> with the Region you are launching your cluster into.

eksctl create cluster --name=TensorFlowServingCluster --node-private-networking --managed --nodes=3 --alb-ingress-access --region=<YOUR REGION> --node-type t3.medium --node-labels="lifecycle=OnDemand,intent=control-apps" --asg-access

Check the nodes provisioned by using kubectl get nodes.

Create the NodeGroups now. You create the eksctl configuration file first. Copy the nodegroup configuration below and create a file named spot_nodegroups.yml. Then run the command using eksctl below to add the new Spot nodes to the cluster.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
    name: TensorFlowServingCluster
    region: <YOUR REGION>
    - name: prod-4vcpu-16gb-spot
      minSize: 0
      maxSize: 15
      desiredCapacity: 10
        instanceTypes: ["m5.xlarge", "m5d.xlarge", "m4.xlarge","t3.xlarge","t3a.xlarge","m5a.xlarge","t2.xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
          autoScaler: true
          albIngress: true
    - name: prod-8vcpu-32gb-spot
      minSize: 0
      maxSize: 15
      desiredCapacity: 10
        instanceTypes: ["m5.2xlarge", "m5n.2xlarge", "m5d.2xlarge", "m5dn.2xlarge","m5a.2xlarge", "m4.2xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
          autoScaler: true
          albIngress: true
eksctl create nodegroup -f spot_nodegroups.yml

A few points to note here, for more technical details refer to the EC2 Spot workshop.

  • There are two diversified node groups created with a fixed vCPU:Memory ratio. This adheres to the Spot best practice of diversifying instances, and helps the Cluster Autoscaler function properly.
  • Capacity-optimized Spot allocation strategy is used in both the node groups.

Once the nodes are created, you can check the number of instances provisioned using the command below. It should display 20 as we configured each of our two node groups with a desired capacity of 10 instances.

kubectl get nodes --selector=lifecycle=Ec2Spot | expr $(wc -l) - 1

The cluster setup is complete.

Install the AWS Node Termination Handler

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.3.1/all-resources.yaml

This installs the Node Termination Handler to both Spot Instance and On-Demand Instance nodes. This helps the handler responds to both EC2 maintenance events and Spot Instance interruptions.

Deploy Cluster Autoscaler

For additional detail, see the Amazon EKS page here. Next, export the Cluster Autoscaler into a configuration file:

curl -o cluster_autoscaler.yml https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/content/using_ec2_spot_instances_with_eks/scaling/deploy_ca.files/cluster_autoscaler.yml

Open the file created and edit.

Add AWS Region and the cluster name as depicted in the screenshot below.

Run the commands below to deploy Cluster Autoscaler.

<div class="hide-language"><pre class="unlimited-height-code"><code class="lang-yaml">kubectl apply -f cluster_autoscaler.yml</code></pre></div><div class="hide-language"><pre class="unlimited-height-code"><code class="lang-yaml">kubectl -n kube-system annotate deployment.apps/cluster-autoscaler cluster-autoscaler.kubernetes.io/safe-to-evict="false"</code></pre></div>

Use this command to see into the Cluster Autoscaler (CA) logs to find NodeGroups auto-discovered. Use Ctrl + C to abort the log view.

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10

Deploy TensorFlow Serving

TensorFlow Model Server is deployed in pods and the model will load from the model stored in Amazon S3.

Amazon S3 access

We are using Kubernetes Secrets to store and manage the AWS Credentials for S3 Access.

Copy the following and create a file called kustomization.yml. Add the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY details in the file.

namespace: default
- name: s3-credentials
  disableNameSuffixHash: true

Create the secret file and deploy.

kubectl kustomize . > secret.yaml
kubectl apply -f secret.yaml

We recommend to use Sealed Secret for production workloads, Sealed Secret provides a mechanism to encrypt a Secret object thus making it more secure. For further details please take a look at the AWS workshop here.

ALB Ingress Controller

Deploy RBAC Roles and RoleBindings needed by the AWS ALB Ingress controller.

kubectl apply -f


Download the AWS ALB Ingress controller YAML into a local file.

curl -sS "https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/v1.1.4/docs/examples/alb-ingress-controller.yaml" &gt; alb-ingress-controller.yaml

Change the –cluster-name flag to ‘TensorFlowServingCluster’ and add the Region details under – –aws-region. Also add the lines below just before the ‘serviceAccountName’.

    lifecycle: OnDemand

Deploy the AWS ALB Ingress controller and verify that it is running.

kubectl apply -f alb-ingress-controller.yaml
kubectl logs -n kube-system $(kubectl get po -n kube-system | grep alb-ingress | awk '{print $1}')

Deploy the application

Next, download a model as explained in the TF official documentation, then upload in Amazon S3.

mkdir /tmp/resnet

curl -s http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | \
tar --strip-components=2 -C /tmp/resnet -xvz

RANDOM_SUFFIX=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 10 | head -n 1)

aws s3 mb s3://${S3_BUCKET}
aws s3 sync /tmp/resnet/1538687457/ s3://${S3_BUCKET}/resnet/1/

Copy the following code and create a file named tf_deployment.yml. Don’t forget to replace <AWS_REGION> with the AWS Region you plan to use.

A few things to note here:

  • NodeSelector is used to route the TF Serving replica pods to Spot Instance nodes.
  • ServiceType LoadBalancer is used.
  • model_base_path is pointed at Amazon S3. Replace the <S3_BUCKET> with the S3_BUCKET name you created in last instruction set.
apiVersion: v1
kind: Service
    app: resnet-service
  name: resnet-service
  - name: grpc
    port: 9000
    targetPort: 9000
  - name: http
    port: 8500
    targetPort: 8500
    app: resnet-service
  type: LoadBalancer
apiVersion: apps/v1
kind: Deployment
    app: resnet-service
  name: resnet-v1
  replicas: 25
      app: resnet-service
        app: resnet-service
        version: v1
        lifecycle: Ec2Spot
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=resnet
        - --model_base_path=s3://<S3_BUCKET>/resnet/
        - /usr/bin/tensorflow_model_server
        - name: AWS_REGION
          value: <AWS_REGION>
        - name: S3_ENDPOINT
          value: s3.<AWS_REGION>.amazonaws.com
   - name: AWS_ACCESS_KEY_ID
              name: s3-credentials
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
              name: s3-credentials
              key: AWS_SECRET_ACCESS_KEY        
image: tensorflow/serving
        imagePullPolicy: IfNotPresent
        name: resnet-service
        - containerPort: 9000
        - containerPort: 8500
            cpu: "4"
            memory: 4Gi
            cpu: "2"
            memory: 2Gi

Deploy the application.

kubectl apply -f tf_deployment.yml

Copy the code below and create a file named ingress.yml.

apiVersion: extensions/v1beta1
kind: Ingress
  name: "resnet-service"
  namespace: "default"
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    app: resnet-service
    - http:
          - path: "/v1/models/resnet:predict"
              serviceName: "resnet-service"
              servicePort: 8500

Deploy the ingress.

kubectl apply -f ingress.yml

Deploy the Metrics Server and Horizontal Pod Autoscaler, which scales up when CPU/Memory exceeds 50% of the allocated container resource.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.3.6/components.yaml
kubectl autoscale deployment resnet-v1 --cpu-percent=50 --min=20 --max=100

Load testing

Download the Python helper file written for testing the deployed application.

curl -o submit_mc_tf_k8s_requests.py https://raw.githubusercontent.com/awslabs/ec2-spot-labs/master/tensorflow-serving-load-testing-sample/python/submit_mc_tf_k8s_requests.py

Get the address of the Ingress using the command below.

kubectl get ingress resnet-service

Install a Python Virtual Env. and install the library requirements.

pip3 install virtualenv
virtualenv venv
source venv/bin/activate
pip3 install tqdm
pip3 install requests

Run the following command to warm up the cluster after replacing the Ingress address. You will be running a Python application for predicting the class of a downloaded image against the ResNet model, which is being served by the TF Serving rest API. You are running multiple parallel processes for that purpose. Here “p” is the number of processes and “r” the number of requests for each process.

python submit_mc_tf_k8s_requests.py -p 100 -r 100 -u 'http://<INGRESS ADDRESS>:80/v1/models/resnet:predict'

You can use the command below to observe the scaling of the cluster.

kubectl get hpa -w

We ran the above again with 10,000 requests per process as to send 1 million requests to the application. The results are below:

The deployment was able to serve ~300 requests per second with an average latency of ~320 ms per requests.


Now that you’ve successfully deployed and ran TensorFlow Serving using Ec2 Spot it’s time to cleanup your environment. Remove the ingress, deployment, ingress-controller.

kubectl delete -f ingress.yml
kubectl delete -f tf_deployment.yml
kubectl delete -f alb-ingress-controller.yaml

Remove the model files from Amazon S3.

aws s3 rb s3://${S3_BUCKET}/ --force 

Delete the node groups and the cluster.

eksctl delete nodegroup -f spot_nodegroups.yml --approve
eksctl delete cluster --name TensorFlowServingCluster


In this blog, we demonstrated how TensorFlow Serving can be deployed onto Spot Instances based on a Kubernetes cluster, achieving both resilience and cost optimization. There are multiple optimizations that can be implemented on TensorFlow Serving that will further optimize the performance. This deployment can be extended and used for serving multiple models with different versions. We hope you consider running TensorFlow Serving using EC2 Spot Instances to cost optimize the solution.

The journey of deploying Apache Airflow at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/the-journey-of-deploying-apache-airflow-at-Grab

At Grab, we use Apache Airflow to schedule and orchestrate the ingestion and transformation of data,  train machine learning models, and the copy data between clouds. There are many engineering teams at Grab that use Airflow, each of which originally had their own Airflow instance.

The proliferation of independently managed Airflow instances resulted in inefficient use of resources, where each team ended up solving the same problems of logging, scaling, monitoring, and more. From this morass came the idea of having a single dedicated team to manage all the Airflow instances for anyone in Grab that wants to use Airflow as a scheduling tool.

We designed and implemented an Apache Airflow-based scheduling and orchestration platform that currently runs close to 20 Airflow instances for different teams at Grab. Sounds interesting? What follows is a brief history.

Early days

Circa 2018, we were running a few hundred Directed Acyclic Graphs (DAGs) on one Airflow instance in the Data Engineering team. There was no dedicated team to maintain it, and no Airflow expert in our team. We were struggling to maintain our Airflow instance, which was causing many jobs to fail every day. We were facing issues with library management, scaling, managing and syncing artefacts across all Airflow components, upgrading Airflow versions, deployment, rollbacks, etc.

After a few postmortem reports, we realized that we needed a dedicated team to maintain our Airflow. This was how our Airflow team was born.

In the initial months, we dedicated ourselves to stabilizing our Airflow environment. During this process, we realized that Airflow has a steep learning curve and requires time and effort to understand and maintain properly. Also, we found that tweaking of Airflow configurations required a thorough understanding of Airflow internals.

We felt that for the benefit of everyone at Grab, we should leverage what we learned about Airflow to help other teams at Grab; there was no need for anyone else to go through the same struggles we did. That’s when we started thinking about managing Airflow for other teams.

We talked to the Data Science and Engineering teams who were also running Airflow to schedule their jobs. Almost all the teams were struggling to maintain their Airflow instance. A few teams didn’t have enough technical expertise to maintain their instance. The Data Scientists and Analysts that we spoke to were more than happy to outsource the overhead of Airflow maintenance and wanted to focus more on their Data Science use cases instead.

We started working with one of the Data Science teams and initiated the discussion to create a dockerized Airflow instance and run it on our Kubernetes cluster.

We created the Airflow instance and maintained it for them. Later, we were approached by two more teams to help with their Airflow instances. This was the trigger for us to design and create a platform on which we can efficiently manage Airflow instances for different teams.

Current state

As mentioned, we are currently serving close to 20 Airflow instances for various teams on this platform and leverage Apache Airflow to schedule thousands of daily jobs.  Each Airflow instance is currently scheduling 1k to 60k daily jobs. Also, new teams can quickly try out Airflow without worrying about infrastructure and maintenance overhead. Let’s go through the important aspects of this platform such as design considerations, architecture, deployment, scalability, dependency management, monitoring and alerting, and more.

Design considerations

The initial step we took towards building our scheduling platform was to define a set of expectations and guidelines around ownership, infrastructure, authentication, common artifacts and CI/CD, to name a few.

These were the considerations we had in mind:

  • Deploy containerized Airflow instances on Kubernetes cluster to isolate Airflow instances at the team level. It should scale up and scale out according to usage.
  • Each team can have different sets of jobs that require specific dependencies on the Airflow server.
  • Provide common CI/CD templates to build, test, and deploy Airflow instances. These CI/CD templates should be flexible enough to be extended by users and modified according to their use case.
  • Common plugins, operators, hooks, sensors will be shipped to all Airflow instances. Moreover, each team can have its own plugins, operators, hooks, and sensors.
  • Support LDAP based authentication as it is natively supported by Apache Airflow. Each team can authenticate Airflow UI by their LDAP credentials.
  • Use the Hashicorp Vault to store Airflow specific secrets. Inject these secrets via sidecar in Airflow servers.
  • Use ELK stack to access all application logs and infrastructure logs.
  • Datadog and PagerDuty will be used for monitoring and alerting.
  • Ingest job statistics such as total number of jobs scheduled, no of failed jobs, no of successful jobs, active DAGs, etc. into the data lake and will be accessible via Presto.
Architecture diagram
Architecture diagram

Infrastructure management

Initially, we started deploying Airflow instances on  Kubernetes clusters managed via Kubernetes Operations (KOPS). Later, we migrated to Amazon EKS to reduce the overhead of managing the Kubernetes control plane. Each Kubernetes namespace deploys one Airflow instance.

We chose Terraform to manage infrastructure as code. We deployed each Airflow instance using Terraform modules, which include a helm_release Terraform resource on top of our customized Airflow Helm Chart.

Each Airflow instance connects to its own Redis and RDS. RDS is responsible for storing Airflow metadata and Redis is acting as a celery broker between Airflow scheduler and Airflow workers.

The Hashicorp Vault is used to store secrets required by Airflow instances and injected via sidecar by each Airflow component. The ELK stack stores all logs related to Airflow instances and is used for troubleshooting any instance. Datadog, Slack, and PagerDuty are used to send alerts.

Presto is used to access job statistics, such as numbers on scheduled jobs, failed jobs, successful jobs, and active DAGs, to help each team to analyze their usage and stability of their jobs.

Doing things at scale

There are two kinds of scaling we need to talk about:

  • scaling of Airflow instances on a resource level handling different loads
  • scaling in terms of teams served on the platform

To scale Airflow instances, we set the request and the limit of each Airflow component allowing any of the components to scale up easily. To scale out Airflow workers, we decided to enable the horizontal pod autoscaler (HPA) using Memory and CPU parameters. The cluster autoscaler on EKS helps in scaling the platform to accommodate more teams.

Moreover, we categorized all our Airflow instances in three sizes (small, medium, and large) to efficiently use the resources. This was based on how many hourly/daily jobs it scheduled. Each Airflow instance type has a specific RDS instance type and storage, Redis instance type and CPU and memory, request/limit for scheduler, worker, web server, and flower. There are different Airflow configurations for each instance type to optimize the given resources to the Airflow instance.

Airflow image and version management

The Airflow team builds and releases one common base Docker image for each Airflow version. The base image has Airflow installed with specific versions, as well as common Python packages, plugins, helpers, tests, patches, and so on.

Each team has their customized Docker image on top of the base image. In their customized Docker image, they can update the Python packages and can download other artifacts that they require. Each Airflow instance will be deployed using the team’s customized image.

There are common CI/CD templates provided by the Airflow team to build the customized image, run unit tests, and deploy Airflow instances from their GitLab pipeline.

To upgrade the Airflow version, the Airflow team reviews and studies the changelog of the released Airflow version, note down the important features and its impacts, open issues, bugs, and workable solutions. Later, we build and release the base Docker image using the new Airflow version.

We support only one Airflow version for all Airflow instances to have less maintenance overhead. In the case of minor or major versions, we support one old and new versions until the retirement period.

How do we deploy

There is a deployment ownership guideline that explains the schedule of deployments and the corresponding PICs. All teams have agreed on this guideline and share the responsibility with the Airflow Team.

There are two kinds of deployment:

  • DAG deployment: This is part of the common GitLab CI/CD template. The Airflow team doesn’t trigger the DAG deployment, it’s fully owned by the teams.
  • Airflow instance deployment: The Airflow instance deployment is required in these scenarios:
    1. update in base Docker image
    2. add/update in Python packages by any team
    3. customization in the base image by any team
    4. change in Airflow configurations
    5. change in the resource of scheduler, worker, web server or flower

Base Docker image update

The Airflow team maintains the base Docker image on the AWS Elastic Container Registry. The GitLab CI/CD builds the updated base image whenever the Airflow team changes the base image. The base image is validated by automated deployment on the test environment and automated smoke test. The Airflow instance owner of each team needs to trigger their build and deployment pipeline to apply the base image changes on their Airflow instance.

Python package additions or updates

Each team can add or update their Python dependencies. The Gitlab CI/CD pipeline builds a new image with updated changes. The Airflow instance owner manually triggers the deployment from their CI/CD pipeline. There is a flag to make it automated deployment as well.

Based image customization

Each team can add any customizations on the base image. Similar to the above scenario, the Gitlab CI/CD pipeline builds a new image with updated changes. The Airflow instance owner manually triggers the deployment from their CI/CD pipeline. To automate the deployment, a flag is made available.

Configuration Airflow and Airflow component resource changes

To optimize the Airflow instances, the Airflow Team makes changes to the Airflow configurations and resources of any of the Airflow components. The Airflow configurations and resources are also part of the Terraform code. Atlantis (https://www.runatlantis.io/) deploys the Airflow instances with Terraform changes.

There is no downtime in any form of deployment and doesn’t impact the running tasks and the Airflow UI.


During the process of making our first Airflow stable, we started exploring testing in Airflow. We wanted to validate the correctness of DAGs, duplicate DAG IDs, checking typos and cyclicity in DAGs, etc. We then later wrote the tests by ourselves and published a detailed blog in several channels: usejournal (part1) and medium (part2).

These tests are available in the base image and run in the GitLab pipeline from the user’s repository to validate their DAGs. The unit tests run using the common GitLab CI/CD template provided by the Airflow team.

Monitoring & alerting

Our scheduling platform runs the Airflow instance for many critical jobs scheduled by each team. It’s important for us to monitor all Airflow instances and alert respective stakeholders in case of any failure.

We use a Datadog for monitoring and alerting. To create a common Datadog dashboard, it is required to pass tags with metrics from Airflow and till Airflow 1.10.x, it doesn’t support tagging to Datadog metrics.

We have contributed to the community to enable Datadog support and it will be released in Airflow 2.0.0 (https://github.com/apache/Airflow/pull/7376). We internally patched this pull request and created the common Datadog dashboard.

There are three categories of metrics that we are interested in:

  • EKS cluster metrics: It includes total In-Service Nodes, allocated CPU cores, allocated Memory, Node status, CPU/Memory request vs limit, Node disk and Memory pressure, Rx-Tx packets dropped/errors, etc.
  • Host Metrics: These metrics are for each host participating in the EKS cluster. It includes Host CPU/Memory utilization, Host free memory, System disk, and EBS IOPS, etc.
  • Airflow instance metrics: These metrics are for each Airflow instance. It includes scheduler heartbeats, DagBag size, DAG processing import errors, DAG processing time, open/used slots in a pool, each pod’s Memory/CPU usage, CPU and Memory utilization of metadata DB, database connections as well as the number of workers, active/paused DAGs, successful/failed/queued/running tasks, etc.
Sample Datadog dashboard
Sample Datadog dashboard

We alert respective stakeholders and oncalls using Slack and PagerDuty.


These are the benefits of having our own Scheduling Platform:

  • Scaling: HPA on Airflow workers running on EKS with autoscaler helps Airflow workers to scale automatically to theoretically infinite scale. This enables teams to run thousands of DAGs.
  • Logging: Centralized logging using Kibana.
  • Better Isolation: Separate Docker images for each team provide better isolation.
  • Better Customization: All teams are provided with a mechanism to customize their Airflow worker environment according to their requirements.
  • Zero Downtime: Rolling upgrade and termination period on Airflow workers helps in zero downtime during the deployment.
  • Efficient usage of infrastructure: Each team doesn’t need to allocate infrastructure for Airflow instances. All Airflow instances are deployed on one shared EKS cluster.
  • Less maintenance overhead for users:  Users can focus on their core work and don’t need to spend time maintaining Airflow instances and it’s resources.
  • Common plugins and helpers: All common plugins and helpers available to use on Airflow instances. Each team doesn’t need to add.


Designing and implementing our own scheduling platform started with many challenges and unknowns. We were not sure about the scale we were aiming for, the heterogeneous workload from each team, or the level of triviality or complexity we were going to be faced. After two years, we have successfully built and productionized a scalable scheduling platform that helps teams at Grab to schedule their workload.

We have many failure stories, odd things we ran into, hacks and workarounds we patched. But, we went through it and provided a cost-effective and scalable scheduling platform with low maintenance overhead to all teams at Grab.

What’s Next

Moving ahead, we will be exploring to add the following capabilities:

  • REST APIs to enable teams to access their Airflow instance programmatically and have better integration with other tools and frameworks.
  • Support of dynamic DAGs at scale to help in decreasing the DAG maintenance overhead.
  • Template-based engine to act as a middle layer between the scheduling platform and external systems. It will have a set of templates to generate DAGs which helps in better integration with the external system.

We suggest anyone who is running multiple Airflow instances within different teams to look at this approach and build the centralized scheduling platform. Before you begin,  review the feasibility of building the centralized platform as it requires a vision, a lot of effort, and cross-communication with many teams.

Authored by Chandulal Kavar on behalf of the Airflow team at Grab – Charles Martinot, Vinson Lee, Akash Sihag, Piyush Gupta, Pramiti Goel, Dewin Goh, QuiHieu Nguyen, James Anh-Tu Nguyen, and the Data Engineering Team.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Building for Cost optimization and Resilience for EKS with Spot Instances

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/cost-optimization-and-resilience-eks-with-spot-instances/

This post is contributed by Chris Foote, Sr. EC2 Spot Specialist Solutions Architect

Running your Kubernetes and containerized workloads on Amazon EC2 Spot Instances is a great way to save costs. Kubernetes is a popular open-source container management system that allows you to deploy and manage containerized applications at scale. AWS makes it easy to run Kubernetes with Amazon Elastic Kubernetes Service (EKS) a managed Kubernetes service to run production-grade workloads on AWS. To cost optimize these workloads, run them on Spot Instances. Spot Instances are available at up to a 90% discount compared to On-Demand prices. These instances are best used for various fault-tolerant and instance type flexible applications. Spot Instances and containers are an excellent combination, because containerized applications are often stateless and instance flexible.

In this blog, I illustrate the best practices of using Spot Instances such as diversification, automated interruption handling, and leveraging Auto Scaling groups to acquire capacity. You then adapt these Spot Instance best practices to EKS with the goal of cost optimizing and increasing the resilience of container-based workloads.

Spot Instances Overview

Spot Instances are spare Amazon EC2 capacity that allows customers to save up to 90% over On-Demand prices. Spot capacity is split into pools determined by instance type, Availability Zone (AZ), and AWS Region. The Spot Instance price changes slowly determined by long-term trends in supply and demand of a particular Spot capacity pool, as shown below:

Spot Instance pricing

Prices listed are an example, and may not represent current prices. Spot Instance pricing is illustrated in orange blocks, and On-Demand is illustrated in dark-blue.

When EC2 needs the capacity back, the Spot Instance service arbitrarily sends Spot interruption notifications to instances within the associated Spot capacity pool. This Spot interruption notification lands in both the EC2 instance metadata and Eventbridge. Two minutes after the Spot interruption notification, the instance is reclaimed. You can set up your infrastructure to automate a response to this two-minute notification. Examples include draining containers, draining ELB connections, or post-processing.

Instance flexibility is important when following Spot Instance best practices, because it allows you to provision from many different pools of Spot capacity. Leveraging multiple Spot capacity pools help reduce interruptions depending on your defined Spot Allocation Strategy, and decrease time to provision capacity. Tapping into multiple Spot capacity pools across instance types and AZs, allows you to achieve your desired scale — even for applications that require 500K concurrent cores:

Spot capacity pools = (Availability Zones) * (Instance Types)

If your application is deployed across two AZs and uses only an c5.4xlarge then you are only using (2 * 1 = 2) two Spot capacity pools. To follow Spot Instance best practices, consider using six AZs and allowing your application to use c5.4xlarge, c5d.4xlarge, c5n.4xlarge, and c4.4xlarge. This gives us (6 * 4 = 24) 24 Spot capacity pools, greatly increasing the stability and resilience of your application.

Auto Scaling groups support deploying applications across multiple instance types, and automatically replace instances if they become unhealthy, or terminated due to Spot interruption. To decrease the chance of interruption, use the capacity-optimized Spot allocation strategy. This automatically launches Spot Instances into the most available pools by looking at real-time capacity data, and identifying which are the most available.

Now that I’ve covered Spot best practices, you can apply them to Kubernetes and build an architecture for EKS with Spot Instances.

Solution architecture

The goals of this architecture are as follows:

  • Automatically scaling the worker nodes of Kubernetes clusters to meet the needs of the application
  • Leveraging Spot Instances to cost-optimize workloads on Kubernetes
  • Adapt Spot Instance best practices (like diversification) to EKS and Cluster Autoscaler

You achieve these goals via the following components:

Component Role Details Deployment Method
Cluster Autoscaler Scales EC2 instances automatically according to pods running in the cluster Open Source A DaemonSet via Helm on On-Demand Instances
EC2 Auto Scaling group Provisions and maintains EC2 instance capacity AWS Cloudformation via eksctl
AWS Node Termination Handler Detects EC2 Spot interruptions and automatically drains nodes Open Source A DaemonSet via Helm on Spot Instances

The architecture deploys the EKS worker nodes over three AZs, and leverages three Auto Scaling groups – two for Spot Instances, and one for On-Demand. The Kubernetes Cluster Autoscaler is deployed on On-Demand worker nodes, and the AWS Node Termination Handler is deployed on all worker nodes.

EKS worker node architecture

Additional detail on Kubernetes interaction with Auto Scaling group:

  • Cluster Autoscaler can be used to control scaling activities by changing the DesiredCapacity of the Auto Scaling group, and directly terminating instances. Auto Scaling groups can be used to find capacity, and automatically replace any instances that become unhealthy, or terminated through Spot Instance interruptions.
  • The Cluster Autoscaler can be provisioned as a Deployment of one pod to an instance of the On-Demand Auto Scaling group. The proceeding diagram shows the pod in AZ1, however this may not necessarily be the case.
  • Each node group maps to a single Auto Scaling group. However, Cluster Autoscaler requires all instances within a node group to share the same number of vCPU and amount of RAM. To adhere to Spot Instance best practices and maximize diversification, you use multiple node groups. Each of these node groups is a mixed-instance Auto Scaling group with capacity-optimized Spot allocation strategy.

Kuberentes Node Groups

Autoscaling in Kubernetes Clusters

There are two common ways to scale Kubernetes clusters:

  1. Horizontal Pod Autoscaler (HPA) scales the pods in deployment or a replica set to meet the demand of the application. Scaling policies are based on observed CPU utilization or custom metrics.
  2. Cluster Autoscaler (CA) is a standalone program that adjusts the size of a Kubernetes cluster to meet the current needs. It increases the size of the cluster when there are pods that failed to schedule on any of the current nodes due to insufficient resources. It attempts to remove underutilized nodes, when its pods can run elsewhere.

When a pod cannot be scheduled due to lack of available resources, Cluster Autoscaler determines that the cluster must scale out and increases the size of the node group. When multiple node groups are used, Cluster Autoscaler chooses one based on the Expander configuration. Currently, the following strategies are supported: random, most-pods, least-waste, and priority

You use random placement strategy in this example for the Expander in Cluster Autoscaler. This is the default expander, and arbitrarily chooses a node-group when the cluster must scale out. The random expander maximizes your ability to leverage multiple Spot capacity pools. However, you can evaluate the others and may find another more appropriate for your workload.

Atlassian Escalator:

An alternative to Cluster Autoscaler for batch workloads is Escalator. It is designed for large batch or job-based workloads that cannot be force-drained and moved when the cluster needs to scale down.

Auto Scaling group

Following best practices for Spot Instances means deploying a fleet across a diversified set of instance families, sizes, and AZs. An Auto Scaling group is one of the best mechanisms to accomplish this. Auto Scaling groups automatically replace Spot Instances that have been terminated due to a Spot interruption with an instance from another capacity pool.

Auto Scaling groups support launching capacity from multiple instances types, and using multiple purchase options. For the example in this blog post, I’m maintaining a 1:4 vCPU to memory ratio for all instances chosen. For your application, there may be a different set of requirements. The instances chosen for the two Auto Scaling groups are below:

  • 4vCPU / 16GB ASG: xlarge, m5d.xlarge, m5n.xlarge, m5dn.xlarge, m5a.xlarge, m4.xlarge
  • 8vCPU / 32GB ASG: 2xlarge, m5d.2xlarge, m5n.2xlarge, m5dn.2xlarge, m5a.2xlarge, m4.2xlarge

In this example I use a total of 12 different instance types and three different AZs, for a total of (12 * 3 = 36) 36 different Spot capacity pools. The Auto Scaling group chooses which instance types to deploy based on the Spot allocation strategy. To minimize the chance of Spot Instance interruptions, you use the capacity-optimized allocation strategy. The capacity-optimized strategy automatically launches Spot Instances into the most available pools by looking at real-time capacity data and predicting which are the most available.

capacity-optimized Spot allocation strategy

Example of capacity-optimized Spot allocation strategy.

For Cluster Autoscaler, other cluster administration/management pods, and stateful workloads that run on EKS worker nodes, you create a third Auto Scaling group using On-Demand Instances. This ensures that Cluster Autoscaler is not impacted by Spot Instance interruptions. In Kubernetes, labels and nodeSelectors can be used to control where pods are placed. You use the nodeSelector to place Cluster Autoscaler on an instance in the On-Demand Auto Scaling group.

Note: The Auto Scaling groups continually work to balance the number of instances in each AZ they are deployed over. This may cause worker nodes to be terminated while ASG is scaling in the number of instances within an AZ. You can disable this functionality by suspending the AZRebalance process, but this can result in capacity becoming unbalanced across AZs. Another option is running a tool to drain instances upon ASG scale-in such as the EKS Node Drainer. The code provides an AWS Lambda function that integrates as an Amazon EC2 Auto Scaling Lifecycle Hook. When called, the Lambda function calls the Kubernetes API to cordon and evicts all evictable pods from the node being terminated. It then waits until all pods are evicted before the Auto Scaling group continues to terminate the EC2 instance.

Spot Instance Interruption Handling

To mitigate the impact of potential Spot Instance interruptions, leverage the ‘node termination handler’. The DaemonSet deploys a pod on each Spot Instance to detect the Spot Instance interruption notification, so that it can both terminate gracefully any pod that was running on that node, drain from load balancers and allow the Kubernetes scheduler to reschedule the evicted pods elsewhere on the cluster.

The workflow can be summarized as:

  • Identify that a Spot Instance is about to be interrupted in two minutes.
  • Use the two-minute notification window to gracefully prepare the node for termination.
  • Taint the node and cordon it off to prevent new pods from being placed on it.
  • Drain connections on the running pods.


  • Controllers that manage K8s objects like Deployments and ReplicaSet will understand that one or more pods are not available and create a new replica.
  • Cluster Autoscaler and AWS Auto Scaling group will re-provision capacity as needed.


Getting started (launch EKS)

First, you use eksctl to create an EKS cluster with the name spotcluster-eksctl in combination with a managed node group. The managed node group will have two On-Demand t3.medium nodes and it will bootstrap with the labels lifecycle=OnDemand and intent=control-apps. Be sure to replace <YOUR REGION> with the region you’ll be launching your cluster into.

eksctl create cluster --version=1.15 --name=spotcluster-eksctl --node-private-networking --managed --nodes=3 --alb-ingress-access --region=<YOUR REGION> --node-type t3.medium --node-labels="lifecycle=OnDemand" --asg-access

This takes approximately 15 minutes. Once cluster creation is complete, test the node connectivity:

kubectl get nodes

Provision the worker nodes

You use eksctl create nodegroup and eksctl configuration files to add the new nodes to the cluster. First, create the configuration file spot_nodegroups.yml. Then, paste the code and replace <YOUR REGION> with the region you launched your EKS cluster in.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
    name: spotcluster-eksctl
    region: <YOUR REGION>
    - name: ng-4vcpu-16gb-spot
      minSize: 0
      maxSize: 5
      desiredCapacity: 1
        instanceTypes: ["m5.xlarge", "m5n.xlarge", "m5d.xlarge", "m5dn.xlarge","m5a.xlarge", "m4.xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
          autoScaler: true
          albIngress: true
    - name: ng-8vcpu-32gb-spot
      minSize: 0
      maxSize: 5
      desiredCapacity: 1
        instanceTypes: ["m5.2xlarge", "m5n.2xlarge", "m5d.2xlarge", "m5dn.2xlarge","m5a.2xlarge", "m4.2xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
          autoScaler: true
          albIngress: true

This configuration file adds two diversified Spot Instance node groups with 4vCPU/16GB and 8vCPU/32GB instance types. These node groups use the capacity-optimized Spot allocation strategy as described above. Last, you label all nodes created with the instance lifecycle “Ec2Spot” and later use nodeSelectors, to guide your application front-end to your Spot Instance nodes. To create both node groups, run:

eksctl create nodegroup -f spot_nodegroups.yml

This takes approximately three minutes. Once done, confirm these nodes were added to the cluster:

kubectl get nodes --show-labels --selector=lifecycle=Ec2Spot

Install the Node Termination Handler

You can install the .yaml file from the official GitHub site.

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.3.1/all-resources.yaml

This installs the Node Termination Handler to both Spot Instance and On-Demand nodes, which is helpful because the handler responds to both EC2 maintenance events and Spot Instance interruptions. However, if you are interested in limiting deployment to just Spot Instance nodes, the site has additional instructions to accomplish this.

Verify the Node Termination Handler is running:

kubectl get daemonsets --all-namespaces

Deploy the Cluster Autoscaler

For additional detail, see the EKS page here. Export the Cluster Autoscaler into a configuration file:

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Open the file created and edit the cluster-autoscaler container command to replace <YOUR CLUSTER NAME> with your cluster’s name, and add the following options.


You also need to change the expander configuration. Search for - --expander= and replace least-waste with random


        - command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=random
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false

Save the file and then deploy the Cluster Autoscaler:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Next, add the cluster-autoscaler.kubernetes.io/safe-to-evictannotation to the deployment with the following command:

kubectl -n kube-system annotate deployment.apps/cluster-autoscaler cluster-autoscaler.kubernetes.io/safe-to-evict="false"

Open the Cluster Autoscaler releases page in a web browser and find the latest Cluster Autoscaler version that matches your cluster’s Kubernetes major and minor version. For example, if your cluster’s Kubernetes version is 1.16 find the latest Cluster Autoscaler release that begins with 1.16.

Set the Cluster Autoscaler image tag to this version using the following command, replacing 1.15.n with your own value. You can replace us with asia or eu:

kubectl -n kube-system set image deployment.apps/cluster-autoscaler cluster-autoscaler=us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.15.n

To view the Cluster Autoscaler logs, use the following command:

kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler

Deploy the sample application

Create a new file web-app.yaml, paste the following specification into it and save the file:

apiVersion: apps/v1
kind: Deployment
  name: web-stateless
  replicas: 3
      app: nginx
        service: nginx
        app: nginx
      - image: nginx
        name: web-stateless
            cpu: 1000m
            memory: 1024Mi
            cpu: 1000m
            memory: 1024Mi
        lifecycle: Ec2Spot
apiVersion: apps/v1
kind: Deployment
  name: web-stateful
  replicas: 2
      app: redis
        service: redis
        app: redis
      - image: redis:3.2-alpine
        name: web-stateful
            cpu: 1000m
            memory: 1024Mi
            cpu: 1000m
            memory: 1024Mi
        lifecycle: OnDemand

This deploys three replicas, which land on one of the Spot Instance node groups due to the nodeSelector choosing lifecycle: Ec2Spot. The “web-stateful” nodes are not fault-tolerant and not appropriate to be deployed on Spot Instances. So, you use nodeSelector again, and instead choose lifecycle: OnDemand. By guiding fault-tolerant pods to Spot Instance nodes, and stateful pods to On-Demand nodes, you can even use this to support multi-tenant clusters.

To deploy the application:

kubectl apply -f web-app.yaml

Confirm that both deployments are running:

kubectl get deployment/web-stateless
kubectl get deployment/web-stateful

Now, scale out the stateless application:

kubectl scale --replicas=30 deployment/web-stateless

Check to see that there are pending pods. Wait approximately 5 minutes, then check again to confirm the pending pods have been scheduled:

kubectl get pods


Remove the AWS Node Termination Handler:

kubectl delete daemonset aws-node-termination-handler -n kube-system

Remove the two Spot node groups (EC2 Auto Scaling group) that you deployed in the tutorial.

eksctl delete nodegroup ng-4vcpu-16gb-spot --cluster spotcluster-eksctl
eksctl delete nodegroup ng-8vcpu-32gb-spot --cluster spotcluster-eksctl

If you used a new cluster and not your existing cluster, delete the EKS cluster.

eksctl confirms the deletion of the cluster’s CloudFormation stack immediately but the deletion could take up to 15 minutes. You can optionally track it in the CloudFormation Console.

eksctl delete cluster --name spotcluster-eksctl


By following best practices, Kubernetes workloads can be deployed onto Spot Instances, achieving both resilience and cost optimization. Instance and Availability Zone flexibility are the cornerstones of pulling from multiple capacity pools and obtaining the scale your application requires. In addition, there are pre-built tools to handle Spot Instance interruptions, if they do occur. EKS makes this even easier by reducing operational overhead through offering a highly-available managed control-plane and managed node groups. You’re now ready to begin integrating Spot Instances into your Kubernetes clusters to reduce workload cost, and if needed, achieve massive scale.

Releasing kubectl support in Access

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/releasing-kubectl-support-in-access/

Releasing kubectl support in Access

Starting today, you can use Cloudflare Access and Argo Tunnel to securely manage your Kubernetes cluster with the kubectl command-line tool.

We built this to address one of the edge cases that stopped all of Cloudflare, as well as some of our customers, from disabling the VPN. With this workflow, you can add SSO requirements and a zero-trust model to your Kubernetes management in under 30 minutes.

Once deployed, you can migrate to Cloudflare Access for controlling Kubernetes clusters without disrupting your current kubectl workflow, a lesson we learned the hard way from dogfooding here at Cloudflare.

What is kubectl?

A Kubernetes deployment consists of a cluster that contains nodes, which run the containers, as well as a control plane that can be used to manage those nodes. Central to that control plane is the Kubernetes API server, which interacts with components like the scheduler and manager.

kubectl is the Kubernetes command-line tool that developers can use to interact with that API server. Users run kubectl commands to perform actions like starting and stopping the nodes, or modifying other elements of the control plane.

In most deployments, users connect to a VPN that allows them to run commands against that API server by addressing it over the same local network. In that architecture, user traffic to run these commands must be backhauled through a physical or virtual VPN appliance. More concerning, in most cases the user connecting to the API server will also be able to connect to other addresses and ports in the private network where the cluster runs.

How does Cloudflare Access apply?

Cloudflare Access can secure web applications as well as non-HTTP connections like SSH, RDP, and the commands sent over kubectl. Access deploys Cloudflare’s network in front of all of these resources. Every time a request is made to one of these destinations, Cloudflare’s network checks for identity like a bouncer in front of each door.

Releasing kubectl support in Access

If the request lacks identity, we send the user to your team’s SSO provider, like Okta, AzureAD, and G Suite, where the user can login. Once they login, they are redirected to Cloudflare where we check their identity against a list of users who are allowed to connect. If the user is permitted, we let their request reach the destination.

In most cases, those granular checks on every request would slow down the experience. However, Cloudflare Access completes the entire check in just a few milliseconds. The authentication flow relies on Cloudflare’s serverless product, Workers, and runs in every one of our data centers in 200 cities around the world. With that distribution, we can improve performance for your applications while also authenticating every request.

How does it work with kubectl?

To replace your VPN with Cloudflare Access for kubectl, you need to complete two steps:

  • Connect your cluster to Cloudflare with Argo Tunnel
  • Connect from a client machine to that cluster with Argo Tunnel
Releasing kubectl support in Access

Connecting the cluster to Cloudflare

On the cluster side, Cloudflare Argo Tunnel connects those resources to our network by creating a secure tunnel with the Cloudflare daemon, cloudflared. As an administrator, you can run cloudflared in any space that can connect to the kubectl API server over TCP.

Once installed, an administrator authenticates the instance of cloudflared by logging in to a browser with their Cloudflare account and choosing a hostname to use. Once selected, Cloudflare will issue a certificate to cloudflared that can be used to create a subdomain for the cluster.

Next, an administrator starts the tunnel. In the example below, the hostname value can be any subdomain of the hostname selected in Cloudflare; the url value should be the API server for the cluster.

cloudflared tunnel --hostname cluster.site.com --url tcp://kubernetes.docker.internal:6443 --socks5=true 

This should be run as a systemd process to ensure the tunnel reconnects if the resource restarts.

Connecting as an end user

End users do not need an agent or client application to connect to web applications secured by Cloudflare Access. They can authenticate to on-premise applications through a browser, without a VPN, like they would for SaaS tools. When we apply that same security model to non-HTTP protocols, we need to establish that secure connection from the client with an alternative to the web browser.

Unlike our SSH flow, end users cannot modify kubeconfig to proxy requests through cloudflared. Pull requests have been submitted to add this functionality to kubeconfig, but in the meantime users can set an alias to serve a similar function.

First, users need to download the same cloudflared tool that administrators deploy on the cluster. Once downloaded, they will need to run a corresponding command to create a local SOCKS proxy. When the user runs the command, cloudflared will launch a browser window to prompt them to login with their SSO and check that they are allowed to reach this hostname.

$ cloudflared access tcp --hostname cluster.site.com url

The proxy allows your local kubectl tool to connect to cloudflared via a SOCKS5 proxy, which helps avoid issues with TLS handshakes to the cluster itself. In this model, TLS verification can still be exchanged with the kubectl API server without disabling or modifying that flow for end users.

Users can then create an alias to save time when connecting. The example below aliases all of the steps required to connect in a single command. This can be added to the user’s bash profile so that it persists between restarts.

$ alias kubeone=”env HTTPS_PROXY=socks5:// kubectl

A (hard) lesson when dogfooding

When we build products at Cloudflare, we release them to our own organization first. The entire company becomes a feature’s first customer, and we ask them to submit feedback in a candid way.

Cloudflare Access began as a product we built to solve our own challenges with security and connectivity. The product impacts every user in our team, so as we’ve grown, we’ve been able to gather more expansive feedback and catch more edge cases.

The kubectl release was no different. At Cloudflare, we have a team that manages our own Kubernetes deployments and we went to them to discuss the prototype. However, they had more than just some casual feedback and notes for us.

They told us to stop.

We had started down an implementation path that was technically sound and solved the use case, but did so in a way that engineers who spend all day working with pods and containers would find to be a real irritant. The flow required a small change in presenting certificates, which did not feel cumbersome when we tested it, but we do not use it all day. That grain of sand would cause real blisters as a new requirement in the workflow.

With their input, we stopped the release, and changed that step significantly. We worked through ideas, iterated with them, and made sure the Kubernetes team at Cloudflare felt this was not just good enough, but better.

What’s next?

Support for kubectl is available in the latest release of the cloudflared tool. You can begin using it today, on any plan. More detailed instructions are available to get started.

If you try it out, please send us your feedback! We’re focused on improving the ease of use for this feature, and other non-HTTP workflows in Access, and need your input.

New to Cloudflare for Teams? You can use all of the Teams products for free through September. You can learn more about the program, and request a dedicated onboarding session, here.

TMA Special: Connecting Taza Chocolate’s Legacy Equipment to the Cloud

Post Syndicated from Todd Escalona original https://aws.amazon.com/blogs/architecture/tma-special-connecting-taza-chocolates-legacy-equipment-to-the-cloud/

As a “bean to bar” chocolate manufacturer, Taza Chocolate uses traditional stone ground mills for the production of its famous chocolate discs. The analog, mid-century machines that the company imported from Central America were never built to connect to the cloud.

Along comes Tulip Interfaces, an AWS Industrial Software Competency Partner that makes the human and machine interaction easier by replacing paper processes with digital automation. Tulip retrofitted Taza’s legacy equipment with Internet of Things (IoT) sensors and connected it back to the AWS cloud.

Taza’s AWS cloud integration begins with Tulip’s own physical gateway that connects systems and machinery on the plant floor. Tulip then deploys IoT sensors to the machinery and passes outputs to the AWS cloud using an encrypted web socket where Tulip’s Kubernetes workers, managed by Kops, automatically schedule services across highly available instances and processes requests.

All job completion data is then fed to an Amazon RDS Multi-AZ PostgreSQL database that allows Taza to run visualizations and analytics for more insight using Prometheus and Garfana. In addition, all of the application definition metadata is contained in a MongoDB database service running on Amazon Elastic Cloud Compute (EC2) instances, which in return is VPC-peered with Kubernetes clusters. On top of this backend, Tulip uses a player application to stream metrics in near real-time that are displayed on the dashboard down on the shop floor and can be easily examined in order to help guide their operations and foster continuous improvements efforts to manufacturing operations.

Taza has realized many benefits from monitoring machine availability, performance, ambient conditions as well as overall process enhancements.

In this special, on-site This is My Architecture video, AWS Solutions Architect Evangelist Todd Escalona takes us on his journey through the Taza Chocolate factory where he meets with Taza’s Director of Manufacturing, Rich Moran, and Tulip’s DevOps lead, John Defreitas, to further explore how Tulip enables Taza Chocolate’s legacy equipment for cloud-based plant automation.

*Check out more This Is My Architecture video series.

Plumbing At Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/plumbing-at-scale

When you open the Grab app and hit book, a series of events are generated that define your personalised experience with us: booking state machines kick into motion, driver partners are notified, reward points are computed, your feed is generated, etc. While it is important for you to know that a request has been received, a lot happens asynchronously in our back-end services.

As custodians and builders of the streaming platform at Grab operating at massive scale (think terabytes of data ingress each hour), the Coban team’s mission is to provide a NoOps, managed platform for seamless, secure access to event streams in real-time, for every team at Grab.

Coban Sewu Waterfall In Indonesia
Coban Sewu Waterfall In Indonesia. (Streams, get it?)

Streaming systems are often at the heart of event-driven architectures, and what starts as a need for a simple message bus for asynchronous processing of events quickly evolves into one that requires a more sophisticated stream processing paradigms.
Earlier this year, we saw common patterns of event processing emerge across our Go backend ecosystem, including:

  • Filtering and mapping stream events of one type to another
  • Aggregating events into time windows and materializing them back to the event log or to various types of transactional and analytics databases

Generally, a class of problems surfaced which could be elegantly solved through an event sourcing1 platform with a stream processing framework built over it, similar to the Keystone platform at Netflix2.

This article details our journey building and deploying an event sourcing platform in Go, building a stream processing framework over it, and then scaling it (reliably and efficiently) to service over 300 billion events a week.

Event Sourcing

Event sourcing is an architectural pattern where changes to an application state are stored as a sequence of events, which can be replayed, recomputed, and queried for state at any time. An implementation of the event sourcing pattern typically has three parts to it:

  • An event log
  • Processor selection logic: The logic that selects which chunk of domain logic to run based on an incoming event
  • Processor domain logic: The domain logic that mutates an application’s state
Event Sourcing
Event Sourcing

Event sourcing is a building block on which architectural patterns such as Command Query Responsibility Segregation3, serverless systems, and stream processing pipelines are built.

The Case For Stream Processing

Below are some use cases serviced by stream processing, built on event sourcing.

Asynchronous State Management

A pub-sub system allows for change events from one service to be fanned out to multiple interested subscribers without letting any one subscriber block the progress of others. Abstracting the event log and centralising it democratises access to this log to all back-end services. It enables the back-end services to apply changes from this centralised log to their own state, independent of downstream services, and/or publish their state changes to it.

Time Windowed Aggregations

Time-windowed aggregates are a common requirement for machine learning models (as features) as well as analytics. For example, personalising the Grab app landing page requires counting your interaction with various widget elements in recent history, not any one event in particular. Similarly, an analyst may not be interested in the details of a singular booking in real-time, but in building demand heatmaps segmented by geohashes. For latency-sensitive lookups, especially for the personalisation example, pre-aggregations are preferred instead of post-aggregations.

Stream Joins, Filtering, Mapping

Event logs are typically sharded by some notion of topics to logically divide events of interest around a theme (booking events, profile updates, etc.). Building bigger topics out of smaller ones, as well as smaller ones from bigger ones are common ways to compose “substreams” of the log of interest directed towards specific services. For example, a promo service may only be interested in listening to booking events for promotional bookings.

Realtime Business Intelligence

Outputs of stream processing workloads are also plugged into realtime Business Intelligence (BI) and stream analytics solutions upstream, as raw data for visualizations on operations dashboards.


For offline analytics, as well as reconciliation and disaster recovery, having an archive in a cold store helps for certain mission critical streams.

Platform Requirements

Any processing platform for event sourcing and stream processing has certain expectations around its functionality.

Scaling and Elasticity

Stream/Event Processing pipelines need to be elastic and responsive to changes in traffic patterns, especially considering that user activity (rides, food, deliveries, payments) varies dramatically during the course of a day or week. A spike in food orders on rainy days shouldn’t cause indefinite order processing latencies.


For a platform team, it’s important that users can easily onboard and manage their pipeline lifecycles, at their preferred cadence. To scale effectively, the process of scaffolding, configuring, and deploying pipelines needs to be standardised, and infrastructure managed. Both the platform and users are able to leverage common standards of telemetry, configuration, and deployment strategies, and users benefit from a lack of infrastructure management overhead.


Our platform has quickly scaled to support hundreds of pipelines. Workload isolation, independent processing uptime guarantees, and resource allocation and cost audit are important requirements necessitating multi-tenancy, which help amortize platform overhead costs.


Whether latency sensitive or latency tolerant, all workloads have certain expectations on processing uptime. From a user’s perspective, there must be guarantees on pipeline uptimes and data completeness, upper bounds on processing delays, instrumentation for alerting, and self-healing properties of the platform for remediation.

Tunable Tradeoffs

Some pipelines are latency sensitive, and rely on processing completeness seconds after event ingress. Other pipelines are latency tolerant, and can tolerate disruption to processing lasting in tens of minutes. A one size fits all solution is likely to be either cost inefficient or unreliable. Having a way for users to make these tradeoffs consciously becomes important for ensuring efficient processing guarantees at a reasonable cost. Similarly, in the case of upstream failures or unavailability, being able to tune failure modes (like wait, continue, or retry) comes in handy.

Stream Processing Framework

While basic event sourcing covers simple use cases like archival, more complicated ones benefit from a common framework that shifts the mental model for processing from per event processing to stream pipeline orchestration.
Given that Go is a “paved road” for back-end development at Grab, and we have service code and bindings for streaming data in a mono-repository, we built a Go framework with a subset of capabilities provided by other streaming frameworks like Flink4.

Logic Blocks In A Stream Processing Pipeline
Logic Blocks In A Stream Processing Pipeline


Some capabilities built into the framework include:

  • Deduplication: Enables pipelines to idempotently reprocess data in case of rewinds/replays, and provides some processing guarantees within a time window for certain use cases including sinking to datastores.
  • Filtering and Mapping: An ability to filter a source stream data and map them onto target streams.
  • Aggregation: An ability to generate and execute aggregation logic such as sum, avg, max, and min in a window.
  • Windowing: An ability to window processing into tumbling, sliding, and session windows.
  • Join: An ability to combine two streams together with certain join keys in a window.
  • Processor Chaining: Various functionalities can be chained to build more complicated pipelines from simpler ones. For example: filter a large stream into a smaller one, aggregate it over a time window, and then map it to a new stream.
  • Rewind: The ability to rewind the processing logic by a few hours through configuration.
  • Replay: The ability to replay archived data into the same or a separate pipeline via configuration.
  • Sinks: A number of connectors to standard Grab stores are provided, with concerns of auth, telemetry, etc. managed in the runtime.
  • Error Handling: Providing an easy way to indicate whether to wait, skip, and/or retry in case of upstream failures is an important tuning parameter that users need for making sensible tradeoffs in dimensions of backpressure, latency, correctness, etc.


Coban Platform
Coban Platform

Our event log is primarily a bunch of critical Kafka clusters, which are being polled by various pipelines deployed by service teams on the platform for incoming events. Each pipeline is an isolated deployment, has an identity, and the ability to connect to various upstream sinks to materialise results into, including the event log itself.
There is also a metastore available as an intermediate store for processing pipelines, so the pipelines themselves are stateless with their lifecycle completely beholden to the whims of their owners.

Anatomy of a Processing Pipeline

Anatomy Of A Stream Processing Pod
Anatomy Of A Stream Processing Pod

Anatomy of a Stream Processing Pod
Each stream processing pod (the smallest unit of a pipeline’s deployment) has three top level components:

  • Triggers: An interface that connects directly to the source of the data and converts it into an event channel.
  • Runtime: This is the app’s entrypoint and the orchestrator of the pod. It manages the worker pools, triggers, event channels, and lifecycle events.
  • The Pipeline plugin: The plugin is provided by the user, and conforms to a contract that the platform team publishes. It contains the domain logic for the pipeline and houses the pipeline orchestration defined by a user based on our Stream Processing Framework.

Deployment Infrastructure

Our deployment infrastructure heavily leverages Kubernetes on AWS. After a (pretty high) initial cost for infrastructure set up, we’ve found scaling to hundreds of pipelines a breeze with the Kubernetes provided controls. We package our stateless pipeline workloads into Kubernetes deployments, with each pod containing a unit of a stream pipeline, with sidecars that integrate them with our monitoring systems. Other cluster wide tooling deployed (usually as DaemonSets) deal with metric collection, log ingestion, and autoscaling. We currently use the Horizontal Pod Autoscaler5 to manage traffic elasticity, and the Cluster Autoscaler6 to manage worker node scaling.

A Typical Kubernetes Set Up On AWS


Some pipelines require storage for use cases ranging from deduplication to stores for materialised results of time windowed aggregations. All our pipelines have access to clusters of ScyllaDB instances (which we use as our internal store), made available to pipeline authors via interfaces in the Stream Processing Framework. Results of these aggregations are then made available to backend services via our GrabStats service, which is a thin query layer over the latest pipeline results.

Compute Isolation

A nice property of packaging pipelines as Kubernetes deployments is a good degree of compute workload isolation for pipelines. While node resources of pipeline pods are still shared (and there are potential noisy neighbour issues on matters like logging throughput), the pipeline pods of various pods can be scheduled and rescheduled across a wide range of nodes safely and swiftly, with minimal impact to pods of other pipelines.


Stateless processing pods mean we can set up backup or redundant Kubernetes clusters in hot-hot, hot-warm, or hot-cold modes. We use this to ensure high processing availability despite limited control plane guarantees from any single cluster. (Since EKS SLAs for the Kubernetes control plane guarantee only 99.9% uptime today7.) Transparent to our users, we make the deployment systems aware of multiple available targets for scheduling.

Availability vs Cost

As alluded to in the “Platform Requirements” section, having a way of trading off availability for cost becomes important where the requirements and criticality of each processing pipeline are very different. Given that AWS spot instances are a lot cheaper8 than on-demand ones, we use user annotated Kubernetes priority classes to determine deployment targets for pipelines. For latency tolerant pipelines, we schedule them on Spot instances which are routinely between 40-90% cheaper than on demand instances on which latency sensitive pipelines run. The caveat is that Spot instances occasionally disappear, and these workloads are disrupted until a replacement node for their scheduling can be found.

What’s Next?

  • Expand the ecosystem of triggers to support custom sources of data i.e. the “event log”, as well as push based (RPC driven) versus just pull based triggers
  • Build a control plane for API integration with pipeline lifecycle management
  • Move some workloads to use the Vertical Pod Autoscaler9 in Kubernetes instead of horizontal scaling, as most of our workloads have a limit on parallelism (which is their partition count in Kafka topics)
  • Move from Go plugins for pipelines to plugins over RPC, like what HashiCorp does10, to enable processing logic in non-Go languages.
  • Use either pod gossip or a service mesh with a control plane to set up quotas for shared infrastructure usage per pipeline. This is to protect upstream dependencies and the metastore from surges in event backlogs.
  • Improve availability guarantees for pipeline pods by occasionally redistributing/rescheduling pods across nodes in our Kubernetes cluster to prevent entire workloads being served out of a few large nodes.

Authored By Karan Kamath on behalf of the Coban team at Grab-
Zezhou Yu, Ryan Ooi, Hui Yang, Yuguang Xiao, Ling Zhang, Roy Kim, Matt Hino, Jump Char, Lincoln Lee, Jason Cusick, Shrinand Thakkar, Dean Barlan, Shivam Dixit, Shubham Badkur, Fahad Pervaiz, Andy Nguyen, Ravi Tandon, Ken Fishkin, and Jim Caputo.


Coban Sewu Waterfall Photo by Dwinanda Nurhanif Mujito on Unsplash

Cover Photo by tian kuan on Unsplash

  1. https://martinfowler.com/eaaDev/EventSourcing.html 

  2. https://medium.com/netflix-techblog/keystone-real-time-stream-processing-platform-a3ee651812a 

  3. https://martinfowler.com/bliki/CQRS.html 

  4. https://flink.apache.org 

  5. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ 

  6. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler 

  7. https://aws.amazon.com/eks/sla/ 

  8. https://aws.amazon.com/ec2/pricing/ 

  9. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler 

  10. https://github.com/hashicorp/go-plugin 

Improve Productivity and Reduce Overhead Expenses with Red Hat OpenShift Dedicated on AWS

Post Syndicated from Ryan Niksch original https://aws.amazon.com/blogs/architecture/improve-productivity-and-reduce-overhead-expenses-with-red-hat-openshift-dedicated-on-aws/

Red Hat OpenShift on AWS helps you develop, deploy, and manage container-based applications across on-premises and cloud environments. A recent case study from Cathay Pacific Airways proved that the use of the Red Hat OpenShift application platform can significantly improve developer productivity and reduce operational overhead by automating infrastructure, application deployment, and scaling. In this post, I explore how the architectural implementation and customization options of Red Hat OpenShift dedicated on AWS can cater to a variety of customer needs.

Red Hat OpenShift is a turn key solution providing  a container runtime, Kubernetes orchestration, container image repositories, pipeline, build process, monitoring, logging, role-based access control, granular policy-based control, and abstractions to simplify functions. Deploying a single turnkey solution, instead of building and integrating a collection of independent solutions or services, allows you to invest more time and effort in building meaningful applications for your business.

In the past, Red Hat OpenShift deployed on Amazon EC2 using an automated provisioning process with an open source solution, like the Red Hat OpenShift on AWS Quick Start. The Red Hat OpenShift Quick Start is an infrastructure as code solution which accelerates customer provisioning of Red Hat OpenShift on AWS. The OpenShift Quick Start adheres to the reference architecture to deploy Red Hat OpenShift on AWS in a resilient, scalable, well-architected manner. This reference architecture sees the control plane as a collection of load balanced master nodes for traffic routing, session state, scheduling, and monitoring. It also contains the application nodes where the customer’s containerized workloads run. This solution allowed customers to get up and running within three hours; however, it did not reduce management overhead because customers were required to monitor and maintain the infrastructure of the Red Hat OpenShift cluster.

Red Hat and AWS listened to customer feedback and created Red Hat OpenShift dedicated, a fully managed OpenShift implementation running exclusively on AWS. This implementation monitors the layers and functions, scales the layers to cater to consumption needs, and addresses operational concerns.

Customers now have access to a platform that helps manage control planes for business-critical solutions, like their developer and operational platforms.

Red Hat OpenShift Dedicated Infrastructure on AWS

You can purchase Red Hat OpenShift dedicated through the Red Hat account team. Red Hat OpenShift dedicated comes in two varieties: the Standard edition and the Cloud Choice edition (bring your own cloud).

redhar-openshit options

Figure 1: Red Hat OpenShift architecture illustrating master and infrastructure nodes spread over three Availability Zones and placed behind elastic load balancers.

Red Hat OpenShift dedicated adheres to the reference architecture defined by AWS and Red Hat. Master and infrastructure layers are spread across three AWS availability zones providing resilience within the OpenShift solution, as well as the underlying infrastructure.

Red Hat OpenShift Dedicated Standard Edition

In the Red Hat OpenShift dedicated standard edition, Red Hat deploys the OpenShift cluster into an AWS account owned and managed by Red Hat. Red Hat provides an aggregated bill for the OpenShift subscription fees, management fees, and AWS billing. This edition is ideal for customers who want everything to be managed for them. The Red Hat site reliability engineering  team (SRE) will monitor and manage healing, scaling, and patching of the cluster.

Red Hat OpenShift Cloud Choice Edition

The cloud choice edition allows customers to create their own AWS account, and then have the Red Hat OpenShift dedicated infrastructure provisioned into their existing account. The Red Hat SRE team provisions the Red Hat OpenShift cluster into the customer owned AWS account and manages the solution via IAM roles.

Figure 2: Red Hat OpenShift Cloud Choice IAM role separation

Red Hat provides billing for the Red Hat OpenShift Cloud Choice subscription and management fees, and AWS provides billing for the AWS resources. Keeping the Red Hat OpenShift infrastructure within your AWS account allows better cost controls.

Red Hat OpenShift Cloud Choice provides visibility into the resources running in your account; which is desirable if you have regulatory and auditing concerns. You can inspect, monitor, and audit resources within the AWS account — taking advantage of the rich AWS service set (AWS CloudTrail, AWS config, AWS CloudWatch, and AWS cost explorer).

You can also take advantage of cost management solutions like AWS organizations and consolidated billing. Customers with multiple business units using AWS can combine the usage across their accounts to share the volume pricing discounts resulting in cost savings for projects, departments, and companies.

Red Hat OpenShift Cloud Choice dedicated cannot be deployed into an account currently hosting other applications and resources. In order to maintain separation of control with the managed service, Red Hat OpenShift Cloud Choice dedicated requires an AWS account dedicated to the managed Red Hat OpenShift solution.

You can take advantage of cost reductions of up to 70% using Reserved Instances, which match the pervasive running instances. This is ideal for the master and infrastructure nodes of the Red Hat OpenShift solutions running in your account. The reference architecture for Red Hat OpenShift on AWS recommends spanning  nodes over three availability zones, which translates to three master instances. The master and infrastructure nodes scale differently; so, there will be three additional instances for the infrastructure nodes. Purchasing reserved instances to offset the costs of the master nodes and the infrastructure nodes can free up funds for your next project.


DevOps teams using either edition of Red Hat OpenShift dedicated have a rich console experience providing control over networking between application workloads, storage, and monitoring. Granular drill down consoles enable operations teams to focus on what is most critical to their organization.

Each interface is controlled through granular role-based access control. Teams have visibility of high-level cluster overviews where they are able to see visualizations of the overall health of the cluster; and they have access to more granular overviews of views of hosts, nodes, and containers. Application owners, key stake holders, and operations teams have access to a customizable dashboard displaying the running state. Teams can drill down to the underlying nodes, and further into the PODs and containers, should they wish to explore the status or overall health of the containerized micro services. The cluster-wide event stream provides the same drill down experience to logging events.

The drill down console menu options are illustrated in the screenshots below:

In summary, the partnership of Red Hat and AWS created a fully managed solution which directly answers customer feedback requests for a fully managed application platform running on the availability, scalability, and cost benefits of AWS. The solution allows visibility and control whenever and wherever you need it.

About the author

Ryan Niksch

Ryan Niksch is a Partner Solutions Architect focusing on application platforms, hybrid application solutions, and modernization. Ryan has worn many hats in his life and has a passion for tinkering and a desire to leave everything he touches a little better than when he found it.

Using AWS App Mesh with Fargate

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/using-aws-app-mesh-with-fargate/

This post is contributed by Tony Pujals | Senior Developer Advocate, AWS


AWS App Mesh is a service mesh, which provides a framework to control and monitor services spanning multiple AWS compute environments. My previous post provided a walkthrough to get you started. In it, I showed deploying a simple microservice application to Amazon ECS and configuring App Mesh to provide traffic control and observability.

In this post, I show more advanced techniques using AWS Fargate as an ECS launch type. I show you how to deploy a specific version of the colorteller service from the previous post. Finally, I move on and explore distributing traffic across other environments, such as Amazon EC2 and Amazon EKS.

I simplified this example for clarity, but in the real world, creating a service mesh that bridges different compute environments becomes useful. Fargate is a compute service for AWS that helps you run containerized tasks using the primitives (the tasks and services) of an ECS application. This lets you work without needing to directly configure and manage EC2 instances.


Solution overview

This post assumes that you already have a containerized application running on ECS, but want to shift your workloads to use Fargate.

You deploy a new version of the colorteller service with Fargate, and then begin shifting traffic to it. If all goes well, then you continue to shift more traffic to the new version until it serves 100% of all requests. Use the labels “blue” to represent the original version and “green” to represent the new version. The following diagram shows programmer model of the Color App.

You want to begin shifting traffic over from version 1 (represented by colorteller-blue in the following diagram) over to version 2 (represented by colorteller-green).

In App Mesh, every version of a service is ultimately backed by actual running code somewhere, in this case ECS/Fargate tasks. Each service has its own virtual node representation in the mesh that provides this conduit.

The following diagram shows the App Mesh configuration of the Color App.



After shifting the traffic, you must physically deploy the application to a compute environment. In this demo, colorteller-blue runs on ECS using the EC2 launch type and colorteller-green runs on ECS using the Fargate launch type. The goal is to test with a portion of traffic going to colorteller-green, ultimately increasing to 100% of traffic going to the new green version.


AWS compute model of the Color App.


Before following along, set up the resources and deploy the Color App as described in the previous walkthrough.


Deploy the Fargate app

To get started after you complete your Color App, configure it so that your traffic goes to colorteller-blue for now. The blue color represents version 1 of your colorteller service.

Log into the App Mesh console and navigate to Virtual routers for the mesh. Configure the HTTP route to send 100% of traffic to the colorteller-blue virtual node.

The following screenshot shows routes in the App Mesh console.

Test the service and confirm in AWS X-Ray that the traffic flows through the colorteller-blue as expected with no errors.

The following screenshot shows racing the colorgateway virtual node.


Deploy the new colorteller to Fargate

With your original app in place, deploy the send version on Fargate and begin slowly increasing the traffic that it handles rather than the original. The app colorteller-green represents version 2 of the colorteller service. Initially, only send 30% of your traffic to it.

If your monitoring indicates a healthy service, then increase it to 60%, then finally to 100%. In the real world, you might choose more granular increases with automated rollout (and rollback if issues arise), but this demonstration keeps things simple.

You pushed the gateway and colorteller images to ECR (see Deploy Images) in the previous post, and then launched ECS tasks with these images. For this post, launch an ECS task using the Fargate launch type with the same colorteller and envoy images. This sets up the running envoy container as a sidecar for the colorteller container.

You don’t have to manually configure the EC2 instances in a Fargate launch type. Fargate automatically colocates the sidecar on the same physical instance and lifecycle as the primary application container.

To begin deploying the Fargate instance and diverting traffic to it, follow these steps.


Step 1: Update the mesh configuration

You can download updated AWS CloudFormation templates located in the repo under walkthroughs/fargate.

This updated mesh configuration adds a new virtual node (colorteller-green-vn). It updates the virtual router (colorteller-vr) for the colorteller virtual service so that it distributes traffic between the blue and green virtual nodes at a 2:1 ratio. That is, the green node receives one-third of the traffic.

$ ./appmesh-colorapp.sh
Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - DEMO-appmesh-colorapp

Step 2: Deploy the green task to Fargate

The fargate-colorteller.sh script creates parameterized template definitions before deploying the fargate-colorteller.yaml CloudFormation template. The change to launch a colorteller task as a Fargate task is in fargate-colorteller-task-def.json.

$ ./fargate-colorteller.sh

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - DEMO-fargate-colorteller


Verify the Fargate deployment

The ColorApp endpoint is one of the CloudFormation template’s outputs. You can view it in the stack output in the AWS CloudFormation console, or fetch it with the AWS CLI:

$ colorapp=$(aws cloudformation describe-stacks --stack-name=$ENVIRONMENT_NAME-ecs-colorapp --query="Stacks[0
].Outputs[?OutputKey=='ColorAppEndpoint'].OutputValue" --output=text); echo $colorapp> ].Outputs[?OutputKey=='ColorAppEndpoint'].OutputValue" --output=text); echo $colorapp

Assign the endpoint to the colorapp environment variable so you can use it for a few curl requests:

$ curl $colorapp/color
{"color":"blue", "stats": {"blue":1}}

The 2:1 weight of blue to green provides predictable results. Clear the histogram and run it a few times until you get a green result:

$ curl $colorapp/color/clear

$ for ((n=0;n<200;n++)); do echo "$n: $(curl -s $colorapp/color)"; done

0: {"color":"blue", "stats": {"blue":1}}
1: {"color":"green", "stats": {"blue":0.5,"green":0.5}}
2: {"color":"blue", "stats": {"blue":0.67,"green":0.33}}
3: {"color":"green", "stats": {"blue":0.5,"green":0.5}}
4: {"color":"blue", "stats": {"blue":0.6,"green":0.4}}
5: {"color":"gre
en", "stats": {"blue":0.5,"green":0.5}}
6: {"color":"blue", "stats": {"blue":0.57,"green":0.43}}
7: {"color":"blue", "stats": {"blue":0.63,"green":0.38}}
8: {"color":"green", "stats": {"blue":0.56,"green":0.44}}
199: {"color":"blue", "stats": {"blue":0.66,"green":0.34}}

This reflects the expected result for a 2:1 ratio. Check everything on your AWS X-Ray console.

The following screenshot shows the X-Ray console map after the initial testing.

The results look good: 100% success, no errors.

You can now increase the rollout of the new (green) version of your service running on Fargate.

Using AWS CloudFormation to manage your stacks lets you keep your configuration under version control and simplifies the process of deploying resources. AWS CloudFormation also gives you the option to update the virtual route in appmesh-colorapp.yaml and deploy the updated mesh configuration by running appmesh-colorapp.sh.

For this post, use the App Mesh console to make the change. Choose Virtual routers for appmesh-mesh, and edit the colorteller-route. Update the HTTP route so colorteller-blue-vn handles 33.3% of the traffic and colorteller-green-vn now handles 66.7%.

Run your simple verification test again:

$ curl $colorapp/color/clear
fargate $ for ((n=0;n<200;n++)); do echo "$n: $(curl -s $colorapp/color)"; done
0: {"color":"green", "stats": {"green":1}}
1: {"color":"blue", "stats": {"blue":0.5,"green":0.5}}
2: {"color":"green", "stats": {"blue":0.33,"green":0.67}}
3: {"color":"green", "stats": {"blue":0.25,"green":0.75}}
4: {"color":"green", "stats": {"blue":0.2,"green":0.8}}
5: {"color":"green", "stats": {"blue":0.17,"green":0.83}}
6: {"color":"blue", "stats": {"blue":0.29,"green":0.71}}
7: {"color":"green", "stats": {"blue":0.25,"green":0.75}}
199: {"color":"green", "stats": {"blue":0.32,"green":0.68}}

If your results look good, double-check your result in the X-Ray console.

Finally, shift 100% of your traffic over to the new colorteller version using the same App Mesh console. This time, modify the mesh configuration template and redeploy it:

    Type: AWS::AppMesh::Route
      - ColorTellerVirtualRouter
      - ColorTellerGreenVirtualNode
      MeshName: !Ref AppMeshMeshName
      VirtualRouterName: colorteller-vr
      RouteName: colorteller-route
              - VirtualNode: colorteller-green-vn
                Weight: 1
            Prefix: "/"
$ ./appmesh-colorapp.sh
Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - DEMO-appmesh-colorapp

Again, repeat your verification process in both the CLI and X-Ray to confirm that the new version of your service is running successfully.



In this walkthrough, I showed you how to roll out an update from version 1 (blue) of the colorteller service to version 2 (green). I demonstrated that App Mesh supports a mesh spanning ECS services that you ran as EC2 tasks and as Fargate tasks.

In my next walkthrough, I will demonstrate that App Mesh handles even uncontainerized services launched directly on EC2 instances. It provides a uniform and powerful way to control and monitor your distributed microservice applications on AWS.

If you have any questions or feedback, feel free to comment below.

Scaling Kubernetes deployments with Amazon CloudWatch metrics

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/scaling-kubernetes-deployments-with-amazon-cloudwatch-metrics/

This post is contributed by Kwunhok Chan | Solutions Architect, AWS


In an earlier post, AWS introduced Horizontal Pod Autoscaler and Kubernetes Metrics Server support for Amazon Elastic Kubernetes Service. These tools make it easy to scale your Kubernetes workloads managed by EKS in response to built-in metrics like CPU and memory.

However, one common use case for applications running on EKS is the integration with AWS services. For example, you administer an application that processes messages published to an Amazon SQS queue. You want the application to scale according to the number of messages in that queue. The Amazon CloudWatch Metrics Adapter for Kubernetes (k8s-cloudwatch-adapter) helps.


Amazon CloudWatch Metrics Adapter for Kubernetes

The k8s-cloudwatch-adapter is an implementation of the Kubernetes Custom Metrics API and External Metrics API with integration for CloudWatch metrics. It allows you to scale your Kubernetes deployment using the Horizontal Pod Autoscaler (HPA) with CloudWatch metrics.



Before starting, you need the following:


Getting started

Before using the k8s-cloudwatch-adapter, set up a way to manage IAM credentials to Kubernetes pods. The CloudWatch Metrics Adapter requires the following permissions to access metric data from CloudWatch:


Create an IAM policy with the following template:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "*"

For demo purposes, I’m granting admin permissions to my Kubernetes worker nodes. Don’t do this in your production environment. To associate IAM roles to your Kubernetes pods, you may want to look at kube2iam or kiam.

If you’re using an EKS cluster, you most likely provisioned it with AWS CloudFormation. The following command uses AWS CloudFormation stacks to update the proper instance policy with the correct permissions:

aws iam attach-role-policy \
--policy-arn arn:aws:iam::aws:policy/AdministratorAccess \
--role-name $(aws cloudformation describe-stacks --stack-name ${STACK_NAME} --query 'Stacks[0].Parameters[?ParameterKey==`NodeInstanceRoleName`].ParameterValue' | jq -r ".[0]")


Make sure to replace ${STACK_NAME} with the nodegroup stack name from the AWS CloudFormation console .


You can now deploy the k8s-cloudwatch-adapter to your Kubernetes cluster.

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/deploy/adapter.yaml


This deployment creates a new namespace—custom-metrics—and deploys the necessary ClusterRole, Service Account, and Role Binding values, along with the deployment of the adapter. Use the created custom resource definition (CRD) to define the configuration for the external metrics to retrieve from CloudWatch. The adapter reads the configuration defined in ExternalMetric CRDs and loads its external metrics. That allows you to use HPA to autoscale your Kubernetes pods.


Verifying the deployment

Next, query the metrics APIs to see if the adapter is deployed correctly. Run the following command:

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq.
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [

There are no resources from the response because you haven’t registered any metric resources yet.


Deploying an Amazon SQS application

Next, deploy a sample SQS application to test out k8s-cloudwatch-adapter. The SQS producer and consumer are provided, together with the YAML files for deploying the consumer, metric configuration, and HPA.

Both the producer and consumer use an SQS queue named helloworld. If it doesn’t exist already, the producer creates this queue.

Deploy the consumer with the following command:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/consumer-deployment.yaml


You can verify that the consumer is running with the following command:

$ kubectl get deploy sqs-consumer
sqs-consumer   1         1         1            0           5s


Set up Amazon CloudWatch metric and HPA

Next, create an ExternalMetric resource for the CloudWatch metric. Take note of the Kind value for this resource. This CRD resource tells the adapter how to retrieve metric data from CloudWatch.

You define the query parameters used to retrieve the ApproximateNumberOfMessagesVisible for an SQS queue named helloworld. For details about how metric data queries work, see CloudWatch GetMetricData API.

apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric:
    name: hello-queue-length
    name: hello-queue-length
      resource: "deployment"
        - id: sqs_helloworld
              namespace: "AWS/SQS"
              metricName: "ApproximateNumberOfMessagesVisible"
                - name: QueueName
                  value: "helloworld"
            period: 300
            stat: Average
            unit: Count
          returnData: true


Create the ExternalMetric resource:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/externalmetric.yaml


Then, set up the HPA for your consumer. Here is the configuration to use:

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
  name: sqs-consumer-scaler
    apiVersion: apps/v1beta1
    kind: Deployment
    name: sqs-consumer
  minReplicas: 1
  maxReplicas: 10
  - type: External
      metricName: hello-queue-length
      targetValue: 30


This HPA rule starts scaling out when the number of messages visible in your SQS queue exceeds 30, and scales in when there are fewer than 30 messages in the queue.

Create the HPA resource:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/hpa.yaml


Generate load using a producer

Finally, you can start generating messages to the queue:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/producer-deployment.yaml

On a separate terminal, you can now watch your HPA retrieving the queue length and start scaling the replicas. SQS metrics generate at five-minute intervals, so give the process a few minutes:

$ kubectl get hpa sqs-consumer-scaler -w


Clean up

After you complete this experiment, you can delete the Kubernetes deployment and respective resources.

Run the following commands to remove the consumer, external metric, HPA, and SQS queue:

$ kubectl delete deploy sqs-producer
$ kubectl delete hpa sqs-consumer-scaler
$ kubectl delete externalmetric sqs-helloworld-length
$ kubectl delete deploy sqs-consumer

$ aws sqs delete-queue helloworld


Other CloudWatch integrations

AWS recently announced the preview for Amazon CloudWatch Container Insights, which monitors, isolates, and diagnoses containerized applications running on EKS and Kubernetes clusters. To get started, see Using Container Insights.


Get involved

This project is currently under development. AWS welcomes issues and pull requests, and would love to hear your feedback.

How could this adapter be best implemented to work in your environment? Visit the Amazon CloudWatch Metrics Adapter for Kubernetes project on GitHub and let AWS know what you think.

Learning AWS App Mesh

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/learning-aws-app-mesh/

This post is contributed by Geremy Cohen | Solutions Architect, Strategic Accounts, AWS

At re:Invent 2018, AWS announced AWS App Mesh, a service mesh that provides application-level networking. App Mesh makes it easy for your services to communicate with each other across multiple types of compute infrastructure, including:

App Mesh standardizes how your services communicate, giving you end-to-end visibility and ensuring high availability for your applications. Service meshes like App Mesh help you run and monitor HTTP and TCP services at scale.

Using the open source Envoy proxy, App Mesh gives you access to a wide range of tools from AWS partners and the open source community. Because all traffic in and out of each service goes through the Envoy proxy, all traffic can be routed, shaped, measured, and logged. This extra level of indirection lets you build your services in any language desired without having to use a common set of communication libraries.

In this six-part series of the post, I walk you through setup and configuration of App Mesh for popular platforms and use cases, beginning with EKS. Here’s the list of the parts:

  1. Part 1: Introducing service meshes.
  2. Part 2: Prerequisites for running on EKS.
  3. Part 3: Creating example microservices on Amazon EKS.
  4. Part 4: Installing the sidecar injector and CRDs.
  5. Part 5: Configuring existing microservices.
  6. Part 6: Deploying with the canary technique.


Throughout the post series, I use diagrams to help describe what’s being built. In the following diagram:

  • The circle represents the container in which your app (microservice) code runs.
  • The dome alongside the circle represents the App Mesh (Envoy) proxy running as a sidecar container. When there is no dome present, no service mesh functionality is implemented for the pod.
  • The arrows show communications traffic between the application container and the proxy, as well as between the proxy and other pods.

PART 1: Introducing service meshes

Life without a service mesh

Best practices call for implementing observability, analytics, and routing capabilities across your microservice infrastructure in a consistent manner.

Between any two interacting services, it’s critical to implement logging, tracing, and metrics gathering—not to mention dynamic routing and load balancing—with minimal impact to your actual application code.

Traditionally, to provide these capabilities, you would compile each service with one or more SDKs that provided this logic. This is known as the “in-process design pattern,” because this logic runs in the same process as the service code.

When you only run a small number of services, running multiple SDKs alongside your application code may not be a huge undertaking. If you can find SDKs that provide the required functionality on the platforms and languages on which you are developing, compiling it into your service code is relatively straightforward.

As your application matures, the in-process design pattern becomes increasingly complex:

  • The number of engineers writing code grows, so each engineer must learn the in-process SDKs in use. They must also spend time integrating the SDKs with their own service logic and the service logic of others.
  • In shops where polyglot development is prevalent, as the number of engineers grow, so may the number of coding languages in use. In these scenarios, you’ll need to make sure that your SDKs are supported on these new languages.
  • The platforms that your engineering teams deploy services to may also increase and become disparate. You may have begun with Node.js containers on Kubernetes, but now, new microservices are being deployed with AWS Lambda, EC2, and other managed services. You’ll need to make sure that the SDK solution that you’ve chosen is compatible with these common platforms.
  • If you’re fortunate to have platform and language support for the SDKs you’re using, inconsistencies across the various SDK languages may creep in. This is especially true when you find a gap in language or platform support and implement custom operational logic for a language or platform that is unsupported.
  • Assuming you’ve accommodated for all the previous caveats, by using SDKs compiled into your service logic, you’re tightly coupling your business logic with your operations logic.


Enter the service mesh

Considering the increasing complexity as your application matures, the true value of service meshes becomes clear. With a service mesh, you can decouple your microservices’ observability, analytics, and routing logic from the underlying infrastructure and application layers.

The following diagram combines the previous two. Instead of incorporating these features at the code level (in-process), an out-of-process “sidecar proxy” container (represented by the pink dome) runs alongside your application code’s container in each pod.


In this model, consistent and decoupled analytics, logging, tracing, and routing logic capabilities are running alongside each microservice in your infrastructure as a sidecar proxy. Each sidecar proxy is configured by a unique configuration ruleset, based on the services it’s responsible for proxying. With 100% of the communications between pods and services proxied, 100% of the traffic is now observable and actionable.


App Mesh as the service mesh

App Mesh implements this sidecar proxy via the production-proven Envoy proxy. Envoy is arguably the most popular open-source service proxy. Created at Lyft in 2016, Envoy is a stable OSS project with wide community support. It’s defined as a “Graduated Project” by the Cloud Native Computing Foundation (CNCF). Envoy is a popular proxy solution due to its lightweight C++-based design, scalable architecture, and successful deployment record.

In the following diagram, a sidecar runs alongside each container in your application to provide its proxying logic, syncing each of their unique configurations from the App Mesh control plane.

Each one of these proxies must have its own unique configuration ruleset pushed to it to operate correctly. To achieve this, DevOps teams can push their intended ruleset configuration to the App Mesh API. From there, the App Mesh control plane reliably keeps all proxy instances up-to-date with their desired configurations. App Mesh dynamically scales to hundreds of thousands of pods, tasks, EC2 instances, and Lambda functions, adjusting configuration changes accordingly as instances scale up, down, and restart.


App Mesh components

App Mesh is made up of the following components:

  • Service mesh: A logical boundary for network traffic between the services that reside within it.
  • Virtual nodes: A logical pointer to a Kubernetes service, or an App Mesh virtual service.
  • Virtual routers: Handles traffic for one or more virtual services within your mesh.
  • Routes: Associated with a virtual router, it directs traffic that matches a service name prefix to one or more virtual nodes.
  • Virtual services: An abstraction of a real service that is either provided by a virtual node directly, or indirectly by means of a virtual router.
  • App Mesh sidecar: The App Mesh sidecar container configures your pods to use the App Mesh service mesh traffic rules set up for your virtual routers and virtual nodes.
  • App Mesh injector: Makes it easy to auto-inject the App Mesh sidecars into your pods.
  • App Mesh custom resource definitions: (CRD) Provided to implement App Mesh CRUD and configuration operations directly from the kubectl CLI.  Alternatively, you may use the latest version of the AWS CLI.


In the following parts, I walk you through the setup and configuration of each of these components.


Conclusion of Part 1

In this first part, I discussed in detail the advantages that service meshes provide, and the specific components that make up the App Mesh service mesh. I hope the information provided helps you to understand the benefit of all services meshes, regardless of vendor.

If you’re intrigued by what you’ve learned so far, don’t stop now!

For even more background on the components of AWS App Mesh, check out the official AWS App Mesh documentation, and when you’re ready, check out part 2 in this post where I guide you through completing the prerequisite steps to run App Mesh in your own environment.



PART 2: Setting up AWS App Mesh on Amazon EKS


In part 1 of this series, I discussed the functionality of service meshes like AWS App Mesh provided on Kubernetes and other services. In this post, I walk you through completing the prerequisites required to install and run App Mesh in your own Amazon EKS-based Kubernetes environment.

When you have the environment set up, be sure to leave it intact if you plan on experimenting in the future with App Mesh on your own (or throughout this series of posts).



To run App Mesh, your environment must meet the following requirements.

  • An AWS account
  • The AWS CLI installed and configured
    • The minimal version supported is 1.16.133. You should have a Region set via the aws configure command. For this tutorial, it should work against all Regions where App Mesh and Amazon EKS are supported. Use us-west-2 if you don’t have a preference or are in doubt:
      aws configure set region us-west-2
  • The jq utility
    • The utility is required by scripts executed in this series. Make sure that you have it installed on the machine from which to run the tutorial steps.
  • Kubernetes and kubectl
    • The minimal Kubernetes and kubectl versions supported are 1.11. You need a Kubernetes cluster deployed on Amazon Elastic Compute Cloud (Amazon EC2) or on an Amazon EKS cluster. Although the steps in this tutorial demonstrate using App Mesh on Amazon EKS, the instructions also work on upstream k8s running on Amazon EC2.

Amazon EKS makes it easy to run Kubernetes on AWS. Start by creating an EKS cluster using eksctl.  For more information about how to use eksctl to spin up an EKS cluster for this exercise, see eksworkshop.com. That site has a great tutorial for getting up and running quickly with an account, as well as an EKS cluster.


Clone the tutorial repository

Clone the tutorial’s repository by issuing the following command in a directory of your choice:

git clone https://github.com/aws/aws-app-mesh-examples

Next, navigate to the repo’s /djapp examples directory:

cd aws-app-mesh-examples/examples/apps/djapp/

All the steps in this tutorial are executed out of this directory.


IAM permissions for the user and k8s worker nodes

Both k8s worker nodes and any principals (including yourself) running App Mesh AWS CLI commands must have the proper permissions to access the App Mesh service, as shown in the following code example:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "*"

To provide users with the correct permissions, add the previous policy to the user’s role or group, or create it as an inline policy.

To verify as a user that you have the correct permissions set for App Mesh, issue the following command:

aws appmesh list-meshes

If you have the proper permissions and haven’t yet created a mesh, you should get back an empty response like the following. If you did have a mesh created, you get a slightly more verbose response.

"meshes": []

If you do not have the proper permissions, you’ll see a response similar to the following:

An error occurred (AccessDeniedException) when calling the ListMeshes operation: User: arn:aws:iam::123abc:user/foo is not authorized to perform: appmesh:ListMeshes on resource: *

As a user, these permissions (or even the Administrator Access role) enable you to complete this tutorial, but it’s critical to implement least-privileged access for production or internet-facing deployments.


Adding the permissions for EKS worker nodes

If you’re using an Amazon EKS-based cluster to follow this tutorial (suggested), you can easily add the previous permissions to your k8s worker nodes with the following steps.

First, get the role under which your k8s workers are running:

INSTANCE_PROFILE_NAME=$(aws iam list-instance-profiles | jq -r '.InstanceProfiles[].InstanceProfileName' | grep nodegroup)
ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name $INSTANCE_PROFILE_NAME | jq -r '.InstanceProfile.Roles[] | .RoleName')

Upon running that command, the $ROLE_NAME environment variable should be output similar to:


Copy and paste the following code to add the permissions as an inline policy to your worker node instances:

cat << EoF > k8s-appmesh-worker-policy.json
  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Action": [
      "Resource": "*"

aws iam put-role-policy --role-name $ROLE_NAME --policy-name AppMesh-Policy-For-Worker --policy-document file://k8s-appmesh-worker-policy.json

To verify that the policy was attached to the role, run the following command:

aws iam get-role-policy --role-name $ROLE_NAME --policy-name AppMesh-Policy-For-Worker

To test that your worker nodes are able to use these permissions correctly, run the following job from the project’s directory.

NOTE: The following YAML is configured for the us-west-2 Region. If you are running your cluster and App Mesh out of a different Region, modify the –region value found in the command attribute (not in the image attribute) in the YAML before proceeding, as shown below:

command: ["aws","appmesh","list-meshes","—region","us-west-2"]

Execute the job by running the following command:

kubectl apply -f awscli.yaml

Make sure that the job is completed by issuing the command:

kubectl get jobs

You should see that the desired and successful values are both one:

awscli   1         1            1m

Inspect the output of the job:

kubectl logs jobs/awscli

Similar to the list-meshes call, the output of this command shows whether your nodes can make App Mesh API calls successfully.

This output shows that the workers have proper access:

"meshes": []

While this output shows that they don’t:

An error occurred (AccessDeniedException) when calling the ListMeshes operation: User: arn:aws:iam::123abc:user/foo is not authorized to perform: appmesh:ListMeshes on resource: *

If you have to troubleshoot further, you must first delete the job before you run it again to test it:

kubectl delete jobs/awscli

After you’ve verified that you have the proper permissions set, you are ready to move forward and understand more about the demo application you’re going to build on top of App Mesh.


Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:


This script does not delete any nodes in your k8s cluster. It only deletes the DJ App and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.


Conclusion of Part 2

In this second part of the series, I walked you through the prerequisites required to install and run App Mesh in an Amazon EKS-based Kubernetes environment. In part 3 , I show you how to create a simple microservice that can be implemented on an App Mesh service mesh.



PART 3: Creating example microservices on Amazon EKS


In part 2 of this series, I walked you through completing the setup steps needed to configure your environment to run AWS App Mesh. In this post, I walk you through creating three Amazon EKS-based microservices. These microservices work together to form an app called DJ App, which you use later to demonstrate App Mesh functionality.



Make sure that you’ve completed parts 1 and 2 of this series before running through the steps in this post.


Overview of DJ App

I’ll now walk you through creating an example app on App Mesh called DJ App, which is used for a cloud-based music service. This application is composed of the following three microservices:

  • dj
  • metal-v1
  • jazz-v1

The dj service makes requests to either the jazz or metal backends for artist lists. If the dj service requests from the jazz backend, then musical artists such as Miles Davis or Astrud Gilberto are returned. Requests made to the metal backend return artists such as Judas Priest or Megadeth.

Today, the dj service is hardwired to make requests to the metal-v1 service for metal requests and to the jazz-v1 service for jazz requests. Each time there is a new metal or jazz release, a new version of dj must also be rolled out to point to its new upstream endpoints. Although it works for now, it’s not an optimal configuration to maintain for the long term.

App Mesh can be used to simplify this architecture. By virtualizing the metal and jazz service via kubectl or the AWS CLI, routing changes can be made dynamically to the endpoints and versions of your choosing. That minimizes the need for the complete re-deployment of DJ App each time there is a new metal or jazz service release.


Create the initial architecture

To begin, I’ll walk you through creating the initial application architecture. As the following diagram depicts, in the initial architecture, there are three k8s services:

  • The dj service, which serves as the DJ App entrypoint
  • The metal-v1 service backend
  • The jazz-v1 service backend

As depicted by the arrows, the dj service will make requests to either the metal-v1, or jazz-v1 backends.

First, deploy the k8s components that make up this initial architecture. To keep things organized, create a namespace for the app called prod, and deploy all of the DJ App components into that namespace. To create the prod namespace, issue the following command:

kubectl apply -f 1_create_the_initial_architecture/1_prod_ns.yaml

The output should be similar to the following:

namespace/prod created

Now that you’ve created the prod namespace, deploy the DJ App (the dj, metal, and jazz microservices) into it. Create the DJ App deployment in the prod namespace by issuing the following command:

kubectl apply -nprod -f 1_create_the_initial_architecture/1_initial_architecture_deployment.yaml

The output should be similar to:

deployment.apps "dj" created
deployment.apps "metal-v1" created
deployment.apps "jazz-v1" created

Create the services that front these deployments by issuing the following command:

kubectl apply -nprod -f 1_create_the_initial_architecture/1_initial_architecture_services.yaml

The output should be similar to:

service "dj" created
service "metal-v1" created
service "jazz-v1" created

Now, verify that everything has been set up correctly by getting all resources from the prod namespace. Issue this command:

kubectl get all -nprod

The output should display the dj, jazz, and metal pods, and the services, deployments, and replica sets, similar to the following:

NAME                            READY   STATUS    RESTARTS   AGE
pod/dj-5b445fbdf4-qf8sv         1/1     Running   0          1m
pod/jazz-v1-644856f4b4-mshnr    1/1     Running   0          1m
pod/metal-v1-84bffcc887-97qzw   1/1     Running   0          1m

NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/dj         ClusterIP   <none>        9080/TCP   15s
service/jazz-v1    ClusterIP   <none>        9080/TCP   15s
service/metal-v1   ClusterIP   <none>        9080/TCP   15s

deployment.apps/dj         1         1         1            1           1m
deployment.apps/jazz-v1    1         1         1            1           1m
deployment.apps/metal-v1   1         1         1            1           1m

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/dj-5b445fbdf4         1         1         1       1m
replicaset.apps/jazz-v1-644856f4b4    1         1         1       1m
replicaset.apps/metal-v1-84bffcc887   1         1         1       1m

When you’ve verified that all resources have been created correctly in the prod namespace, test out this initial version of DJ App. To do that, exec into the DJ pod, and issue a curl request out to the jazz-v1 and metal-v1 backends. Get the name of the DJ pod by listing all the pods with the dj app selector:

kubectl get pods -nprod -l app=dj

The output should be similar to:

NAME                  READY     STATUS    RESTARTS   AGE
dj-5b445fbdf4-8xkwp   1/1       Running   0          32s

Next, exec into the DJ pod:

kubectl exec -nprod -it <your-dj-pod-name> bash

The output should be similar to:

[email protected]:/usr/src/app#

Now that you have a root prompt into the DJ pod, issue a curl request to the jazz-v1 backend service:

curl jazz-v1.prod.svc.cluster.local:9080;echo

The output should be similar to:

["Astrud Gilberto","Miles Davis"]

Try it again, but this time issue the command to the metal-v1.prod.svc.cluster.local backend on port 9080:

curl metal-v1.prod.svc.cluster.local:9080;echo

You should get a list of heavy metal bands:

["Megadeth","Judas Priest"]

When you’re done exploring this vast world of music, press CTRL-D, or type exit to exit the container’s shell:

[email protected]:/usr/src/app# exit
command terminated with exit code 1

Congratulations on deploying the initial DJ App architecture!


Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:


This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.


Conclusion of Part 3

In this third part of the series, I demonstrated how to create three simple Kubernetes-based microservices, which working together, form an app called DJ App. This app is later used to demonstrate App Mesh functionality.

In part 4, I show you how to install the App Mesh sidecar injector and CRDs, which make defining and configuring App Mesh components easy.



PART 4: Installing the sidecar injector and CRDs


In part 3 of this series, I walked you through setting up a basic microservices-based application called DJ App on Kubernetes with Amazon EKS. In this post, I demonstrate how to set up and configure the AWS App Mesh sidecar injector and custom resource definitions (CRDs).  As you will see later, the sidecar injector and CRD components make defining and configuring DJ App’s service mesh more convenient.



Make sure that you’ve completed parts 1–3 of this series before running through the steps in this post.


Installing the App Mesh sidecar

As decoupled logic, an App Mesh sidecar container must run alongside each pod in the DJ App deployment. This can be set up in few different ways:

  1. Before installing the deployment, you could modify the DJ App deployment’s container specs to include App Mesh sidecar containers. When the app is deployed, it would run the sidecar.
  2. After installing the deployment, you could patch the deployment to include the sidecar container specs. Upon applying this patch, the old pods are torn down, and the new pods come up with the sidecar.
  3. You can implement the App Mesh injector controller, which watches for new pods to be created and automatically adds the sidecar data to the pods as they are deployed.

For this tutorial, I walk you through the App Mesh injector controller option, as it enables subsequent pod deployments to automatically come up with the App Mesh sidecar. This is not only quicker in the long run, but it also reduces the chances of typos that manual editing may introduce.


Creating the injector controller

To create the injector controller, run a script that creates a namespace, generates certificates, and then installs the injector deployment.

From the base repository directory, change to the injector directory:

cd 2_create_injector

Next, run the create.sh script:


The output should look similar to the following:

namespace/appmesh-inject created
creating certs in tmpdir /var/folders/02/qfw6pbm501xbw4scnk20w80h0_xvht/T/tmp.LFO95khQ
Generating RSA private key, 2048 bit long modulus
e is 65537 (0x10001)
certificatesigningrequest.certificates.k8s.io/aws-app-mesh-inject.appmesh-inject created
NAME                                 AGE   REQUESTOR          CONDITION
aws-app-mesh-inject.appmesh-inject   0s    kubernetes-admin   Pending
certificatesigningrequest.certificates.k8s.io/aws-app-mesh-inject.appmesh-inject approved
secret/aws-app-mesh-inject created

processing templates
Created injector manifest at:/2_create_injector/inject.yaml

serviceaccount/aws-app-mesh-inject-sa created
clusterrole.rbac.authorization.k8s.io/aws-app-mesh-inject-cr unchanged
clusterrolebinding.rbac.authorization.k8s.io/aws-app-mesh-inject-binding configured
service/aws-app-mesh-inject created
deployment.apps/aws-app-mesh-inject created
mutatingwebhookconfiguration.admissionregistration.k8s.io/aws-app-mesh-inject unchanged

Waiting for pods to come up...

App Inject Pods and Services After Install:

NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
aws-app-mesh-inject   ClusterIP   <none>        443/TCP   16s
NAME                                   READY   STATUS    RESTARTS   AGE
aws-app-mesh-inject-5d84d8c96f-gc6bl   1/1     Running   0          16s

If you’re seeing this output, the injector controller has been installed correctly. By default, the injector doesn’t act on any pods—you must give it the criteria on what to act on. For the purpose of this tutorial, you’ll next configure it to inject the App Mesh sidecar into any new pods created in the prod namespace.

Return to the repo’s base directory:

cd ..

Run the following command to label the prod namespace:

kubectl label namespace prod appmesh.k8s.aws/sidecarInjectorWebhook=enabled

The output should be similar to the following:

namespace/prod labeled

Next, verify that the injector controller is running:

kubectl get pods -nappmesh-inject

You should see output similar to the following:

NAME                                   READY   STATUS    RESTARTS   AGE
aws-app-mesh-inject-78c59cc699-9jrb4   1/1     Running   0          1h

With the injector portion of the setup complete, I’ll now show you how to create the App Mesh components.


Choosing a way to create the App Mesh components

There are two ways to create the components of the App Mesh service mesh:

For this tutorial, I show you how to use kubectl to define the App Mesh components.  To do this, add the CRDs and the App Mesh controller logic that syncs your Kubernetes cluster’s CRD state with the AWS Cloud App Mesh control plane.


Adding the CRDs and App Mesh controller

To add the CRDs, run the following commands from the repository base directory:

kubectl apply -f 3_add_crds/mesh-definition.yaml
kubectl apply -f 3_add_crds/virtual-node-definition.yaml
kubectl apply -f 3_add_crds/virtual-service-definition.yaml

The output should be similar to the following:

customresourcedefinition.apiextensions.k8s.io/meshes.appmesh.k8s.aws created
customresourcedefinition.apiextensions.k8s.io/virtualnodes.appmesh.k8s.aws created
customresourcedefinition.apiextensions.k8s.io/virtualservices.appmesh.k8s.aws created

Next, add the controller by executing the following command:

kubectl apply -f 3_add_crds/controller-deployment.yaml

The output should be similar to the following:

namespace/appmesh-system created
deployment.apps/app-mesh-controller created
serviceaccount/app-mesh-sa created
clusterrole.rbac.authorization.k8s.io/app-mesh-controller created
clusterrolebinding.rbac.authorization.k8s.io/app-mesh-controller-binding created

Run the following command to verify that the App Mesh controller is running:

kubectl get pods -nappmesh-system

You should see output similar to the following:

NAME                                   READY   STATUS    RESTARTS   AGE
app-mesh-controller-85f9d4b48f-j9vz4   1/1     Running   0          7m

NOTE: The CRD and injector are AWS-supported open source projects. If you plan to deploy the CRD or injector for production projects, always build them from the latest AWS GitHub repos and deploy them from your own container registry. That way, you stay up-to-date on the latest features and bug fixes.


Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:


This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.


Conclusion of Part 4

In this fourth part of the series, I walked you through setting up the App Mesh sidecar injector and CRD components. In part 5, I show you how to define the App Mesh components required to run DJ App on a service mesh.



PART 5: Configuring existing microservices


In part 4 of this series, I demonstrated how to set up the AWS App Mesh Sidecar Injector and CRDs. In this post, I’ll show how to configure the DJ App microservices to run on top of App Mesh by creating the required App Mesh components.



Make sure that you’ve completed parts 1–4 of this series before running through the steps in this post.


DJ App revisited

As shown in the following diagram, the dj service is hardwired to make requests to either the metal-v1 or jazz-v1 backends.

The service mesh-enabled version functionally does exactly what the current version does. The only difference is that you use App Mesh to create two new virtual services called metal and jazz. The dj service now makes a request to these metal or jazz virtual services, which route to their metal-v1 and jazz-v1 counterparts accordingly, based on the virtual services’ routing rules. The following diagram depicts this process.

By virtualizing the metal and jazz services, you can dynamically configure routing rules to the versioned backends of your choosing. That eliminates the need to re-deploy the entire DJ App each time there’s a new metal or jazz service version release.


Now that you have a better idea of what you’re building, I’ll show you how to create the mesh.


Creating the mesh

The mesh component, which serves as the App Mesh foundation, must be created first. Call the mesh dj-app, and define it in the prod namespace by executing the following command from the repository’s base directory:

kubectl create -f 4_create_initial_mesh_components/mesh.yaml

You should see output similar to the following:

mesh.appmesh.k8s.aws/dj-app created

Because an App Mesh mesh is a custom resource, kubectl can be used to view it using the get command. Run the following command:

kubectl get meshes -nprod

This yields the following:

dj-app   1h

As is the case for any of the custom resources you interact with in this tutorial, you can also view App Mesh resources using the AWS CLI:

aws appmesh list-meshes

    "meshes": [
            "meshName": "dj-app",
            "arn": "arn:aws:appmesh:us-west-2:123586676:mesh/dj-app"

aws appmesh describe-mesh --mesh-name dj-app

    "mesh": {
        "status": {
            "status": "ACTIVE"
        "meshName": "dj-app",
        "metadata": {
            "version": 1,
            "lastUpdatedAt": 1553233281.819,
            "createdAt": 1553233281.819,
            "arn": "arn:aws:appmesh:us-west-2:123586676:mesh/dj-app",
            "uid": "10d86ae0-ece7-4b1d-bc2d-08064d9b55e1"

NOTE: If you do not see dj-app returned from the previous list-meshes command, then your user account (as well as your worker nodes) may not have the correct IAM permissions to access App Mesh resources. Verify that you and your worker nodes have the correct permissions per part 2 of this series.


Creating the virtual nodes and virtual services

With the foundational mesh component created, continue onward to define the App Mesh virtual node and virtual service components. All physical Kubernetes services that interact with each other in App Mesh must first be defined as virtual node objects.

Abstracting out services as virtual nodes helps App Mesh build rulesets around inter-service communication. In addition, as you define virtual service objects, virtual nodes may be referenced as inputs and target endpoints for those virtual services. Because of this, it makes sense to define the virtual nodes first.

Based on the first App Mesh-enabled architecture, the physical service dj makes requests to two new virtual services—metal and jazz. These services route requests respectively to the physical services metal-v1 and jazz-v1, as shown in the following diagram.

Because there are three physical services involved in this configuration, you’ll need to define three virtual nodes. To do that, enter the following:

kubectl create -nprod -f 4_create_initial_mesh_components/nodes_representing_physical_services.yaml

The output should be similar to:

virtualnode.appmesh.k8s.aws/dj created
virtualnode.appmesh.k8s.aws/jazz-v1 created
virtualnode.appmesh.k8s.aws/metal-v1 created

If you open up the YAML in your favorite editor, you may notice a few things about these virtual nodes.

They’re both similar, but for the purposes of this tutorial, examine just the metal-v1.prod.svc.cluster.local VirtualNode:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualNode
  name: metal-v1
  namespace: prod
  meshName: dj-app
    - portMapping:
        port: 9080
        protocol: http
      hostName: metal-v1.prod.svc.cluster.local


According to this YAML, this virtual node points to a service (spec.serviceDiscovery.dns.hostName: metal-v1.prod.svc.cluster.local) that listens on a given port for requests (spec.listeners.portMapping.port: 9080).

You may notice that jazz-v1 and metal-v1 are similar to the dj virtual node, with one key difference; the dj virtual node contains a backend attribute:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualNode
  name: dj
  namespace: prod
  meshName: dj-app
    - portMapping:
        port: 9080
        protocol: http
      hostName: dj.prod.svc.cluster.local
    - virtualService:
        virtualServiceName: jazz.prod.svc.cluster.local
    - virtualService:
        virtualServiceName: metal.prod.svc.cluster.local

The backend attribute specifies that dj is allowed to make requests to the jazz and metal virtual services only.

At this point, you’ve created three virtual nodes:

kubectl get virtualnodes -nprod

NAME            AGE
dj              6m
jazz-v1         6m
metal-v1        6m

The last step is to create the two App Mesh virtual services that intercept and route requests made to jazz and metal. To do this, run the following command:

kubectl apply -nprod -f 4_create_initial_mesh_components/virtual-services.yaml

The output should be similar to:

virtualservice.appmesh.k8s.aws/jazz.prod.svc.cluster.local created
virtualservice.appmesh.k8s.aws/metal.prod.svc.cluster.local created

If you inspect the YAML, you may notice that it created two virtual service resources. Requests made to jazz.prod.svc.cluster.local are intercepted by App Mesh and routed to the virtual node jazz-v1.

Similarly, requests made to metal.prod.svc.cluster.local are routed to the virtual node metal-v1:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualService
  name: jazz.prod.svc.cluster.local
  namespace: prod
  meshName: dj-app
    name: jazz-router
    - name: jazz-route
          prefix: /
            - virtualNodeName: jazz-v1
              weight: 100

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualService
  name: metal.prod.svc.cluster.local
  namespace: prod
  meshName: dj-app
    name: metal-router
    - name: metal-route
          prefix: /
            - virtualNodeName: metal-v1
              weight: 100

NOTE: Remember to use fully qualified DNS names for the virtual service’s metadata.name field to prevent the chance of name collisions when using App Mesh cross-cluster.

With these virtual services defined, to access them by name, clients (in this case, the dj container) first perform a DNS lookup to jazz.prod.svc.cluster.local or metal.prod.svc.cluster.local before making the HTTP request.

If the dj container (or any other client) cannot resolve that name to an IP, the subsequent HTTP request fails with a name lookup error.

The existing physical services (jazz-v1, metal-v1, dj) are defined as physical Kubernetes services, and therefore have resolvable names:

kubectl get svc -nprod

dj         ClusterIP   <none>        9080/TCP   16h
jazz-v1    ClusterIP   <none>        9080/TCP   16h
metal-v1   ClusterIP   <none>        9080/TCP   16h

However, the new jazz and metal virtual services we just created don’t (yet) have resolvable names.

To provide the jazz and metal virtual services with resolvable IP addresses and hostnames, define them as Kubernetes services that do not map to any deployments or pods. Do this by creating them as k8s services without defining selectors for them. Because App Mesh is intercepting and routing requests made for them, they don’t have to map to any pods or deployments on the k8s-side.

To register the placeholder names and IP addresses for these virtual services, run the following command:

kubectl create -nprod -f 4_create_initial_mesh_components/metal_and_jazz_placeholder_services.yaml

The output should be similar to:

service/jazz created
service/metal created

You can now use kubectl to get the registered metal and jazz virtual services:

kubectl get -nprod virtualservices

NAME                           AGE
jazz.prod.svc.cluster.local    10m
metal.prod.svc.cluster.local   10m

You can also get the virtual service placeholder IP addresses and physical service IP addresses:

kubectl get svc -nprod

dj         ClusterIP   <none>        9080/TCP   17h
jazz       ClusterIP   <none>        9080/TCP   27s
jazz-v1    ClusterIP   <none>        9080/TCP   17h
metal      ClusterIP   <none>        9080/TCP   27s
metal-v1   ClusterIP   <none>        9080/TCP   17h

As such, when name lookup requests are made to your virtual services alongside their physical service counterparts, they resolve.

Currently, if you describe any of the pods running in the prod namespace, they are running with just one container (the same one with which you initially deployed it):

kubectl get pods -nprod

NAME                        READY   STATUS    RESTARTS   AGE
dj-5b445fbdf4-qf8sv         1/1     Running   0          3h
jazz-v1-644856f4b4-mshnr    1/1     Running   0          3h
metal-v1-84bffcc887-97qzw   1/1     Running   0          3h

kubectl describe pods/dj-5b445fbdf4-qf8sv -nprod

    Container ID:   docker://76e6d5f7101dfce60158a63cf7af9fcb3c821c087db360e87c5e2fb8850b7aa9
    Image:          970805265562.dkr.ecr.us-west-2.amazonaws.com/hello-world:latest
    Image ID:       docker-pullable://970805265562.dkr.ecr.us-west-2.amazonaws.com/[email protected]:581fe44cf2413a48f0cdf005b86b025501eaff6cafc7b26367860e07be060753
    Port:           9080/TCP
    Host Port:      0/TCP
    State:          Running

The injector controller installed earlier watches for new pods to be created and ensures that any new pods created in the prod namespace are injected with the App Mesh sidecar. Because the dj pods were already running before the injector was created, you’ll now force them to be re-created, this time with the sidecars auto-injected into them.

In production, there are more graceful ways to do this. For the purpose of this tutorial, an easy way to have the deployment re-create the pods in an innocuous fashion is to patch a simple date annotation into the deployment.

To do that with your current deployment, first get all the prod namespace pod names:

kubectl get pods -nprod

The output is the pod names:

NAME                        READY   STATUS    RESTARTS   AGE
dj-5b445fbdf4-qf8sv         1/1     Running   0          3h
jazz-v1-644856f4b4-mshnr    1/1     Running   0          3h
metal-v1-84bffcc887-97qzw   1/1     Running   0          3h


Under the READY column, you see 1/1, which indicates that one container is running for each pod.

Next, run the following commands to add a date label to each dj, jazz-v1, and metal-1 deployment, forcing the pods to be re-created:

kubectl patch deployment dj -nprod -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}"
kubectl patch deployment metal-v1 -nprod -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}"
kubectl patch deployment jazz-v1 -nprod -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}"

Again, get the pods:

kubectl get pods -nprod

Under READY, you see 2/2, which indicates that two containers for each pod are running:

NAME                        READY   STATUS    RESTARTS   AGE
dj-6cfb85cdd9-z5hsp         2/2     Running   0          10m
jazz-v1-79d67b4fd6-hdrj9    2/2     Running   0          16s
metal-v1-769b58d9dc-7q92q   2/2     Running   0          18s

NOTE: If you don’t see this exact output, wait about 10 seconds (your redeployment is underway), and re-run the command.

Now describe the new dj pod to get more detail:

    Container ID:   docker://bef63f2e45fb911f78230ef86c2a047a56c9acf554c2272bc094300c6394c7fb
    Image:          970805265562.dkr.ecr.us-west-2.amazonaws.com/hello-world:latest
    Container ID:   docker://2bd0dc0707f80d436338fce399637dcbcf937eaf95fed90683eaaf5187fee43a
    Image:          111345817488.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.8.0.2-beta

Both the original container and the auto-injected sidecar are running for any new pods created in the prod namespace.

Testing the App Mesh architecture

To test if the new architecture is working as expected, exec into the dj container. Get the name of your dj pod by listing all pods with the dj selector:

kubectl get pods -nprod -lapp=dj

The output should be similar to the following:

NAME                  READY     STATUS    RESTARTS   AGE
dj-5b445fbdf4-8xkwp   1/1       Running   0          32s

Next, exec into the dj pod returned from the last step:

kubectl exec -nprod -it <your-dj-pod-name> bash

The output should be similar to:

[email protected]:/usr/src/app#

Now that you have a root prompt into the dj pod, make a curl request to the virtual service jazz on port 9080. Your request simulates what would happen if code running in the same pod made a request to the jazz backend:

curl jazz.prod.svc.cluster.local:9080;echo

The output should be similar to the following:

["Astrud Gilberto","Miles Davis"]

Try it again, but issue the command to the virtual metal service:

curl metal.prod.svc.cluster.local:9080;echo

You should get a list of heavy metal bands:

["Megadeth","Judas Priest"]

When you’re done exploring this vast, service-mesh-enabled world of music, press CTRL-D, or type exit to exit the container’s shell:

[email protected]:/usr/src/app# exit
command terminated with exit code 1


Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:


This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.

Conclusion of Part 5

In this fifth part of the series, you learned how to enable existing microservices to run on App Mesh. In part 6, I demonstrate the true power of App Mesh by walking you through adding new versions of the metal and jazz services and demonstrating how to route between them.



PART 6: Deploying with the canary technique

In part 5 of this series, I demonstrated how to configure an existing microservices-based application (DJ App) to run on AWS App Mesh. In this post, I demonstrate how App Mesh can be used to deploy new versions of Amazon EKS-based microservices using the canary technique.


Make sure that you’ve completed parts 1–5 of this series before running through the steps in this post.

Canary testing with v2

A canary release is a method of slowly exposing a new version of software. The theory is that by serving the new version of the software to a small percentage of requests, any problems only affect the small percentage of users before they’re discovered and rolled back.

So now, back to the DJ App scenario. Version 2 of the metal and jazz services is out, and they now include the city that each artist is from in the response. You’ll now release v2 versions of the metal and jazz services in a canary fashion using App Mesh. When you complete this process, requests to the metal and jazz services are distributed in a weighted fashion to both the v1 and v2 versions.

The following diagram shows the final (v2) seven-microservices-based application, running on an App Mesh service mesh.



To begin, roll out the v2 deployments, services, and virtual nodes with a single YAML file:

kubectl apply -nprod -f 5_canary/jazz_v2.yaml

The output should be similar to the following:

deployment.apps/jazz-v2 created
service/jazz-v2 created
virtualnode.appmesh.k8s.aws/jazz-v2 created

Next, update the jazz virtual service by modifying the route to spread traffic 50/50 across the two versions. Look at it now, and see that the current route points 100% to jazz-v1:

kubectl describe virtualservice jazz -nprod

Name:         jazz.prod.svc.cluster.local
Namespace:    prod
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:

API Version:  appmesh.k8s.aws/v1beta1
Kind:         VirtualService
  Creation Timestamp:  2019-03-23T00:15:08Z
  Generation:          3
  Resource Version:    2851527
  Self Link:           /apis/appmesh.k8s.aws/v1beta1/namespaces/prod/virtualservices/jazz.prod.svc.cluster.local
  UID:                 b76eed59-4d00-11e9-87e6-06dd752b96a6
  Mesh Name:  dj-app
        Weighted Targets:
          Virtual Node Name:  jazz-v1
          Weight:             100
        Prefix:  /
    Name:        jazz-route
  Virtual Router:
    Name:  jazz-router
Events:  <none>

Apply the updated service definition:

kubectl apply -nprod -f 5_canary/jazz_service_update.yaml

When you describe the virtual service again, you see the updated route:

kubectl describe virtualservice jazz -nprod

Name:         jazz.prod.svc.cluster.local
Namespace:    prod
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:

API Version:  appmesh.k8s.aws/v1beta1
Kind:         VirtualService
  Creation Timestamp:  2019-03-23T00:15:08Z
  Generation:          4
  Resource Version:    2851774
  Self Link:           /apis/appmesh.k8s.aws/v1beta1/namespaces/prod/virtualservices/jazz.prod.svc.cluster.local
  UID:                 b76eed59-4d00-11e9-87e6-06dd752b96a6
  Mesh Name:  dj-app
        Weighted Targets:
          Virtual Node Name:  jazz-v1
          Weight:             90
          Virtual Node Name:  jazz-v2
          Weight:             10
        Prefix:  /
    Name:        jazz-route
  Virtual Router:
    Name:  jazz-router
Events:  <none>

To deploy metal-v2, perform the same steps. Roll out the v2 deployments, services, and virtual nodes with a single YAML file:

kubectl apply -nprod -f 5_canary/metal_v2.yaml

The output should be similar to the following:

deployment.apps/metal-v2 created
service/metal-v2 created
virtualnode.appmesh.k8s.aws/metal-v2 created

Update the metal virtual service by modifying the route to spread traffic 50/50 across the two versions:

kubectl apply -nprod -f 5_canary/metal_service_update.yaml

When you describe the virtual service again, you see the updated route:

kubectl describe virtualservice metal -nprod

Name:         metal.prod.svc.cluster.local
Namespace:    prod
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:

API Version:  appmesh.k8s.aws/v1beta1
Kind:         VirtualService
  Creation Timestamp:  2019-03-23T00:15:08Z
  Generation:          2
  Resource Version:    2852282
  Self Link:           /apis/appmesh.k8s.aws/v1beta1/namespaces/prod/virtualservices/metal.prod.svc.cluster.local
  UID:                 b784e824-4d00-11e9-87e6-06dd752b96a6
  Mesh Name:  dj-app
        Weighted Targets:
          Virtual Node Name:  metal-v1
          Weight:             50
          Virtual Node Name:  metal-v2
          Weight:             50
        Prefix:  /
    Name:        metal-route
  Virtual Router:
    Name:  metal-router
Events:  <none>

Testing the v2 jazz and metal services

Now that the v2 services are deployed, it’s time to test them out. To test if it’s working as expected, exec into the DJ pod. To do that, get the name of your dj pod by listing all pods with the dj selector:

kubectl get pods -nprod -l app=dj

The output should be similar to the following:

NAME                  READY     STATUS    RESTARTS   AGE
dj-5b445fbdf4-8xkwp   1/1       Running   0          32s

Next, exec into the DJ pod by running the following command:

kubectl exec -nprod -it <your dj pod name> bash

The output should be similar to the following:

[email protected]:/usr/src/app#

Now that you have a root prompt into the DJ pod, issue a curl request to the metal virtual service:

while [ 1 ]; do curl http://metal.prod.svc.cluster.local:9080/;echo; done

The output should loop about 50/50 between the v1 and v2 versions of the metal service, similar to:

["Megadeth","Judas Priest"]
["Megadeth (Los Angeles, California)","Judas Priest (West Bromwich, England)"]
["Megadeth","Judas Priest"]
["Megadeth (Los Angeles, California)","Judas Priest (West Bromwich, England)"]

Press CTRL-C to stop the looping.

Next, perform a similar test, but against the jazz service. Issue a curl request to the jazz virtual service from within the dj pod:

while [ 1 ]; do curl http://jazz.prod.svc.cluster.local:9080/;echo; done

The output should loop about in a 90/10 ratio between the v1 and v2 versions of the jazz service, similar to the following:

["Astrud Gilberto","Miles Davis"]
["Astrud Gilberto","Miles Davis"]
["Astrud Gilberto","Miles Davis"]
["Astrud Gilberto (Bahia, Brazil)","Miles Davis (Alton, Illinois)"]
["Astrud Gilberto","Miles Davis"]

Press CTRL-C to stop the looping, and then type exit to exit the pod’s shell.

Cleaning up

When you’re done experimenting and want to delete all the resources created during this tutorial series, run the cleanup script via the following command line:


This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own.

Conclusion of Part 6

In this final part of the series, I demonstrated how App Mesh can be used to roll out new microservice versions using the canary technique. Feel free to experiment further with the cluster by adding or removing microservices, and tweaking routing rules by changing weights and targets.


Geremy is a solutions architect at AWS.  He enjoys spending time with his family, BBQing, and breaking and fixing things around the house.


Updates to Amazon EKS Version Lifecycle

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/updates-to-amazon-eks-version-lifecycle/

Contributed by Nathan Taber and Michael Hausenblas

At re:Invent 2017 we introduced the Amazon Elastic Container Service for Kubernetes, or Amazon EKS for short. We consider these tenets as valid today as they were at launch:

  • EKS is a platform to run production-grade workloads. This means that security and reliability are our first priority. After that we focus on doing the heavy lifting for you in the control plane, including life cycle-related things like version upgrades.
  • EKS provides a native and upstream Kubernetes experience. This means, with EKS you get vanilla, un-forked Kubernetes. Of course, in keeping with our first tenant, we ensure the Kubernetes versions we run have security-related patches, even for older, supported versions as quickly as possible. However, in terms of portability there’s no special sauce and no lock in.
  • If you want to use additional AWS services, the integrations are as seamless as possible.
  • The EKS team in AWS actively contributes to the upstream Kubernetes project, both on the technical level as well as community, from communicating good practices to participation in SIGs and working groups.

The first two tenets are highlighted and that is for a good reason: on the one hand we aim to go in lock-step with the upstream release cadence as much as possible, including outcomes of the SIG PM as well as the LTS Working Group. Given that running a service for production applications is our main focus, we want to make sure that you can rely on the Kubernetes we run for you. This includes, but is not limited to, security considerations around community support for ongoing bug fixes and patches for critical vulnerabilities and exposures (CVEs).

In this post, we want to give you a heads-up on upcoming changes with out Amazon EKS is managing the lifecycle for Kubernetes versions, walk you through the process in general and then have a look at a concrete example, Kubernetes version 1.10. This version happens to be the first version that will be deprecated on Amazon EKS.

But why now?

Glad you asked. It’s really all about security. Past a certain point (usually 1 year), the Kubernetes community stops releasing bug and CVE patches. Additionally, the Kubernetes project does not encourage CVE submission for deprecated versions. This means that vulnerabilities specific to an older version of Kubernetes may not even be reported, leaving users exposed with no notice in the case of a vulnerability. We consider this to be an unacceptable security posture for our customers.

Earlier this year we announced support for Kubernetes 1.12 in EKS. That, together with our commitment to support three Kubernetes versions at any given point in time and the fact that 1.13 will land very soon in EKS means that we have to deprecate 1.10, after which the three supported versions, unsurprisingly, will be 1.11, 1.12, and (you guessed it) 1.13. OK, with that out of the way, let’s have a look at the options you have to move to the latest Kubernetes versions with Amazon EKS and then dive into the update and deprecation process in greater detail:

  • Ideally, you test a new version and move to one of the three supported ones, in time (details below).
  • If you are still on a version we deprecate, you will be upgraded automatically, after some time (details, again, below).
  • If you’re using a deprecated version beyond a certain point and we can’t upgrade the cluster, we may deactivate it.

A quick Kubernetes release cycle refresher

In a nutshell, the Kubernetes versioning and release regime is roughly following a four-releases-per-year pattern, with cadence varying between 70 and 130 days. It also lays out an expectation in terms of upgrades:

We expect users to stay reasonably up-to-date with the versions of Kubernetes they use in production, but understand that it may take time to upgrade, especially for production-critical components.

The formal API versioning allows for a strict deprecation policy which states, amongst other things, that stable (GA) API support is “12 months or 3 releases (whichever is longer)”.

Now that we’re on the same page how upstream Kubernetes releases are managed, let’s have a look at how we at AWS implement the process in EKS.

The EKS Process

In line with the Kubernetes community support for Kubernetes versions, Amazon EKS is committed to running at least three production-ready versions of Kubernetes at any given time, with a fourth version in deprecation. A new Kubernetes version is released as generally available by the Kubernetes project every 70 and 130 days (we take the average of 90 days for simplicity). New GA versions will be supported by EKS some time after GA release (typically at the first patch version release – 1.XX.1, but sometimes later). This means that the total time a version is in production with EKS should be roughly 270 days.

We will announce the deprecation of a given Kubernetes version (n) at least 60 days before the deprecation date and over time, will align the deprecation of a Kubernetes version on EKS to be on or after the date the Kubernetes project stops supporting the version upstream.

For example, we will announce deprecation of version 1.10 while 1.12 is available for EKS and complete the deprecation process after version 1.13 is available for EKS. We will announce the deprecation of 1.11 after 1.13 is available and complete the deprecation after 1.14 is available for EKS.

The following table shows how this will work:

 EKS Version



 About +90 days

 About +180 days

 About +270 days

 Latest Available 


















 In Deprecation 





When we announce the deprecation, we will give customers a specific date when new cluster creation will be disabled for the version targeted for deprecation. On this date, EKS clusters running the version targeted for deprecation will begin to be updated to the next EKS-supported version of Kubernetes. This means that if the deprecated version is 1.10, clusters will be automatically updated to version 1.11. If a cluster is automatically updated by EKS, customers will need to update the version of their worker nodes after the update is complete. Kubernetes has compatibility between masters and workers for at least 2 versions, so 1.10 workers will continue to operate when orchestrated by a 1.11 control plane.

Upcoming deprecation of Kubernetes 1.10 in EKS

Amazon EKS will deprecate Kubernetes version 1.10 on July 22, 2019. On this day, you will no longer be able to create new 1.10 clusters and all EKS clusters running Kubernetes version 1.10 will be updated to the latest available platform version of Kubernetes version 1.11.

We recommend that all Amazon EKS customers update their 1.10 clusters to Kubernetes version 1.11 or 1.12 as soon as possible.


Wrapping up

What can you do today to prepare? Well, first off, internalize the timeline and try to align internal processes with it. Our documentation has more information about the EKS Kubernetes version deprecation process and EKS updates. If you have any questions, send us a note on our version deprecation issue in the public containers roadmap on GitHub.