Many AWS customers are using GitLab for their DevOps needs, including source control, and continuous integration and continuous delivery (CI/CD). Many of our customers are using GitLab SaaS (the hosted edition), while others are using GitLab Self-managed to meet their security and compliance requirements.
Customers can easily add runners to their GitLab instance to perform various CI/CD jobs. These jobs include compiling source code, building software packages or container images, performing unit and integration testing, etc.—even all the way to production deployment. For the SaaS edition, GitLab offers hosted runners, and customers can provide their own runners as well. Customers who run GitLab Self-managed must provide their own runners.
In this post, we’ll discuss how customers can maximize their CI/CD capabilities by managing their GitLab runner and executor fleet with Amazon Elastic Kubernetes Service (Amazon EKS). We’ll leverage both x86 and Graviton runners, allowing customers for the first time to build and test their applications both on x86 and on AWS Graviton, our most powerful, cost-effective, and sustainable instance family. In keeping with AWS’s philosophy of “pay only for what you use,” we’ll keep our Amazon Elastic Compute Cloud (Amazon EC2) instances as small as possible, and launch ephemeral runners on Spot instances. We’ll demonstrate building and testing a simple demo application on both architectures. Finally, we’ll build and deliver a multi-architecture container image that can run on Amazon EC2 instances or AWS Fargate, both on x86 and Graviton.
A runner is an application to which GitLab sends jobs that are defined in a CI/CD pipeline. The runner receives jobs from GitLab and executes them—either by itself, or by passing them to an executor (we’ll visit the executor in the next section).
In our design, we’ll be using a pair of self-hosted runners. One runner will accept jobs for the x86 CPU architecture, and the other will accept jobs for the arm64 (Graviton) CPU architecture. To help us route our jobs to the proper runner, we’ll apply some tags to each runner indicating the architecture for which it will be responsible. We’ll tag the x86 runner with x86, x86-64, and amd64, thereby reflecting the most common nicknames for the architecture, and we’ll tag the arm64 runner with arm64.
Currently, these runners must always be running so that they can receive jobs as they are created. Our runners only require a small amount of memory and CPU, so that we can run them on small EC2 instances to minimize cost. These include t4g.micro for Graviton builds, or t3.micro or t3a.micro for x86 builds.
To save money on these runners, consider purchasing a Savings Plan or Reserved Instances for them. Savings Plans and Reserved Instances can save you up to 72% over on-demand pricing, and there’s no minimum spend required to use them.
Kubernetes executors
In GitLab CI/CD, the executor’s job is to perform the actual build. The runner can create hundreds or thousands of executors as needed to meet current demand, subject to the concurrency limits that you specify. Executors are created only when needed, and they are ephemeral: once a job has finished running on an executor, the runner will terminate it.
In our design, we’ll use the Kubernetes executor that’s built into the GitLab runner. The Kubernetes executor simply schedules a new pod to run each job. Once the job completes, the pod terminates, thereby freeing the node to run other jobs.
The Kubernetes executor is highly customizable. We’ll configure each runner with a nodeSelector that makes sure that the jobs are scheduled only onto nodes that are running the specified CPU architecture. Other possible customizations include CPU and memory reservations, node and pod tolerations, service accounts, volume mounts, and much more.
Scaling worker nodes
For most customers, CI/CD jobs aren’t likely to be running all of the time. To save cost, we only want to run worker nodes when there’s a job to run.
To make this happen, we’ll turn to Karpenter. Karpenter provisions EC2 instances as soon as needed to fit newly-scheduled pods. If a new executor pod is scheduled, and there isn’t a qualified instance with enough capacity remaining on it, then Karpenter will quickly and automatically launch a new instance to fit the pod. Karpenter will also periodically scan the cluster and terminate idle nodes, thereby saving on costs. Karpenter can terminate a vacant node in as little as 30 seconds.
Karpenter can launch either Amazon EC2 on-demand or Spot instances depending on your needs. With Spot instances, you can save up to 90% over on-demand instance prices. Since CI/CD jobs often aren’t time-sensitive, Spot instances can be an excellent choice for GitLab execution pods. Karpenter will even automatically find the best Spot instance type to speed up the time it takes to launch an instance and minimize the likelihood of job interruption.
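For reference, the Karpenter add-on we deploy later generates a provisioner from a few simple specs. A hand-written equivalent might look roughly like the following sketch, assuming the v1alpha5 Provisioner API that shipped with Karpenter at the time and an illustrative provider reference:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Allow both CPU architectures so executor pods can land on either
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]
    # Allow Spot as well as On-Demand capacity
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  # Remove nodes that have been empty for 30 seconds
  ttlSecondsAfterEmpty: 30
  providerRef:
    name: default   # illustrative AWSNodeTemplate name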
Deploying our solution
To deploy our solution, we’ll write a small application using the AWS Cloud Development Kit (AWS CDK) and the EKS Blueprints library. AWS CDK is an open-source software development framework to define your cloud application resources using familiar programming languages. EKS Blueprints is a library designed to make it simple to deploy complex Kubernetes resources to an Amazon EKS cluster with minimum coding.
The high-level infrastructure code – which can be found in our GitLab repo – is very simple. I’ve included comments to explain how it works.
// Imports assume CDK v2 and the @aws-quickstart/eks-blueprints package;
// the local module path for GitLabRunner is illustrative — adjust to match the repo.
import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { KubernetesVersion, ClusterLoggingTypes } from 'aws-cdk-lib/aws-eks';
import { GitLabRunner, CpuArch } from './gitlab-runner';

// All CDK applications start with a new cdk.App object.
const app = new cdk.App();

// Create a new EKS cluster at v1.23. Run all non-DaemonSet pods in the
// `kube-system` (coredns, etc.) and `karpenter` namespaces in Fargate
// so that we don't have to maintain EC2 instances for them.
const clusterProvider = new blueprints.GenericClusterProvider({
  version: KubernetesVersion.V1_23,
  fargateProfiles: {
    main: {
      selectors: [
        { namespace: 'kube-system' },
        { namespace: 'karpenter' },
      ]
    }
  },
  clusterLogging: [
    ClusterLoggingTypes.API,
    ClusterLoggingTypes.AUDIT,
    ClusterLoggingTypes.AUTHENTICATOR,
    ClusterLoggingTypes.CONTROLLER_MANAGER,
    ClusterLoggingTypes.SCHEDULER
  ]
});

// EKS Blueprints uses a Builder pattern.
blueprints.EksBlueprint.builder()
  .clusterProvider(clusterProvider) // start with the Cluster Provider
  .addOns(
    // Use the EKS add-ons that manage coredns and the VPC CNI plugin
    new blueprints.addons.CoreDnsAddOn('v1.8.7-eksbuild.3'),
    new blueprints.addons.VpcCniAddOn('v1.12.0-eksbuild.1'),
    // Install Karpenter
    new blueprints.addons.KarpenterAddOn({
      provisionerSpecs: {
        // Karpenter examines scheduled pods for the following labels
        // in their `nodeSelector` or `nodeAffinity` rules and routes
        // the pods to the node with the best fit, provisioning a new
        // node if necessary to meet the requirements.
        //
        // Allow either amd64 or arm64 nodes to be provisioned
        'kubernetes.io/arch': ['amd64', 'arm64'],
        // Allow either Spot or On-Demand nodes to be provisioned
        'karpenter.sh/capacity-type': ['spot', 'on-demand']
      },
      // Launch instances in the VPC private subnets
      subnetTags: {
        Name: 'gitlab-runner-eks-demo/gitlab-runner-eks-demo-vpc/PrivateSubnet*'
      },
      // Apply security groups that match the following tags to the launched instances
      securityGroupTags: {
        'kubernetes.io/cluster/gitlab-runner-eks-demo': 'owned'
      }
    }),
    // Create a pair of new GitLab runner deployments, one running on an
    // arm64 (Graviton) instance, the other on an x86_64 instance.
    // We'll show the definition of the GitLabRunner class below.
    new GitLabRunner({
      arch: CpuArch.ARM_64,
      // If you're using an on-premises GitLab installation, you'll want
      // to change the URL below.
      gitlabUrl: 'https://gitlab.com',
      // Kubernetes Secret containing the runner registration token
      // (discussed later)
      secretName: 'gitlab-runner-secret'
    }),
    new GitLabRunner({
      arch: CpuArch.X86_64,
      gitlabUrl: 'https://gitlab.com',
      secretName: 'gitlab-runner-secret'
    }),
  )
  .build(app,
    // Stack name
    'gitlab-runner-eks-demo');
The GitLabRunner class is a HelmAddOn subclass that takes a few parameters from the top-level application:
// Imports assume the @aws-quickstart/eks-blueprints package and lodash;
// adjust to match the versions pinned in the repo.
import { HelmAddOn, ClusterInfo, Values } from '@aws-quickstart/eks-blueprints';
import { Construct } from 'constructs';
import { uniq } from 'lodash';

// The location and name of the GitLab Runner Helm chart
const CHART_REPO = 'https://charts.gitlab.io';
const HELM_CHART = 'gitlab-runner';
// The default namespace for the runner
const DEFAULT_NAMESPACE = 'gitlab';
// The default Helm chart version
const DEFAULT_VERSION = '0.40.1';

export enum CpuArch {
  ARM_64 = 'arm64',
  X86_64 = 'amd64'
}

// Configuration parameters
interface GitLabRunnerProps {
  // The CPU architecture of the node on which the runner pod will reside
  arch: CpuArch
  // The GitLab API URL
  gitlabUrl: string
  // Kubernetes Secret containing the runner registration token (discussed later)
  secretName: string
  // Optional tags for the runner. These will be added to the default list
  // corresponding to the runner's CPU architecture.
  tags?: string[]
  // Optional Kubernetes namespace in which the runner will be installed
  namespace?: string
  // Optional Helm chart version
  chartVersion?: string
}

export class GitLabRunner extends HelmAddOn {
  private arch: CpuArch;
  private gitlabUrl: string;
  private secretName: string;
  private tags: string[] = [];

  constructor(props: GitLabRunnerProps) {
    // Invoke the superclass (HelmAddOn) constructor
    super({
      name: `gitlab-runner-${props.arch}`,
      chart: HELM_CHART,
      repository: CHART_REPO,
      namespace: props.namespace || DEFAULT_NAMESPACE,
      version: props.chartVersion || DEFAULT_VERSION,
      release: `gitlab-runner-${props.arch}`,
    });

    this.arch = props.arch;
    this.gitlabUrl = props.gitlabUrl;
    this.secretName = props.secretName;

    // Set default runner tags
    switch (this.arch) {
      case CpuArch.X86_64:
        this.tags.push('amd64', 'x86', 'x86-64', 'x86_64');
        break;
      case CpuArch.ARM_64:
        this.tags.push('arm64');
        break;
    }
    this.tags.push(...props.tags || []); // Add any custom tags
  }

  // `deploy` method required by the abstract class definition. Our implementation
  // simply installs a Helm chart to the cluster with the proper values.
  deploy(clusterInfo: ClusterInfo): void | Promise<Construct> {
    const chart = this.addHelmChart(clusterInfo, this.getValues(), true);
    return Promise.resolve(chart);
  }

  // Returns the values for the GitLab Runner Helm chart
  private getValues(): Values {
    return {
      gitlabUrl: this.gitlabUrl,
      runners: {
        config: this.runnerConfig(), // runner config.toml file, from below
        name: `demo-runner-${this.arch}`, // name as seen in GitLab UI
        tags: uniq(this.tags).join(','),
        secret: this.secretName, // see below
      },
      // Labels to constrain the nodes where this runner can be placed
      nodeSelector: {
        'kubernetes.io/arch': this.arch,
        'karpenter.sh/capacity-type': 'on-demand'
      },
      // Default pod label
      podLabels: {
        'gitlab-role': 'manager'
      },
      // Create all the necessary RBAC resources including the ServiceAccount
      rbac: {
        create: true
      },
      // Required resources (memory/CPU) for the runner pod. The runner
      // is fairly lightweight as it's a self-contained Golang app.
      resources: {
        requests: {
          memory: '128Mi',
          cpu: '256m'
        }
      }
    };
  }

  // This string contains the runner's `config.toml` file including the
  // Kubernetes executor's configuration. Note the nodeSelector constraints
  // (including the use of Spot capacity and the CPU architecture).
  private runnerConfig(): string {
    return `
      [[runners]]
        [runners.kubernetes]
          namespace = "{{.Release.Namespace}}"
          image = "ubuntu:16.04"
        [runners.kubernetes.node_selector]
          "kubernetes.io/arch" = "${this.arch}"
          "kubernetes.io/os" = "linux"
          "karpenter.sh/capacity-type" = "spot"
        [runners.kubernetes.pod_labels]
          gitlab-role = "runner"
      `.trim();
  }
}
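The runners authenticate to GitLab with a registration token, which they read from the Kubernetes Secret referenced by secretName. Create that secret in the runner namespace: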
# These two values must match the parameters supplied to the GitLabRunner constructor
NAMESPACE=gitlab
SECRET_NAME=gitlab-runner-secret
# The value of the registration token.
TOKEN=GRxxxxxxxxxxxxxxxxxxxxxx
kubectl -n $NAMESPACE create secret generic $SECRET_NAME \
--from-literal="runner-registration-token=$TOKEN" \
--from-literal="runner-token="
Building a multi-architecture container image
Now that we’ve launched our GitLab runners and configured the executors, we can build and test a simple multi-architecture container image. If the tests pass, we can then upload it to our project’s GitLab container registry. Our application is simple: a web server written in Go that prints “Hello World” along with the architecture it’s running on.
Find the source code of our sample app in our GitLab repo.
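The actual implementation lives in that repo; as a rough sketch, such a server needs little more than the standard library’s net/http and runtime packages:
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// runtime.GOARCH reports the architecture the binary was compiled for
	// (amd64 on x86, arm64 on Graviton).
	fmt.Fprintf(w, "Hello World from %s!\n", runtime.GOARCH)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}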
In GitLab, the CI/CD configuration lives in the .gitlab-ci.yml file at the root of the source repository. In this file, we declare a list of ordered build stages, and then we declare the specific jobs associated with each stage.
Our stages are:
The build stage, in which we compile our code, produce our architecture-specific images, and upload these images to the GitLab container registry. These uploaded images are tagged with a suffix indicating the architecture on which they were built. This job uses a matrix variable to run in parallel against two different runners – one for each supported architecture. Furthermore, rather than using docker build to produce our images, we use Kaniko to build them. This lets us build our images in an unprivileged container environment and considerably improves the security posture.
The test stage, in which we test the code. As with the build stage, we use a matrix variable to run the tests in parallel in separate pods on each supported architecture.
The assembly stage, in which we create a multi-architecture image manifest from the two architecture-specific images. Then, we push the manifest into the image registry so that we can refer to it in future deployments.
Figure 2. Example CI/CD pipeline for multi-architecture images.
Here’s what our top-level configuration looks like:
variables:
  # These are used by the runner to configure the Kubernetes executor, and define
  # the values of spec.containers[].resources.requests.{memory,cpu} for the Pod(s).
  KUBERNETES_MEMORY_REQUEST: 1Gi
  KUBERNETES_CPU_REQUEST: 1

# List of stages for jobs, and their order of execution
stages:
  - build
  - test
  - create-multiarch-manifest
Here’s what our build stage job looks like. Note the matrix of variables which are set in BUILD_ARCH as the two jobs are run in parallel:
build-job:
  stage: build
  parallel:
    matrix: # This job is run twice, once on amd64 (x86), once on arm64
      - BUILD_ARCH: amd64
      - BUILD_ARCH: arm64
  tags: [$BUILD_ARCH] # Associate the job with the appropriate runner
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - mkdir -p /kaniko/.docker
    # Configure authentication data for Kaniko so it can push to the
    # GitLab container registry
    - echo "{\"auths\":{\"${CI_REGISTRY}\":{\"auth\":\"$(printf "%s:%s" "${CI_REGISTRY_USER}" "${CI_REGISTRY_PASSWORD}" | base64 | tr -d '\n')\"}}}" > /kaniko/.docker/config.json
    # Build the image and push to the registry. In this stage, we append the build
    # architecture as a tag suffix.
    - >-
      /kaniko/executor
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}-${BUILD_ARCH}"
Here’s what our test stage job looks like. This time, we use the image that we just produced. Our source code is copied into the application container, and we run make test-container to execute the server test suite.
test-job:
  stage: test
  parallel:
    matrix: # This job is run twice, once on amd64 (x86), once on arm64
      - BUILD_ARCH: amd64
      - BUILD_ARCH: arm64
  tags: [$BUILD_ARCH] # Associate the job with the appropriate runner
  image:
    # Use the image we just built
    name: "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}-${BUILD_ARCH}"
  script:
    - make test-container
Finally, here’s what our assembly stage looks like. We use Podman to build the multi-architecture manifest and push it into the image registry. Traditionally we might have used docker buildx to do this, but using Podman lets us do this work in an unprivileged container for additional security.
create-manifest-job:
  stage: create-multiarch-manifest
  tags: [arm64]
  image: public.ecr.aws/docker/library/fedora:36
  script:
    - yum -y install podman
    - echo "${CI_REGISTRY_PASSWORD}" | podman login -u "${CI_REGISTRY_USER}" --password-stdin "${CI_REGISTRY}"
    - COMPOSITE_IMAGE=${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}
    - podman manifest create ${COMPOSITE_IMAGE}
    - >-
      for arch in arm64 amd64; do
        podman manifest add ${COMPOSITE_IMAGE} docker://${COMPOSITE_IMAGE}-${arch};
      done
    - podman manifest inspect ${COMPOSITE_IMAGE}
    # The composite image manifest omits the architecture from the tag suffix.
    - podman manifest push ${COMPOSITE_IMAGE} docker://${COMPOSITE_IMAGE}
Trying it out
I’ve created a public test GitLab project containing the sample source code, and attached the runners to the project. We can see them at Settings > CI/CD > Runners:
Figure 3. GitLab runner configurations.
Here we can also see some pipeline executions, some of which have succeeded while others have failed.
Figure 4. GitLab sample pipeline executions.
We can also see the specific jobs associated with a pipeline execution:
Figure 5. GitLab sample job executions.
Finally, here are our container images:
Figure 6. GitLab sample container registry.
Conclusion
In this post, we’ve illustrated how you can quickly and easily construct multi-architecture container images with GitLab, Amazon EKS, Karpenter, and Amazon EC2, using both the x86 and Graviton instance families. We focused on using as many managed services as possible, maximizing security, and minimizing complexity and TCO. We dove deep on multiple facets of the process, and discussed how to save up to 90% of the solution’s cost by using Spot instances for CI/CD executions.
Find the sample code, including everything shown here today, in our GitLab repository.
Building multi-architecture images will unlock the value and performance of running your applications on AWS Graviton and give you increased flexibility over compute choice. We encourage you to get started today.
In this two-part series, we discuss the challenges faced by OfferUp, a Digital Native customer, in meeting its business growth and time-to-market goals. The journey involved modernizing its existing DevOps platform from a traditional monolithic, virtual machine (VM)-based architecture to a modern containerized architecture running cloud-native applications with secure progressive delivery, accelerating time to market. This series provides strategies, architecture patterns, and technical steps you can adopt to become more agile and innovative, as OfferUp has.
OfferUp engineers were using the homegrown DevOps platform to build and release new services on the marketplace platform. In this first post, we discuss the key challenges encountered by OfferUp engineers with the existing DevOps platform, as well as how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger, automating production releases with progressive delivery techniques for faster time-to-market with new products and services. Amazon EKS is a managed container service to run and scale Kubernetes applications in the cloud or on-premises.
Previous DevOps architecture
OfferUp is a leading online and mobile customer to customer (C2C) marketplace where users can both buy and sell goods on the platform. Users can browse and purchase products from a broad range of categories, including furniture, clothing, sports equipment, toys, and many more. As a mobile-first company, OfferUp puts a great deal of emphasis on in-person communication between buyers and sellers.
OfferUp built a homegrown, self-managed DevOps platform. This platform used a set of manual processes and third-party applications that allowed both developers and operations engineers to build and deploy code to a production environment. The DevOps pipeline covered areas such as source code control, continuous integration/continuous delivery (CI/CD), microservices, and development and test methodologies. The following diagram depicts the previous architecture of OfferUp’s DevOps platform, which was self-managed on Amazon Elastic Compute Cloud (Amazon EC2).
Figure 1: Previous DevOps architecture of OfferUp
OfferUp used GitHub for code repositories. Once the source code was committed in the code repository, Jenkins pulled the source code from code repositories on a scheduled or on-demand basis and built Amazon Machine Images (AMI). The built image was deployed in production by a custom built deployment tool, Vanaheim, which supports one-box canary deployment and full roll-out deployment strategies. The DevOps engineers used to manually create a deployment job in the Vanaheim portal and then manually monitor the test success rate and service metrics to detect any impact from the deployment. Once the success rate was reached, a full production roll out was performed from the Vanaheim portal.
Key challenges with previous DevOps pipeline
In 2020, OfferUp experienced significant transaction volume growth on its Marketplace platform with the increase of its user base. With OfferUp’s acquisition of LetGo in 2020, there was a need to build a scalable DevOps platform to support future integration and organic growth. The previous DevOps platform, designed and deployed over seven years ago, had reached the limits of its scalability, and could no longer keep up with the platform’s growth. The previous architecture was expensive to run and had a complex infrastructure that made it difficult to upgrade and add new features.
The following key factors drove the push for modernization:
Manual verification was required to check whether the code was correctly deployed to one of the servers in production; if the deployment was correct on that server, it was then rolled out to the other production servers. Full rollout to production wasn’t automated due to frequent failures requiring manual rollbacks.
The previous platform required a longer deployment time (1–2 hours) due to its batch process, which sometimes delayed the release and testing of new features.
The self-managed nature of the Jenkins and Vanaheim clusters was consuming far too much engineering time. Most of the institutional knowledge of this legacy platform was lost over the years and it didn’t align with OfferUp’s philosophy of small DevOps engineering teams. Innovation had stalled partly due to the difficulty of simultaneously upgrading the DevOps platform and releasing new features.
DevOps platform automation with Flagger and Gloo Ingress Controller on Amazon EKS
A key requirement for the next-generation system was that the new architecture would reduce the operational burden on engineering teams, shorten the deployment lifecycle, and lower the total cost of ownership. OfferUp evaluated multiple managed container orchestration platforms for its DevOps platform and ultimately selected Amazon EKS for its high availability, for reducing the average time to deploy a change to the stack from hours to just a few minutes, and for reducing the complexity of managing and upgrading the Kubernetes cluster. On the Amazon EKS platform, OfferUp uses Flagger, a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements several deployment strategies (canary releases, A/B testing, and blue/green mirroring) using the Gloo Edge ingress controller for traffic routing. Datadog is used as the observability service to monitor the health of deployments and drive the canary analysis for progressive delivery. For release analysis, Flagger runs queries against Datadog and uses Slack for alerting and notifications. The cloud-native technology components of the architecture are described as follows:
Kubernetes and Amazon EKS – Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications, and is a graduated project in the CNCF. Amazon EKS is a fully managed, certified Kubernetes-conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on AWS. Amazon EKS integrates with core AWS services, such as Amazon CloudWatch, Auto Scaling groups, and AWS Identity and Access Management (IAM), to provide a seamless experience for monitoring, scaling, and load balancing your containerized applications.
Helm – Helm manages Kubernetes applications. Helm charts help you define, install, and upgrade even the most complex Kubernetes applications, and charts are easy to create, version, share, and publish. If Kubernetes were an operating system, Helm would be its package manager. Helm is a graduated project in the CNCF and is maintained by the Helm community.
Flagger – Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as HTTP request success rate, average request duration, and pod health. Based on the set thresholds, a canary is either promoted or aborted, and its analysis is pushed to a Slack channel. Flagger is a CNCF project and part of the Flux family of GitOps tools.
Gloo Edge – Gloo Edge is a feature-rich, Kubernetes-native ingress controller. Gloo Edge is exceptional in its function-level routing; its support for legacy apps, microservices, and serverless; its discovery capabilities; and its tight integration with leading open-source projects. Gloo Edge is uniquely designed to support hybrid applications, in which multiple technologies, architectures, protocols, and clouds can coexist.
Observability platform – Datadog’s integrations with Kubernetes, Docker, and AWS let you track the full range of Amazon EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications, with features like autodiscovery to track services across containers, sophisticated graphing, and alerting options.
Modernized DevOps architecture
In the new architecture, OfferUp uses GitHub as its version control tool and GitHub Actions as its CI/CD tool. On every pull request, tests are run, artifacts are built and stored in JFrog Artifactory, and Docker images are stored in Amazon Elastic Container Registry (Amazon ECR). Separate deployment pipelines are triggered based on the environment (dev, staging, or production) of choice. Flagger detects any change in the version of the application and gradually shifts production traffic to the canary. It measures the request success rate and average response duration metrics from Datadog to decide on a full rollout to production. For an application deployment, a canary promotion can be defined using Flagger’s custom resource. Flagger rolls back the deployment when the success rate falls below the defined desired success rate metrics.
Figure 2: Modernized DevOps architecture of OfferUp
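For illustration, such a Canary resource might look like the following sketch; the target names, port, and thresholds are assumptions, the provider is set to Gloo for traffic routing, and OfferUp’s Datadog-based analysis would be wired in through Flagger metric templates rather than the built-in metrics shown here:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: production
spec:
  provider: gloo                  # route canary traffic through Gloo Edge
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  service:
    port: 8080
  analysis:
    interval: 1m                  # how often Flagger evaluates the canary
    threshold: 5                  # failed checks before automatic rollback
    maxWeight: 50                 # maximum traffic weight shifted to the canary
    stepWeight: 10                # traffic increment per interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99                 # promote only while success rate stays above 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500                # maximum acceptable duration in milliseconds
        interval: 1m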
With the modernized DevOps platform, OfferUp moved from a monolithic to a microservices architecture in which front-end applications and GraphQL run on the Amazon EKS cluster. The production cluster runs 110 services and 650+ pods on 60 nodes, and scales up to 100 nodes with Amazon EC2 Auto Scaling groups based on the traffic pattern. On the networking front, the cluster has a private endpoint and uses both the VPC CNI plugin and the CoreDNS add-on. There are four Amazon EKS clusters, one each for the production, test, utility, and staging environments. OfferUp plans to explore the Karpenter open-source autoscaling project and will move new applications to the Amazon EKS cluster, allowing the total node count to scale up to 200.
Benefits of modernized architecture
The new architecture helped OfferUp make automated decisions to deploy new releases and improve time to market while reducing unplanned production downtime:
Faster deployments and quicker rollbacks – The new architecture reduces the service deployment time from one hour to five minutes and automates rollbacks, cutting rollback time from one hour of manual work to five minutes.
Automated deployment of new releases – The lack of a canary deployment process in the previous architecture required OfferUp engineers to manually intervene to validate the deployment status, which led to administrative overhead and production outages. Canary deployments take care of the traffic shifting by automatically measuring the requests’ success rate and latency metrics from Datadog and subsequently releasing the service to production. Deployments are automatically rolled back when the success rate falls below the defined success rate metric thresholds.
Simplified configuration – Configuration has been simplified drastically and integrated within the CI/CD pipeline in the new architecture, thereby reducing configuration complexity, eliminating manual processes, and saving developers time.
More time to focus on innovation – With fully automated progressive delivery, developers no longer need to spend time testing and releasing source code in production. Similarly, migrating from a self-managed DevOps platform to the managed Amazon EKS service lowered the DevOps platform’s infrastructure management burden on the engineering team. This helps developers spend more time focusing on building and testing new features and innovations.
Cost reduction – Moving from the self-managed Amazon EC2-based architecture to the Amazon EKS cluster reduced the cost of operations through shared nodes and improved pod density. The previous architecture used 200 Amazon EC2 instances; the same workload was moved to a 50-node Amazon EKS cluster. Furthermore, the custom applications (Vanaheim and Jenkins) were retired, further reducing costs.
Conclusion
In this post, you saw how OfferUp embarked on the journey to modernize its DevOps platform to support its growth and developer velocity. The key factors that drove the modernization decisions were the ability to scale the platform to support the automated testing of features in production, the faster release of new features, cost reduction, and facilitating future innovation. The modernized DevOps platform on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design opens up plenty of headroom for growth.
We encourage you to look into modernizing your existing CI/CD pipeline on Amazon EKS with the Flagger progressive delivery mechanism. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities, such as new Kubernetes version deployments.
In the next part of the series, you’ll discover how to implement Flagger and Gloo Edge Ingress Controller on Amazon EKS to automate the release process for applications running on Kubernetes.
Further Reading
Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller
This blog is written by Rajesh Kesaraju, Sr. Solution Architect, EC2-Flexible Compute and Peter Manastyrny, Sr. Product Manager, EC2.
Today AWS is adding two new attributes for the attribute-based instance type selection (ABS) feature to make it even easier to create and manage instance type flexible configurations on Amazon EC2. The new network bandwidth attribute allows customers to request instances based on the network requirements of their workload. The new allowed instance types attribute is useful for workloads that have some instance type flexibility but still need more granular control over which instance types to run on.
Before exploring the new attributes in detail, let us review the core ABS capability.
ABS refresher
ABS lets you express your instance type requirements as a set of attributes, such as vCPU, memory, and storage, when provisioning EC2 instances with Amazon EC2 Auto Scaling groups (ASGs), EC2 Fleet, or Spot Fleet. Your requirements are translated by ABS to all matching EC2 instance types, simplifying the creation and maintenance of instance type flexible configurations. ABS identifies the instance types based on attributes that you set in ASG, EC2 Fleet, or Spot Fleet configurations. When Amazon EC2 releases new instance types, ABS will automatically consider them for provisioning if they match the selected attributes, removing the need to update configurations to include new instance types.
ABS helps you to shift from an infrastructure-first to an application-first paradigm. ABS is ideal for workloads that need generic compute resources and do not necessarily require the hardware differentiation that the Amazon EC2 instance type portfolio delivers. By defining a set of compute attributes instead of specific instance types, you allow ABS to always consider the broadest and newest set of instance types that qualify for your workload. When you use EC2 Spot Instances to optimize your costs and save up to 90% compared to On-Demand prices, instance type diversification is the key to access the highest amount of Spot capacity. ABS provides an easy way to configure and maintain instance type flexible configurations to run fault-tolerant workloads on Spot Instances.
We recommend ABS as the default compute provisioning method for instance type flexible workloads including containerized apps, microservices, web applications, big data, and CI/CD.
Now, let us dive deep on the two new attributes: network bandwidth and allowed instance types.
How network bandwidth attribute for ABS works
Network bandwidth attribute allows customers with network-sensitive workloads to specify their network bandwidth requirements for compute infrastructure. Some of the workloads that depend on network bandwidth include video streaming, networking appliances (e.g., firewalls), and data processing workloads that require faster inter-node communication and high-volume data handling.
The network bandwidth attribute uses the same min/max format as other ABS attributes (e.g., vCPU count or memory) that assume a numeric value or range (e.g., min: ‘10’ or min: ‘15’; max: ‘40’). Note that setting the minimum network bandwidth does not guarantee that your instance will achieve that network bandwidth. ABS will identify instance types that support the specified minimum bandwidth, but the actual bandwidth of your instance might go below the specified minimum at times.
Two important things to remember when using the network bandwidth attribute are:
ABS will only take burst bandwidth values into account when evaluating maximum values. When evaluating minimum values, only the baseline bandwidth will be considered.
For example, if you specify the minimum bandwidth as 10 Gbps, instances that have burst bandwidth of “up to 10 Gbps” will not be considered, as their baseline bandwidth is lower than the minimum requested value (e.g., m5.4xlarge is burstable up to 10 Gbps with a baseline bandwidth of 5 Gbps).
Alternatively, c5n.2xlarge, which is burstable up to 25 Gbps with a baseline bandwidth of 10 Gbps will be considered because its baseline bandwidth meets the minimum requested value.
Our recommendation is to only set a value for maximum network bandwidth if you have specific requirements to restrict instances with higher bandwidth. That would help to ensure that ABS considers the broadest possible set of instance types to choose from.
Using the network bandwidth attribute in ASG
In this example, let us look at a high-performance computing (HPC) or similarly network-bandwidth-sensitive workload that requires a high volume of inter-node communication. We use ABS to select instances that have at least 10 Gbps of network bandwidth, at least 32 vCPUs, and 64 GiB of memory.
To get started, you can create or update an ASG or EC2 Fleet set up with ABS configuration and specify the network bandwidth attribute.
The following example shows an ABS configuration with network bandwidth attribute set to a minimum of 10 Gbps. In this example, we do not set a maximum limit for network bandwidth. This is done to remain flexible and avoid restricting available instance type choices that meet our minimum network bandwidth requirement.
Create the following configuration file and name it: my_asg_network_bandwidth_configuration.json
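As a sketch of what such a file might contain (the ASG name, launch template, and subnet IDs are placeholders, and 65536 MiB corresponds to the 64 GiB of memory above):
{
    "AutoScalingGroupName": "my-network-bandwidth-asg",
    "MinSize": 0,
    "MaxSize": 10,
    "DesiredCapacity": 2,
    "VPCZoneIdentifier": "subnet-0123abcd,subnet-0456efgh",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-launch-template",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": { "Min": 32 },
                        "MemoryMiB": { "Min": 65536 },
                        "NetworkBandwidthGbps": { "Min": 10 }
                    }
                }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    }
}
You would then create the group with the AWS CLI, referencing the file:
aws autoscaling create-auto-scaling-group \
    --cli-input-json file://my_asg_network_bandwidth_configuration.json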
As a result, you have created an ASG that may include instance types m5.8xlarge, m5.12xlarge, m5.16xlarge, m5n.8xlarge, and c5.9xlarge, among others. The actual selection at the time of the request is made by the capacity-optimized Spot allocation strategy. If EC2 releases an instance type in the future that satisfies the attributes provided in the request, that instance type will also be automatically considered for provisioning.
Considered Instances (not an exhaustive list)
Instance Type    Network Bandwidth
m5.8xlarge    “10 Gbps”
m5.12xlarge “12 Gbps”
m5.16xlarge “20 Gbps”
m5n.8xlarge “25 Gbps”
c5.9xlarge “10 Gbps”
c5.12xlarge “12 Gbps”
c5.18xlarge “25 Gbps”
c5n.9xlarge “50 Gbps”
c5n.18xlarge “100 Gbps”
…
Now let us focus our attention on another new attribute – allowed instance types.
How allowed instance types attribute works in ABS
As discussed earlier, ABS lets us provision compute infrastructure based on our application requirements instead of selecting specific EC2 instance types. Although this infrastructure-agnostic approach is suitable for many workloads, some workloads, while having some instance type flexibility, still need to limit the selection to specific instance families and/or generations due to reasons like licensing or compliance requirements, application performance benchmarking, and others. Furthermore, customers have asked us for the ability to restrict the auto-consideration of newly released instance types in their ABS configurations so that they can meet their specific hardware qualification requirements before considering those types for their workload. To provide this functionality, we added a new allowed instance types attribute to ABS.
The allowed instance types attribute lets ABS customers narrow down the list of instance types that ABS considers for selection to a specific list of instances, families, or generations. It takes a comma-separated list of specific instance types, instance families, and wildcard (*) patterns. Note that it does not use the full regular expression syntax.
For example, consider a container-based web application that can only run on 5th-generation instances from the compute optimized (c), general purpose (m), or memory optimized (r) families. This can be specified as "AllowedInstanceTypes": ["c5*", "m5*", "r5*"].
Another example could be to limit the ABS selection to only memory-optimized instances for big data Spark workloads. This can be specified as "AllowedInstanceTypes": ["r6*", "r5*", "r4*"].
Note that you cannot use both the existing exclude instance types attribute and the new allowed instance types attribute together; doing so results in a validation error.
Using allowed instance types attribute in ASG
Let us look at the InstanceRequirements section of an ASG configuration file for a sample web application. The AllowedInstanceTypes attribute is configured as ["c5.*", "m5.*", "c4.*", "m4.*"], which means that ABS will limit the instance type consideration set to any instance from the 4th and 5th generations of the c or m families. Additional attributes require a minimum of 4 vCPUs and 16 GiB of RAM, and allow both Intel and AMD processors.
Create the following configuration file and name it: my_asg_allow_instance_types_configuration.json
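Again as a sketch only (placeholder names for the ASG, launch template, and subnets; 16384 MiB corresponds to the 16 GiB of RAM above):
{
    "AutoScalingGroupName": "my-allowed-instance-types-asg",
    "MinSize": 0,
    "MaxSize": 10,
    "DesiredCapacity": 2,
    "VPCZoneIdentifier": "subnet-0123abcd,subnet-0456efgh",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-launch-template",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": { "Min": 4 },
                        "MemoryMiB": { "Min": 16384 },
                        "CpuManufacturers": ["intel", "amd"],
                        "AllowedInstanceTypes": ["c5.*", "m5.*", "c4.*", "m4.*"]
                    }
                }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    }
}
Then create the group from the file:
aws autoscaling create-auto-scaling-group \
    --cli-input-json file://my_asg_allow_instance_types_configuration.json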
As a result, you have created an ASG that may include instance types like m5.xlarge, m5.2xlarge, c5.xlarge, and c5.2xlarge, among others. The actual selection at the time of the request is made by the capacity-optimized Spot allocation strategy. Note that if EC2 releases a new instance type in the future that satisfies the other attributes provided in the request but is not a member of the 4th or 5th generation of the m or c families specified in the allowed instance types attribute, that instance type will not be considered for provisioning.
Selected Instances (not an exhaustive list)
m5.xlarge
m5.2xlarge
m5.4xlarge
…
c5.xlarge
c5.2xlarge
…
m4.xlarge
m4.2xlarge
m4.4xlarge
…
c4.xlarge
c4.2xlarge
…
As you can see, ABS considers a broad set of instance types for provisioning; however, they all meet the compute attributes required for your workload.
Cleanup
To delete both ASGs and terminate all the instances, execute the following commands:
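Assuming the placeholder group names used in the sketches above, cleanup amounts to force-deleting each ASG, which also terminates its instances:
aws autoscaling delete-auto-scaling-group \
    --auto-scaling-group-name my-network-bandwidth-asg --force-delete
aws autoscaling delete-auto-scaling-group \
    --auto-scaling-group-name my-allowed-instance-types-asg --force-delete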
In this post, we explored the two new ABS attributes: network bandwidth and allowed instance types. Customers can use these attributes to select instances based on network bandwidth and to limit the set of instances that ABS selects from. The two new attributes, together with the existing set of ABS attributes, enable you to save time creating and maintaining instance type flexible configurations and make it even easier to express the compute requirements of your workload.
ABS represents a paradigm shift in the way that our customers interact with compute, making it easier than ever to request diversified compute resources at scale. We recommend ABS as a tool to help you identify and access the largest amount of EC2 compute capacity for your instance type flexible workloads.
We often hear from customers about their challenges architecting Jenkins for scale and high availability (HA). Jenkins was originally built as a continuous integration (CI) system to test software before it was committed to a repository. Since its beginning, Jenkins has grown out of necessity rather than according to a grand master plan. Developers who extended Jenkins favored speed of creating functionality over performance or scalability of the entire system. This is not to say that it’s impossible to scale Jenkins; it’s mentioned here only to highlight the challenges and the technical debt that have accumulated because features were prioritized over developing toward a specific architecture. In this post, we discuss these challenges and our proposed solution.
Challenges with Jenkins at scale and HA
Business and customer demand are forcing organizations to increase the speed and agility at which they release features and functionality. As organizations make this transition, the usage of continuous integration and continuous delivery (CI/CD) increases, which drives the need to scale Jenkins. Overlay this with an organization that commits hundreds of changes per day and works around the clock, with developers dispersed globally, and you end up with an operational situation where there is no room for downtime. To mitigate the risk of impacting an organization’s ability to release when they need it, developers require a system that not only scales but is also highly available.
The ability to scale Jenkins and provide HA comes down to two problems: one is the ability to scale compute to handle additional jobs, and the second is storage. To scale compute, we typically do so in one of two ways: horizontally or vertically. Scaling horizontally means adding compute nodes to Jenkins, while scaling vertically means adding more resources to the compute node.
Let’s start with the storage problem. Jenkins is designed around the local file system. Anyone who has spent time around Jenkins is aware that logs, cloned repos, plugins, and build artifacts are stored into JENKINS_HOME. Local file systems, while good for single-server designs, tend to be a challenge when HA comes into the picture. In on-premises designs, administrators have often used Network File System (NFS) and Storage Area Networks (SAN) to achieve some scale and resiliency. This type of design comes with a trade-off of performance and doesn’t provide the true HA and inherent disaster recovery (DR) required to meet the demands of the business.
Because of the local file system constraint, there are two native families of storage available in AWS: Amazon Elastic Block Store (Amazon EBS) and Amazon Elastic File System (Amazon EFS). Amazon EBS is great for a single-server design in a single Availability Zone. The challenge is trying to scale a single-server design to support HA. Because of the requirement to assign an EBS volume to a specific Availability Zone, you can’t automatically transition the EBS volume to another Availability Zone and attach it to a Jenkins instance. If you don’t mind having an impact on Recovery Time Objective (RTO) and Recovery Point Objective (RPO), a solution using Amazon EBS snapshots copied to additional Availability Zones might work. Although EBS snapshot copy is possible, it’s not a recommended solution because it doesn’t scale and has complexities in building and maintaining this type of solution.
Amazon EFS as an alternative has worked well for customers that don’t have high usage patterns of Jenkins. All Jenkins instances within the Region can access the Amazon EFS file system and data durably stored in multiple Availability Zones. If a single Availability Zone experiences an outage, the Jenkins file system is still accessible from other Availability Zones providing HA for the storage layer. This solution is not recommended for high-usage systems due to the way that Jenkins reads and writes data. Jenkins’s access pattern is skewed towards writing data such as logs, cloned repos, and building artifacts versus reading data. Amazon EFS, on the other hand, is designed for workloads that read more than they write. On high-usage workloads, customers have experienced Jenkins build slowness and Jenkins page load latency. This is why Amazon EFS isn’t recommended for high-usage Jenkins systems.
Solution for Jenkins at scale and HA
Solving the compute problem is relatively straightforward by using Amazon Elastic Kubernetes Service (Amazon EKS). In the context of Jenkins, an organization would run Jenkins in an Amazon EKS cluster that spans multiple Availability Zones, as shown in the following diagram.
Figure 1 – Jenkins deployment in Amazon EKS with multiple Availability Zones.
Jenkins Controller and Agent would run in an Availability Zone as Kubernetes pods. Amazon EKS is designed around Desired State Configuration (DSC), which means that it continuously makes sure that the running environment matches the configuration that has been applied to Amazon EKS. In practice, when Amazon EKS is told that you want a single pod of Jenkins running, it monitors and makes sure that the pod is always running. If an Availability Zone is unavailable, Amazon EKS launches a new node in another Availability Zone and deploys all pods to meet any necessary constraints defined in Amazon EKS. With this option, we still need to have the data in other Availability Zones, which we cover later in this post.
The only option for scaling Jenkins controllers is vertical. Scaling Jenkins horizontally could lead to an undesirable state, because the system wasn’t designed to have multiple instances of Jenkins attached to the same storage layer and there is no exclusive file-locking mechanism to ensure data consistency. For organizations that have exhausted the limits of vertical scaling, the recommendation is to run multiple independent Jenkins controllers and separate them per team or group. Vertical scaling of Jenkins is simpler in Amazon EKS: node sizes and container memory are controlled by configuration, and increasing memory is as simple as changing a container’s memory setting. Due to the ease of changing configuration, it’s best to start with a lower memory setting, monitor performance, and increase as necessary. You want to find a good balance between price and performance.
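In Kubernetes terms, vertical scaling is just a change to the controller pod’s resource requests and limits; a minimal illustrative stanza (the numbers are placeholders to tune against observed usage) looks like this:
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi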
For Jenkins agents, there are many options to scale the compute. In the context of scale and HA, the best options are to use AWS CodeBuild, AWS Fargate for Amazon EKS, or Amazon EKS managed node groups. With CodeBuild, you don’t need to provision, manage, or scale your build servers. CodeBuild scales continuously and processes multiple builds concurrently. You can use the Jenkins plugin for CodeBuild to integrate CodeBuild with Jenkins. Fargate is a good option, but it has some challenges if you’re trying to build container images within a container, because the permissions required to do so aren’t exposed in Fargate. For additional information on how to overcome this challenge with Jenkins, refer to How to build container images with Amazon EKS on Fargate.
Now let’s look at the storage layer and see how LINBIT is helping organizations solve this problem with LINSTOR. LINBIT’s LINSTOR is an open-source management tool designed to manage block storage devices. Its primary use case is to provide Linux block storage for Kubernetes and other public and private cloud platforms. LINBIT also provides an enterprise subscription for LINSTOR, which includes technical support with an SLA.
The following diagram illustrates a LINSTOR storage solution running on Amazon EKS using multiple Availability Zones and Amazon Simple Storage Service (Amazon S3) for snapshots.
Figure 2. LINSTOR storage solution running on Amazon EKS using multiple availability zones and S3 for snapshot.
LINSTOR is composed of a control plane and a data plane. The control plane consists of a set of containers deployed into Amazon EKS and is responsible for managing the data plane. The data plane consists of a collection of open-source block storage software, most importantly LINBIT’s Distributed Replicated Storage System (DRBD) software. DRBD is responsible for provisioning and synchronously replicating storage between Amazon EKS worker instances in different Availability Zones.
LINSTOR is deployed via Helm into Amazon EKS, and the LINSTOR cluster is initialized by the LINSTOR Operator. Once deployed, LINSTOR volumes and volume snapshots are managed via Kubernetes Storage Classes and Snapshot Classes in a Kubernetes native fashion. LINSTOR volumes are backed by LINSTOR objects known as storage pools, which are composed of one or more EBS volumes attached to each Amazon EKS worker instance.
LINSTOR volumes layer DRBD on top of the worker’s attached EBS volume to enable synchronous replication between peers in the Amazon EKS cluster. This ensures that you have an identical copy of your persistent volume on the EBS volumes in each Availability Zone. In the event of an Availability Zone outage or planned migration, Amazon EKS moves the Jenkins deployment to another Availability Zone where the persistent volume copy is available. In terms of scaling, LINBIT DRBD supports up to 32 replicas per volume, with a maximum size of 1 PiB per volume. A LINSTOR cluster itself can scale beyond hundreds of nodes, as shown in this case study.
LINSTOR also provides an HA Controller component in its control plane to speed up failover times during outages. LINSTOR’s HA Controller looks for pods with a specific label, and if LINSTOR’s persistent volumes replication network becomes interrupted (like during an Availability Zone outage), LINSTOR reschedules the pod sooner than the default Kubernetes pod-eviction-timeout.
LINBIT provides a detailed installation guide for Jenkins HA in AWS. A sample of LINSTOR’s Helm values supporting these features is as follows:
To protect against entire AWS Region outages and provide disaster recovery, LINSTOR takes volume snapshots and replicates them across Regions using Amazon S3. LINSTOR requires read and write access to the target S3 bucket, using AWS credentials provided as Kubernetes secrets:
The target S3 bucket is referenced as a snapshot shipping target using a LINSTOR S3 VolumeSnapshotClass. The following example shows a VolumeSnapshotClass referencing the S3 bucket’s secret and additional configuration for the target S3 bucket:
The Jenkins deployment’s persistent volume claim (PVC) is stored as a snapshot in Amazon S3 by using a standard Kubernetes VolumeSnapshot definition with LINSTOR’s snapshot class for Amazon S3:
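A minimal sketch of such a VolumeSnapshot, assuming an illustrative snapshot class named linstor-s3 and a PVC named jenkins-home:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: jenkins-home-snapshot
  namespace: jenkins
spec:
  # Snapshot class configured to ship snapshots to the target S3 bucket
  volumeSnapshotClassName: linstor-s3
  source:
    # PVC backing the Jenkins controller's JENKINS_HOME volume
    persistentVolumeClaimName: jenkins-home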
In this post, we explained the challenges to scale Jenkins for HA and DR. We also reviewed Jenkins storage architecture with Amazon EBS and Amazon EFS and where to apply these. We demonstrated how you can use Amazon EKS to scale Jenkins compute for HA and how AWS partner solutions such as LINBIT LINSTOR can help scale Jenkins storage for HA and DR. Combining both solutions can help organizations maintain their ability to deploy software with speed and agility. We hope you found this post useful as you think through building your CI/CD infrastructure in AWS. To learn more about running Jenkins in Amazon EKS, check out Orchestrate Jenkins Workloads using Dynamic Pod Autoscaling with Amazon EKS. To find out more information about LINBIT’s LINSTOR, check the Jenkins technical guide.
This post is written by Carlos Manzanedo Rueda, WW SA Leader for EC2 Spot, and Brandon Wagner, Senior Software Development Engineer for EC2.
This post focuses on how you can leverage recently released tools to optimize your usage of Amazon EC2 Spot Instances on Kubernetes Operations (kOps) clusters. Spot Instances let you utilize unused capacity in the AWS cloud for up to 90% off compared to On-Demand prices, and they are a great fit for fault-tolerant, containerized applications. kOps is an open source project providing a cohesive toolset for provisioning, operating, and deleting Kubernetes clusters in the cloud.
Even with customers such as Snap Inc., Babylon Health, and Fidelity Investments telling us how Amazon Elastic Kubernetes Service (EKS) is essential for running their containerized workloads, we appreciate that there are scenarios where using Amazon EC2 instances and kOps are a viable alternative. At AWS, we understand “one size does not fit all.” While we encourage Kubernetes users to contribute their feedback to the AWS container roadmap so that we can improve our services, we also would like to reduce heavy lifting and simplify Spot best practices integration in kOps clusters.
To simplify the integration of Spot Instances in kOps clusters, in January of 2021 we introduced a new kops toolbox command: kops toolbox instance-selector. The utility is distributed as part of the standard kOps distribution. Moreover, it simplifies the creation of kOps Instance Groups by configuring them with full adherence to Spot Instances best practices.
Handling Spot interruption notifications in Kubernetes
Let’s quickly recap Spot best practices. Spot Instances perform exactly like any other EC2 Instances, except that in exchange for their discounted price, they can be interrupted with a two-minute warning when EC2 must reclaim capacity. Applications running on Spot can typically recover from transient interruptions by simply starting a new instance. Spot best practices involve measures such as diversifying into as many Spot capacity pools as possible, choosing the right Spot allocation strategy, and utilizing Spot integrated services. These handle the Spot Instances lifecycles for you. This blog post on handling Spot interruptions dives deeper into AWS’s EC2 Spot best practices.
kOps provides a managed add-on that lets you configure the AWS Node Termination Handler within your kOps cluster spec and, more importantly, manages provisioning the necessary infrastructure for you.
To deploy the AWS Node Termination Handler, we start by editing our cluster spec:
kops edit cluster --name ${KOPS_CLUSTER_NAME}
We append the nodeTerminationHandler configuration to the spec node:
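A minimal sketch of that addition (field names vary by kOps version, so treat this as an assumption rather than the exact configuration):
spec:
  nodeTerminationHandler:
    enabled: true
    # Use the SQS-based (queue-processor) mode so that Spot interruptions,
    # rebalance recommendations, and ASG lifecycle events drain nodes gracefully
    enableSQSTerminationDraining: true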
${KOPS_CLUSTER_NAME} refers to the environment variable containing the cluster name, and ${KOPS_STATE_STORE} indicates the Amazon Simple Storage Service (S3) bucket – or kOps State Store – where kOps configuration is stored.
To check that your Node Termination Handler deployment was successful, you can execute:
kubectl get deployment aws-node-termination-handler -n kube-system
Instance Flexibility and Diversification
Diversification and selection of multiple instance types is essential to acquire and maintain Spot capacity, as well as to successfully replace interrupted instances with others from different pools. When running kOps on AWS, this is implemented by utilizing Amazon EC2 Auto Scaling. The Auto Scaling group’s capacity-optimized allocation strategy ensures that Spot capacity is provisioned from the optimal pools, thereby reducing the chances of Spot terminations.
Simplifying adoption of Spot Best practices on kOps
Before the kops toolbox instance-selector, you would have to set up Spot best practices on kOps manually. This involved writing a stub file following the InstanceGroup specification and examples, and then implementing every best practice, including finding every pool that qualifies for our workload.
The new functionality in kops toolbox instance-selector simplifies InstanceGroup creation by moving the focus of kOps users and administrators from this manual configuration over to simply selecting the vCPU and memory requirements for their application (or a base instance type), and then letting kops toolbox instance-selector define the right configuration. Behind the scenes, it utilizes a library that plugs into the feature set of Amazon EC2 instance selector. At its core, ec2-instance-selector helps you select compatible instance types for your application to run on. Use the ec2-instance-selector CLI or library when automating your configurations. In the case of kOps, the integration already comes in the kops toolbox.
For example, let’s say your cluster runs stateless, fault-tolerant applications that are CPU/memory bound and have a vCPU-to-memory ratio requiring at least 1 vCPU : 4 GB of RAM. You can run the following command to acquire Spot capacity for the cluster:
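A command along these lines requests that capacity. The instance group base name is illustrative, and the exact flag names may vary by kOps release, so run kops toolbox instance-selector --help to confirm them:
kops toolbox instance-selector spot-group \
  --usage-class spot --flexible --cluster-autoscaler \
  --vcpus-to-memory-ratio "1:4" \
  --ig-count 2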
Let’s focus first on the command, and later cover its output. You can get a list of parameters and default values by running kops toolbox instance-selector --help. A few default parameters weren’t passed in the command above, but they will be set to sane defaults, such as the maximum and minimum number of instances in the Instance Group. The parameter --flexible refers to our request to provide a group of flexible instance types spanning multiple generations.
Once you’ve defined the InstanceGroups, start them up by using the command:
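For example, using the environment variables defined earlier:
kops update cluster --name ${KOPS_CLUSTER_NAME} --state ${KOPS_STATE_STORE} --yes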
The two commands above define and create a request for Spot capacity from a flexible and diversified pool set that meets the criteria of at least 4 GB of RAM for each vCPU. The command creates not just one, but two node groups, named “spot-group-1” and “spot-group-2” (--ig-count 2).
Now, let’s check the contents of the configuration file generated by kops toolbox instance-selector. To preview a configuration without making changes, add --dry-run --output yaml.
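The generated InstanceGroup resembles the following sketch; the instance types, sizes, subnet, and cluster name placeholder are illustrative, and your output will reflect your own cluster and Region:
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: ${KOPS_CLUSTER_NAME}
  name: spot-group-1
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "1"
    k8s.io/cluster-autoscaler/${KOPS_CLUSTER_NAME}: "1"
  machineType: m5.xlarge
  maxSize: 15
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - m5.xlarge
    - m5a.xlarge
    - m4.xlarge
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: spot-group-1
  role: Node
  subnets:
  - us-east-1a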
The configuration above lists one of the groups created by kops toolbox instance-selector in the previous example. The second group will have a very similar make-up and format, except that it will refer to instances such as r3.xlarge, r4.xlarge, r5.xlarge, and r5a.xlarge in the mixedInstancesPolicy section. By setting the parameter --usage-class to spot, the configuration created by kops toolbox instance-selector adds the tags identifying this Auto Scaling group as a Spot group. When the nodes are initialized, the kOps controller identifies them as Spot and adds the label node-role.kubernetes.io/spot-worker=true, so at a later stage we can apply placement logic to our cluster by using nodeSelector and affinity. The configuration adheres to the kOps definition of mixed Instance Groups on AWS, and adds all of the right cloudLabels to integrate not only with Spot best practices, but also with Cluster Autoscaler auto-discovery configuration best practices.
Kubernetes Cluster Autoscaler is a Kubernetes controller that dynamically adjusts the cluster size. According to a 2020 survey by the Cloud Native Computing Foundation (CNCF), 70% of Kubernetes users plan to autoscale their stateless applications. Dynamically scaling applications and clusters is a great practice for optimizing your costs when capacity isn’t needed, as well as for scaling out to meet business demand. If there are Pods that can’t be scheduled due to insufficient resources, Cluster Autoscaler issues a scale-out action. When nodes in the cluster have been under-utilized for a configurable period of time, Cluster Autoscaler scales the cluster in, even down to 0 instances when no applications need to run.
On scale-out operations, Cluster Autoscaler evaluates a set of node groups. When Cluster Autoscaler runs on AWS, node groups are implemented by using Auto Scaling groups (each Auto Scaling group backs one kOps Instance Group). To calculate the number of nodes to add, Cluster Autoscaler assumes that every instance in a node group has the same number of vCPUs and memory size.
By creating two node groups, you apply two diversification levels. You diversify within each node group by using an Auto Scaling group with Mixed Instance Policies and capacity-optimized allocation strategy. Then, to increase the pool range you can leverage, you add more than one node group, while still adhering to the best practices required by Cluster Autoscaler.
While we’ve been focusing on Spot Instances, the parameter --usage-class can be utilized to get On-Demand Instances instead of Spot. In the next example, let’s say we would like to get On-Demand capacity in order to train complex deep learning models that will take hours to run. To train our models, we need instances that have at least one GPU with 16 GB of RAM, on instances that have at least 32 GB of RAM and 8 vCPUs.
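A request along these lines could express those requirements. The flag names and GiB value formats below are those of the underlying amazon-ec2-instance-selector and are shown as an assumption, so confirm them with kops toolbox instance-selector --help before running:
kops toolbox instance-selector ondemand-gpu-group \
  --usage-class on-demand \
  --gpus-min 1 --gpu-memory-total-min 16 \
  --vcpus-min 8 --memory-min 32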
The command above, followed by kops update cluster --state=${KOPS_STATE_STORE} --name=${KOPS_CLUSTER_NAME} --yes, can be utilized to produce a configuration and create a node group with the right requirements. This could be created at the start of the training procedure, and then – once the training is done and the capacity is no longer needed – you could automate the node group removal with the following command:
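For example, using the illustrative instance group name from the sketch above:
kops delete instancegroup ondemand-gpu-group --name ${KOPS_CLUSTER_NAME} --state ${KOPS_STATE_STORE} --yes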
We believe the best way to run Kubernetes on AWS is by using Amazon EKS. However, scenarios may exist where kOps is utilized in AWS. By using the kOps managed add-on to install aws-node-termination-handler and kops toolbox instance-selector, it is easier than ever to apply Spot best practices to Kubernetes workloads on kOps, and cost-optimize fault-tolerant, stateless applications. These tools let kOps workloads gracefully terminate applications, as well as proactively handle the replacement of instances that are at an elevated risk of termination. kops toolbox instance-selector leverages Amazon ec2-instance-selector in order to simplify the creation of Instance Group configurations adhering to Spot Instances best practices, implementing instance type flexibility, and utilizing capacity-optimized allocation strategy.
By adhering to these best practices to reduce the frequency of Spot interruptions, we will optimize not only the cost, but also our Spot Instances selection. This will enable us to acquire capacity at a massive scale if necessary.
To start using the tools we have described, follow along with this step-by-step tutorial. Also, head over to the kops toolbox documentation to learn more about the ways in which you can use it.
Customers want a single and comprehensive view of the security posture of their workloads. Runtime security event monitoring is important to building secure, operationally excellent, and reliable workloads, especially in environments that run containers and container orchestration platforms. In this blog post, we show you how to use services such as AWS Security Hub and Falco, a Cloud Native Computing Foundation project, to build a continuous runtime security monitoring solution.
With the solution in place, you can collect runtime security findings from multiple AWS accounts running one or more workloads on AWS container orchestration platforms, such as Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS). The solution collates the findings across those accounts into a designated account where you can view the security posture across accounts and workloads.
Solution overview
Security Hub collects security findings from other AWS services using the standardized AWS Security Finding Format (ASFF). Falco provides the ability to detect security events at runtime for containers. Partner integrations like Falco are also available on Security Hub and use ASFF. Security Hub provides a custom integrations feature using ASFF to enable collection and aggregation of findings that are generated by custom security products.
Figure 1: Architecture diagram of continuous runtime security monitoring
Here’s how the solution works, as shown in Figure 1:
An AWS account is running a workload on Amazon EKS.
Runtime security events detected by Falco for that workload are sent to Amazon CloudWatch Logs using AWS FireLens.
CloudWatch Logs receives the logs from FireLens and acts as the trigger source for the Lambda function in the next step.
The Lambda function transforms the logs into the ASFF. These findings can now be imported into Security Hub.
The Security Hub instance that is running in the same account as the workload running on Amazon EKS stores and processes the findings provided by Lambda and provides the security posture to users of the account. This instance also acts as a member account for Security Hub.
Another AWS account is running a workload on Amazon ECS.
Runtime security events detected by Falco for that workload are sent to CloudWatch Logs using AWS FireLens.
CloudWatch Logs receives the logs from FireLens and acts as the trigger source for the Lambda function in the next step.
The Lambda function transforms the logs into the ASFF. These findings can now be imported into Security Hub.
The Security Hub instance that is running in the same account as the workload running on Amazon ECS stores and processes the findings provided by Lambda and provides the security posture to users of the account. This instance also acts as another member account for Security Hub.
The designated Security Hub administrator account combines the findings generated by the two member accounts, and then provides a comprehensive view of security alerts and security posture across AWS accounts. If your workloads span multiple regions, Security Hub supports aggregating findings across Regions.
Prerequisites
For this walkthrough, you should have the following in place:
Three AWS accounts.
Note: We recommend three accounts so you can experience Security Hub’s support for a multi-account setup. However, you can use a single AWS account instead to host the Amazon ECS and Amazon EKS workloads, and send findings to Security Hub in the same account. If you are using a single account, skip the following account-specific guidance. If you are integrated with AWS Organizations, the designated Security Hub administrator account will automatically have access to the member accounts.
Security Hub set up with an administrator account on one account.
Security Hub set up with member accounts on two accounts: one account to host the Amazon EKS workload, and one account to host the Amazon ECS workload.
Falco set up on the Amazon EKS and Amazon ECS clusters, with logs routed to CloudWatch Logs using FireLens. For instructions on how to do this, see:
Important: Take note of the names of the CloudWatch Logs groups, as you will need them in the next section.
AWS Cloud Development Kit (CDK) installed on the member accounts to deploy the solution that provides the custom integration between Falco and Security Hub.
Deploying the solution
In this section, you will learn how to deploy the solution and enable the CloudWatch Logs group. Enabling the CloudWatch Logs group is the trigger for running the Lambda function in both member accounts.
To deploy this solution in your own account
Clone the aws-securityhub-falco-ecs-eks-integration GitHub repository by running the following command:
git clone https://github.com/aws-samples/aws-securityhub-falco-ecs-eks-integration
Follow the instructions in the README file provided on GitHub to build and deploy the solution. Make sure that you deploy the solution to the accounts hosting the Amazon EKS and Amazon ECS clusters.
Navigate to the AWS Lambda console and confirm that you see the newly created Lambda function. You will use this function in the next section.
Figure 2: Lambda function for Falco integration with Security Hub
To enable the CloudWatch Logs group
In the AWS Management Console, select the Lambda function shown in Figure 2—AwsSecurityhubFalcoEcsEksln-lambdafunction—and then, on the Function overview screen, select + Add trigger.
On the Add trigger screen, provide the following information and then select Add, as shown in Figure 3.
Trigger configuration – From the drop-down, select CloudWatch logs.
Log group – Choose the Log group you noted in Step 4 of the Prerequisites. In our setup, the log group for the Amazon ECS and Amazon EKS clusters, deployed in separate AWS accounts, was set with the same value (falco).
Filter name – Provide a name for the filter. In our example, we used the name falco.
Filter pattern – optional – Leave this field blank.
Figure 3: Lambda function trigger – CloudWatch Log group
Repeat these steps (as applicable) to set up the trigger for the Lambda function deployed in other accounts.
Testing the deployment
Now that you’ve deployed the solution, you will verify that it’s working.
With the default rules, Falco generates alerts for activities such as:
An attempt to write to a file below the /etc folder. The /etc folder contains important system configuration files.
An attempt to open a sensitive file (such as /etc/shadow) for reading.
To test your deployment, you will attempt to perform these activities to generate Falco alerts that are reported as Security Hub findings in the same account. Then you will review the findings.
To test the deployment in member account 1
Run the following commands to trigger an alert in member account 1, which is running an Amazon EKS cluster. Replace <container_name> with your own value.
kubectl exec -it <container_name> /bin/bash
touch /etc/5
cat /etc/shadow > /dev/null
To see the list of findings, log in to your Security Hub admin account and navigate to Security Hub > Findings. As shown in Figure 4, you will see the alerts generated by Falco, including the Falco-generated title, and the instance where the alert was triggered.
Figure 4: Findings in Security Hub
To see more detail about a finding, check the box next to the finding. Figure 5 shows some of the details for the finding Read sensitive file untrusted.
Figure 6 shows the Resources section of this finding, which includes the instance ID of the Amazon EKS cluster node. In our example, this is the Amazon Elastic Compute Cloud (Amazon EC2) instance.
To test the deployment in member account 2
Run the following commands to trigger a Falco alert in member account 2, which is running an Amazon ECS cluster. Replace <container_id> with your own value.
docker exec -it <container_id> bash
touch /etc/5
cat /etc/shadow > /dev/null
As in the preceding example with member account 1, to view the findings related to this alert, navigate to your Security Hub admin account and select Findings.
To view the collated findings from both member accounts in Security Hub
In the designated Security Hub administrator account, navigate to Security Hub > Findings. The findings from both member accounts are collated in the designated Security Hub administrator account. You can use this centralized account to view the security posture across accounts and workloads. Figure 7 shows two findings, one from each member account, viewable in the designated administrator account as a single pane of glass.
Figure 7: Write below /etc findings in a single view
To see more information and a link to the corresponding member account where the finding was generated, check the box next to the finding. Figure 8 shows the account detail associated with a specific finding in member account 1.
Figure 8: Write under /etc detail view in Security Hub admin account
By centralizing and enriching the findings from Falco, you can take action more quickly or perform automated remediation on the impacted resources.
Clean up
Delete the Amazon EKS and Amazon ECS clusters created as part of the Prerequisites.
Conclusion
In this post, you learned how to achieve multi-account continuous runtime security monitoring for container-based workloads running on Amazon EKS and Amazon ECS. This is achieved by creating a custom integration between Falco and Security Hub.
You can extend this solution in a number of ways. For example:
You can forward findings across accounts using a single source to security information and event management (SIEM) tools such as Splunk.
You can perform automated remediation activities based on the findings generated, using Lambda.
This post is written by Brad Kirby, Principal Outposts Specialist, and Chris Lunsford, Senior Outposts SA.
Customers are modernizing applications by deconstructing monolithic architectures and migrating application components into container–based, service-oriented, and microservices architectures. Modern applications improve scalability, reliability, and development efficiency by allowing services to be owned by smaller, more focused teams.
This post shows how you can combine Amazon Elastic Kubernetes Service (Amazon EKS) with AWS Outposts to deploy managed Kubernetes application environments on-premises alongside your existing data and applications.
For a brief introduction to the subject matter, watch this video, which demonstrates:
The benefits of application modernization
How containers are an ideal enabling technology for microservices architectures
How AWS Outposts combined with Amazon container services enables you to unwind complex service interdependencies and modernize on-premises applications with low latency, local data processing, and data residency requirements
Understanding the Amazon EKS on AWS Outposts architecture
Amazon EKS
Many organizations chose Kubernetes as their container orchestration platform because of its openness, flexibility, and a growing pool of Kubernetes literate IT professionals. Amazon EKS enables you to run Kubernetes clusters on AWS without needing to install and manage the Kubernetes control plane. The control plane, including the API servers, scheduler, and cluster store services, runs within a managed environment in the AWS Region. Kubernetes clients (like kubectl) and cluster worker nodes communicate with the managed control plane via public and private EKS service endpoints.
AWS Outposts
The AWS Outposts service delivers AWS infrastructure and services to on-premises locations from the AWS Global Cloud Infrastructure. An Outpost functions as an extension of the Availability Zone (AZ) where it is anchored. AWS operates, monitors, and manages AWS Outposts infrastructure as part of its parent Region. Each Outpost connects back to its anchor AZ via a Service Link connection (a set of encrypted VPN tunnels). AWS Outposts extends Virtual Private Cloud (VPC) environments from the Region to on-premises locations and enables you to deploy VPC subnets to Outposts in your data center and co-location spaces. The Outposts Local Gateway (LGW) routes traffic between Outpost VPC subnets and the on-premises network.
Amazon EKS on AWS Outposts
You deploy Amazon EKS worker nodes to Outposts using self-managed node groups. The worker nodes run on Outposts and register with the Kubernetes control plane in the AWS Region. The worker nodes, and containers running on the nodes, can communicate with AWS services and resources running on the Outpost and in the region (via the Service Link) and with on-premises networks (via the Local Gateway).
You use the same AWS and Kubernetes tools and APIs to work with EKS on Outposts nodes that you use to work with EKS nodes in the Region. You can use eksctl, the AWS Management Console, AWS CLI, or infrastructure as code (IaC) tools like AWS CloudFormation or HashiCorp Terraform to create self-managed node groups on AWS Outposts.
Amazon EKS self-managed node groups on AWS Outposts
Like many customers, you might use managed node groups when you deploy EKS worker nodes in your Region, and you may be wondering, “what are self-managed node groups?”
Self-managed node groups, like managed node groups, use standard Amazon Elastic Compute Cloud (Amazon EC2) instances in EC2 Auto Scaling groups to deploy, scale-up, and scale-down EKS worker nodes using Amazon EKS optimized Amazon Machine Images (AMIs). Amazon configures the EKS optimized AMIs to work with the EKS service. The images include Docker, kubelet, and the AWS IAM Authenticator. The AMIs also contain a specialized bootstrap script /etc/eks/bootstrap.sh that allows worker nodes to discover and connect to your cluster control plane and add Kubernetes labels that identify the nodes in the cluster.
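For illustration, a self-managed node’s EC2 user data typically invokes that bootstrap script; the cluster name and node label in this sketch are hypothetical:
#!/bin/bash
set -ex
# Join the node to the cluster and apply an identifying Kubernetes label
/etc/eks/bootstrap.sh my-eks-cluster \
  --kubelet-extra-args '--node-labels=node-group=outposts-node-group'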
What makes the node groups self-managed? Managed node groups have additional features that simplify updating nodes in the node group. With self-managed nodes, you must implement a process to update or replace your node group when you want to update the nodes to use a new Amazon EKS optimized AMI.
You create self-managed node groups on AWS Outposts using the same process and resources that you use to deploy EC2 instances using EC2 Auto Scaling groups. All instances in a self-managed node group must:
Be tagged with the kubernetes.io/cluster/<cluster-name>=owned tag
Additionally, to deploy on AWS Outposts, instances must:
Use encrypted EBS volumes
Launch in Outposts subnets
Kubernetes authenticates all API requests to a cluster. You must configure an EKS cluster to allow nodes from a self-managed node group to join the cluster. Self-managed nodes use the node IAM role to authenticate with an EKS cluster. Amazon EKS clusters use the AWS IAM Authenticator for Kubernetes to authenticate requests and Kubernetes native Role Based Access Control (RBAC) to authorize requests. To enable self-managed nodes to register with a cluster, configure the AWS IAM Authenticator to recognize the node IAM role and assign the role to the system:bootstrappers and system:nodes RBAC groups.
In the following tutorial, we take you through the steps required to deploy EKS worker nodes on an Outpost and register those nodes with an Amazon EKS cluster running in the Region. We created a sample Terraform module aws-eks-self-managed-node-group to help you get started quickly. If you are interested, you can dive into the module sample code to see the detailed configurations for self-managed node groups.
Deploying Amazon EKS on AWS Outposts (Terraform)
To deploy Amazon EKS nodes on an AWS Outposts deployment, you must complete two high-level steps:
Step 1: Create a self-managed node group
Step 2: Configure Kubernetes to allow the nodes to register
We use Terraform in this tutorial to provide visibility (through the sample code) into the details of the configurations required to deploy self-managed node groups on AWS Outposts.
Prerequisites
To follow along with this tutorial, you should have the following prerequisites:
An AWS account.
An operational AWS Outposts deployment (Outpost) associated with your AWS account.
Familiarity with Terraform HashiCorp Configuration Language (HCL) syntax and experience using Terraform to deploy AWS resources.
An existing VPC that meets the requirements for an Amazon EKS cluster. For more information, see Cluster VPC considerations. You can use the Getting started with Amazon EKS guide to walk you through creating a VPC that meets the requirements.
An existing Amazon EKS cluster. You can use the Creating an Amazon EKS cluster guide to walk you through creating the cluster.
Tip: We recommend creating and using a new VPC and Amazon EKS cluster to complete this tutorial. Procedures like modifying the aws-auth ConfigMap on the cluster may impact other nodes, users, and workloads on the cluster. Using a new VPC and Amazon EKS cluster will help minimize the risk of impacting other applications as you complete the tutorial.
Note: Do not reference subnets deployed on AWS Outposts when creating Amazon EKS clusters in the Region. The Amazon EKS cluster control plane runs in the Region and attaches to VPC subnets in the Region availability zones.
A symmetric KMS key to encrypt the EBS volumes of the EKS worker nodes. You can use the alias/aws/ebs AWS managed key for this prerequisite.
Using the sample code
The source code for the amazon-eks-self-managed-node-group Terraform module is available in AWS-Samples on GitHub.
Setup
To follow along with this tutorial, complete the following steps to set up your workstation:
Open your terminal.
Make a new directory to hold your EKS on Outposts Terraform configurations.
Change to the new directory.
Create a new file named providers.tf with the following contents:
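The sample module targets recent Terraform and provider versions; a providers.tf along the lines of the following sketch declares them (the version constraints are assumptions, so adjust them to match the module’s README):
terraform {
  required_version = ">= 0.13"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.0"
    }
  }
}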
Keep your terminal open and do all the work in this tutorial from this directory.
Step 1: Create a self-managed node group
To create a self-managed node group on your Outpost
Create a new Terraform configuration file named self-managed-node-group.tf.
Configure the aws Terraform provider with your AWS CLI profile and Outpost parent Region:
provider "aws" {
region = "us-west-2"
profile = "default"
}
Configure the aws-eks-self-managed-node-group module with the following (minimum) arguments:
eks_cluster_name – the name of your EKS cluster
instance_type – an instance type supported on your AWS Outposts deployment
desired_capacity, min_size, and max_size – set as desired to control the number of nodes in your node group (ensure that your Outpost has sufficient resources available to run the desired number of nodes of the specified instance type)
subnets – the subnet IDs of the Outpost subnets where you want the nodes deployed
node_labels – (Optional) Kubernetes labels to apply to the nodes
ebs_encrypted – set to true, and configure ebs_kms_key_arn with the KMS key you want to use to encrypt the nodes’ EBS volumes (required for Outposts deployments)
You may configure other optional arguments as appropriate for your use case – see the module README for details.
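Putting these arguments together, a sketch of the module configuration might look like the following; the source path, instance type, subnet ID, label, and KMS key ARN are hypothetical placeholders for your own values:
module "eks_self_managed_node_group" {
  source = "github.com/aws-samples/amazon-eks-self-managed-node-group"

  eks_cluster_name = "cmluns-eks-cluster"
  instance_type    = "m5.2xlarge" # an instance type available on your Outpost
  desired_capacity = 1
  min_size         = 1
  max_size         = 1

  # Outpost subnet ID(s) where the nodes will launch
  subnets = ["subnet-0123456789abcdef0"]

  # Optional Kubernetes labels for the nodes
  node_labels = {
    "node-group" = "outposts-node-group"
  }

  # Required for Outposts deployments
  ebs_encrypted   = true
  ebs_kms_key_arn = "arn:aws:kms:us-west-2:111122223333:key/EXAMPLE-KEY-ID"
}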
Step 2: Configure Kubernetes to allow the nodes to register
Use the following procedures to configure Terraform to manage the AWS IAM Authenticator (aws-auth) ConfigMap on the cluster. This adds the node-group IAM role to the IAM authenticator and Kubernetes RBAC groups.
Note: If you add a node group to a cluster with existing node groups, mapped IAM roles, or mapped IAM users, the aws-auth ConfigMap may already be configured on your cluster. If the ConfigMap exists, you must download, edit, and replace the ConfigMap on the cluster using kubectl. We do not provide a procedure for this operation as it may affect workloads running on your cluster. Please see the section Managing users or IAM roles for your cluster in the Amazon EKS User Guide for more information.
To check if your cluster has the aws-auth ConfigMap configured
Run the aws eks --region <region> update-kubeconfig --name <cluster-name> command to update your workstation’s ~/.kube/config with the information needed to connect to your cluster, substituting your <region> and <cluster-name>.
Run the kubectl describe configmap -n kube-system aws-auth command.
If you receive an error stating, Error from server (NotFound): configmaps "aws-auth" not found, then proceed with the following procedures to use Terraform to apply the ConfigMap.
❯ kubectl describe configmap -n kube-system aws-auth
Error from server (NotFound): configmaps "aws-auth" not found
If you do not receive the preceding error, and kubectl returns an aws-auth ConfigMap, then you should not use Terraform to manage the ConfigMap.
To configure the Terraform Kubernetes provider
Create a new Terraform configuration file named aws-auth-config-map.tf.
Add the aws_eks_cluster Terraform data source, and configure it to look up your cluster by name.
data "aws_eks_cluster" "selected" {
name = "cmluns-eks-cluster"
}
Add the aws_eks_cluster_auth Terraform data source, and configure it to look up your cluster by name.
data "aws_eks_cluster_auth" "selected" {
name = "cmluns-eks-cluster"
}
Configure the kubernetes provider with your cluster host (endpoint address), cluster_ca_certificate, and the token from the aws_eks_cluster and aws_eks_cluster_auth data sources.
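A sketch of that provider configuration, plus a kubernetes_config_map resource that maps the node IAM role into the RBAC groups described earlier, follows; the role ARN is a hypothetical placeholder, so use the node role created for your node group:
provider "kubernetes" {
  host                   = data.aws_eks_cluster.selected.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.selected.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.selected.token
}

resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # Hypothetical node IAM role ARN; replace with the role from your node group
    mapRoles = yamlencode([
      {
        rolearn  = "arn:aws:iam::111122223333:role/eks-outposts-node-group-role"
        username = "system:node:{{EC2PrivateDNSName}}"
        groups   = ["system:bootstrappers", "system:nodes"]
      }
    ])
  }
}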
Apply the configuration and verify node registration
You have created the Terraform configurations to deploy an EKS self-managed node group to your Outpost and configure Kubernetes to authenticate the nodes. Now you apply the configurations and verify that the nodes register with your Kubernetes cluster.
To apply the Terraform configurations
Run the terraform init command to download the providers, self-managed node group module, and prepare the directory for use.
Run the terraform apply command.
Review the resources that will be created.
Enter yes to confirm that you want to create the resources.
❯ terraform apply
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
<-- Output omitted for brevity -->
Plan: 9 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
Run the aws eks --region <region> update-kubeconfig --name <cluster-name> command to update your workstation’s ~/.kube/config with the information required to connect to your cluster – substituting your <region> and <cluster-name>.
Run the kubectl get nodes command to view the status of your cluster nodes.
Verify your new nodes show up in the list with a STATUS of Ready.
❯ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-253-46-119.us-west-2.compute.internal Ready <none> 37s v1.18.9-eks-d1db3c
(Optional) If your nodes are registering, and their STATUS does not show Ready, run the kubectl get nodes --watch command to watch them come online.
(Optional) Run the kubectl get nodes --show-labels command to view the node list with the labels assigned to each node. The nodes created by your AWS Outposts node group will have the labels you assigned in Step 1.
To verify the Kubernetes system pods deploy on the worker nodes
Run the kubectl get pods --namespace kube-system command.
Verify that each pod shows READY 1/1 with a STATUS of Running.
❯ kubectl get pods --namespace kube-system
NAME READY STATUS RESTARTS AGE
aws-node-84xlc 1/1 Running 0 2m16s
coredns-559b5db75d-jxqbp 1/1 Running 0 5m36s
coredns-559b5db75d-vdccc 1/1 Running 0 5m36s
kube-proxy-fspw4 1/1 Running 0 2m16s
The nodes in your AWS Outposts node group should be registered with the EKS control plane in the Region and the Kubernetes system pods should successfully deploy on the new nodes.
Clean up
One of the nice things about using infrastructure as code tools like Terraform is that they automate stack creation and deletion. Use the following procedure to remove the resources you created in this tutorial.
To clean up the resources created in this tutorial
Run the terraform destroy command.
Review the resources that will be destroyed.
Enter yes to confirm that you want to destroy the resources.
❯ terraform destroy
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
- destroy
Terraform will perform the following actions:
Plan: 0 to add, 0 to change, 9 to destroy.
Do you really want to destroy all resources?
Terraform will destroy all your managed infrastructure, as shown above.
There is no undo. Only 'yes' will be accepted to confirm.
Enter a value: yes
Press Enter.
Destroy complete! Resources: 9 destroyed.
Clean up any resources you created for the prerequisites.
Conclusion
In this post, we discussed how containers sit at the heart of the application modernization process, making it easy to adopt microservices architectures that improve application scalability, availability, and performance. We also outlined the challenges associated with modernizing on-premises applications with low latency, local data processing, and data residency requirements. In these cases, AWS Outposts brings cloud services, like Amazon EKS, close to the workloads being refactored. We also walked you through deploying Amazon EKS worker nodes on-premises on AWS Outposts.
Now that you have deployed Amazon EKS worker nodes on an Outpost in a test VPC using Terraform, you should adapt the Terraform module(s) and resources to prepare and deploy your production Kubernetes clusters on-premises on Outposts. If you don’t have an Outpost and you want to get started modernizing your on-premises applications with Amazon EKS and AWS Outposts, contact your local AWS account Solutions Architect (SA) to help you get started.
I’m pleased to announce that the Defense Information Systems Agency (DISA) has authorized 17 additional Amazon Web Services (AWS) services and features in the AWS GovCloud (US) Regions, bringing the total to 105 services and major features that are authorized for use by the U.S. Department of Defense (DoD). AWS now offers additional services to DoD mission owners in these categories: business applications; computing; containers; cost management; developer tools; management and governance; media services; security, identity, and compliance; and storage.
Why does authorization matter?
DISA authorization of 17 new cloud services enables mission owners to build secure, innovative solutions, including systems that process unclassified national security data (for example, at Impact Level 5). DISA’s authorization demonstrates that AWS effectively implemented more than 421 security controls by using applicable criteria from NIST SP 800-53 Revision 4, the US General Services Administration’s FedRAMP High baseline, and the DoD Cloud Computing Security Requirements Guide.
Recently authorized AWS services at DoD Impact Levels (IL) 4 and 5 include the following:
AWS Marketplace – A digital catalog with thousands of software listings from independent software vendors that you can use to find, test, buy, and deploy software that runs on AWS
Amazon Textract – Extract printed text, handwriting, and data from virtually any document
Security, Identity & Compliance
Amazon Cognito – Secure user sign-up, sign-in, and access control
AWS Security Hub – Centrally view and manage security alerts and automate security checks
Storage
AWS Backup – Centrally manage and automate backups across AWS services
Figure 1 shows the IL 4 and IL 5 AWS services that are now authorized for DoD workloads, broken out into functional categories.
Figure 1: The AWS services newly authorized by DISA
To learn more about AWS solutions for the DoD, see our AWS solution offerings. Follow the AWS Security Blog for updates on our Services in Scope by Compliance Program. If you have feedback about this blog post, let us know in the Comments section below.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
This blog post demonstrates how to leverage Jenkins with Amazon Elastic Kubernetes Service (Amazon EKS) by running a Jenkins Manager within an EKS pod. In doing so, we can run Jenkins workloads by allowing Amazon EKS to spawn dynamic Jenkins Agent(s) in order to perform application and infrastructure deployment. Traditionally, customers set up a Jenkins Manager-Agent architecture that contains a set of manually added nodes with no autoscaling capabilities. Implementing the strategy in this post instead ensures a robust approach that optimizes performance with the right-sized compute capacity needed to successfully perform the build tasks.
In setting up our Amazon EKS cluster with Jenkins, we’ll utilize eksctl, a simple CLI tool for creating clusters on EKS. Then, we’ll build both the Jenkins Manager and Jenkins Agent images. Afterward, we’ll run a container deployment on our cluster to access the Jenkins application and utilize the dynamic Jenkins Agent pods to run pipelines and jobs.
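For reference, a minimal eksctl invocation looks like the following; the cluster name, Region, and node count are illustrative and should match your own environment:
eksctl create cluster --name jenkins-eks-cluster --region us-east-1 --nodes 2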
Solution Overview
The architecture below illustrates the execution steps.
Figure 1. Solution overview diagram
Disclaimer: This Jenkins application is not configured with persistent volume storage. Therefore, you must establish and configure this template to fit that requirement.
To accomplish this deployment workflow, we will do the following:
Centralized Shared Services account
Deploy the Amazon EKS Cluster into a Centralized Shared Services Account.
Create the Amazon ECR Repository for the Jenkins Manager and Jenkins Agent to store docker images.
Deploy the kubernetes manifest file for the Jenkins Manager.
Target Account(s)
Establish a set of AWS Identity and Access Management (IAM) roles with permissions for cross-account access from the Shared Services account into the Target account(s).
Jenkins Application UI
Jenkins Plugins – Install and configure the Kubernetes Plugin and CloudBees AWS Credentials Plugin from Manage Plugins (you will not have to manually install this since it will be packaged and installed as part of the Jenkins image build).
Jenkins Pipeline Example—Fetch the Jenkinsfile to deploy an S3 Bucket with CloudFormation in the Target account using a Jenkins parameterized pipeline.
Prerequisites
The following are the minimum requirements for ensuring that this solution works.
Account Prerequisites
Shared Services Account: The location of the Amazon EKS Cluster.
Target Account: The destination of the CI/CD pipeline deployments.
This blog provides a high-level overview of the best practices for cross-account deployment and isolation maintenance between the applications. We evaluated the cross-account application deployment permissions and will describe the current state as well as what to avoid. As part of the security best practices, we will maintain isolation among multiple apps deployed in these environments, e.g., Pipeline 1 does not deploy to the Pipeline 2 infrastructure.
Requirement
A Jenkins manager is running as a container in an EC2 compute instance that resides within a Shared AWS account. This Jenkins application represents individual pipelines deploying unique microservices that build and deploy to multiple environments in separate AWS accounts. The cross-account deployment utilizes the target AWS account admin credentials in order to do the deployment.
Sharing account credentials externally in this way is not good practice. Additionally, the risk of deployment errors should be eliminated, and application isolation should be maintained within the same account.
Note that the deployment steps are being run using AWS CLIs, thus our solution will be focused on AWS CLI usage.
The risk is much lower when utilizing CloudFormation / CDK to conduct deployments, because the AWS CLIs executed from the build jobs specify stack names as parameterized inputs, so the probability of a stack-name error is very low. However, it remains inadvisable to utilize the admin credentials of the target account.
Best Practice — Current Approach
We utilized cross-account roles that can restrict unauthorized access across build jobs. With this approach, we utilize the assume-role concept, which enables the requesting role to obtain temporary credentials (from the STS service) for the target role and execute actions permitted by the target role. This is safer than utilizing hard-coded credentials. The requesting role could be either the inherited EC2 instance role OR specific user credentials. However, in our case, we are utilizing the inherited EC2 instance role.
For ease of understanding, we will refer to the target role as the execution-role below.
Figure 2. Current approach
As per the security best practice of assigning minimum privileges, we must first create an execution role in IAM in the target account that has deployment permissions (either via CloudFormation or via the CLI), e.g., app-dev-role in the Dev account and app-prod-role in the Prod account.
For each of those roles, we configure a trust relationship with the parent account ID (Shared Services account). This enables any role in the Shared Services account (with assume-role permission) to assume the execution-role and deploy to the respective hosting infrastructure, e.g., the app-dev-role in the Dev account will be a common execution role that will deploy various apps across that infrastructure.
Then, we create a local role in the Shared Services account and configure credentials within Jenkins to be utilized by the Build Jobs. Provide the job with the assume-role permissions and specify the list of ARNs across every account. Alternatively, the inherited EC2 instance role can also be utilized to assume the execution-role.
Create Cross-Account IAM Roles
Cross-account IAM roles allow users to securely access AWS resources in a target account while maintaining the observability of that AWS account. The cross-account IAM role includes a trust policy allowing AWS identities in another AWS account to assume the given role. This allows us to create a role in one AWS account that delegates specific permissions to another AWS account.
Create an IAM role with a common name in each target account. The role name we’ve created is AWSCloudFormationStackExecutionRole. The role must have permissions to perform CloudFormation actions and any actions regarding the resources that will be created. In our case, we will be creating an S3 Bucket utilizing CloudFormation.
This IAM role must also have an established trust relationship to the Shared Services account. In this case, the Jenkins Agent will be granted the ability to assume the role of the particular target account from the Shared Services account.
In our case, the IAM entity that will assume the AWSCloudFormationStackExecutionRole is the EKS Node Instance Role that is associated with the EKS cluster nodes.
Build the custom Docker images for the Jenkins Manager and the Jenkins Agent, and then push the images to the Amazon ECR repository. Navigate to the docker/ directory, then execute the command according to the required parameters with the AWS account ID, repository name, Region, and the build folder name (jenkins-manager/ or jenkins-agent/) that resides in the current docker directory. The custom Docker images will contain a set of starter package installations.
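The repository provides the exact build command; as an illustration of the underlying flow, with a hypothetical account ID, Region, and repository name, the steps amount to:
AWS_ACCOUNT_ID=111122223333
AWS_REGION=us-east-1
REPOSITORY=jenkins-manager   # or jenkins-agent

# Authenticate Docker to Amazon ECR
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

# Build, tag, and push the image from the matching build folder
docker build -t $REPOSITORY jenkins-manager/
docker tag $REPOSITORY:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPOSITORY:latest
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPOSITORY:latest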
After building both images, navigate to the k8s/ directory, modify the manifest file for the Jenkins image, and then execute the Jenkins manifest.yaml template to set up the Jenkins application. (Note: This Jenkins application is not configured with persistent volume storage. Therefore, you will need to establish and configure this template to fit that requirement.)
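Assuming the namespace and manifest file names used in this walkthrough, the deployment boils down to:
# Run from the k8s/ directory
kubectl create namespace jenkins
kubectl apply -f manifest.yaml -n jenkins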
# Fetch the Application URL or navigate to the AWS Console for the Load Balancer
kubectl get svc -n jenkins
# Verify that jenkins deployment/pods are up running
kubectl get pods -n jenkins
# Replace with jenkins manager pod name and fetch Jenkins login password
kubectl exec -it pod/<JENKINS-MANAGER-POD-NAME> -n jenkins -- cat /var/jenkins_home/secrets/initialAdminPassword
Click Add a new cloud, and then select Kubernetes from the drop-down menu.
Figure 4a. Jenkins Configure Nodes and Clouds
Note: Before proceeding, please ensure that you can access your Amazon EKS cluster information, whether it is through Console or CLI.
Enter a Name in the Kubernetes Cloud configuration field.
Enter the Kubernetes URL which can be found via AWS Console by navigating to the Amazon EKS service and locating the API server endpoint of the cluster, or run the command kubectl cluster-info.
Enter the namespace that will be utilized in the Kubernetes Namespace field. This will determine where the dynamic kubernetes pods will spawn. In our case, the name of the namespace is jenkins.
During the initial setup of Jenkins Manager on kubernetes, there is an environment variable JENKINS_URL that automatically utilizes the Load Balancer URL to resolve requests. However, we will resolve our requests locally to the cluster IP address.
The format is as follows: https://<service-name>.<namespace>.svc.cluster.local
Figure 4b. Configure Kubernetes Cloud
Set AWS Credentials
Security concerns are a key reason why we’re utilizing an IAM role instead of access keys. For any approach involving IAM, it is a best practice to utilize temporary credentials.
You must have the AWS Credentials Binding Plugin installed before this step. Enter the unique ID name as shown in the example below.
Enter the IAM Role ARN you created earlier for both the ID and IAM Role to use in the field as shown below.
Figure 5. AWS Credentials Binding
Figure 6. Managed Credentials
Create a pipeline
Navigate to the Jenkins main menu and select New Item.
Create a Pipeline
Figure 7. Create a pipeline
Configure Jenkins Agent
Setup a Kubernetes YAML template after you’ve built the agent image. In this example, we will be using the k8sPodTemplate.yaml file stored in the k8s/ folder.
The deploy-stack.sh file can accept four different parameters and conduct several types of CloudFormation stack executions, such as deploy, create-changeset, and execute-changeset. This is also reflected in the stages of this Jenkinsfile pipeline. As for the delete-stack.sh file, it accepts two parameters and, when executed, deletes a CloudFormation stack based on the given stack name and region.
In this Jenkinsfile, the individual pipeline build jobs will deploy individual microservices. The k8sPodTemplate.yaml is utilized to specify the kubernetes pod details and the inbound-agent that will be utilized to run the pipeline.
Click Build with Parameters and then select a build action.
Figure 8a. Build with Parameters
Examine the pipeline stages even further for the choice you selected. Also, view more details of the stages below and verify in your AWS account that the CloudFormation stack was executed.
Figure 8b. Pipeline Stage View
The final step is to execute your pipeline and watch the pods spin up dynamically in your terminal. As shown below, the Jenkins Agent pod spawned and then terminated after the work completed. Watch this happen on your own by executing the following command:
# Watch the pods spawn in the "jenkins" namespace
kubectl get pods -n jenkins -w
Clean up
In order to avoid incurring future charges, delete the resources utilized in the walkthrough.
Delete the EKS cluster. You can use eksctl to delete the cluster.
Delete any remaining AWS resources created by EKS such as AWS LoadBalancer, Target Groups, etc.
Delete any related IAM entities.
Conclusion
This post walked you through the process of building out Amazon EKS based infrastructure and integrating Jenkins to orchestrate workloads. We demonstrated how you can utilize this to deploy securely across multiple accounts with dynamic Jenkins agents and create alignment to your business with similar use cases. To learn more about Amazon EKS, see our documentation pages or explore our console.
About the Authors
Vladimir P. Toussaint
Vladimir is a DevOps Cloud Architect at Amazon Web Services. He works with GovCloud customers to build solutions and capabilities as they move to the cloud. Prior to Amazon Web Services, Vladimir leveraged container orchestration tools such as Kubernetes to securely manage microservice applications for large enterprises.
Matt Noyce
Matt is a Sr. Cloud Application Architect at Amazon Web Services. He works primarily with health care and life sciences customers to help them architect and build applications, data lakes, and DevOps pipelines that solve their business needs. In his spare time Matt likes to run and hike along with enjoying time with friends and family.
Nikunj Vaidya
Nikunj is a DevOps Tech Leader at Amazon Web Services. He offers technical guidance to the customers on AWS DevOps solutions and services that would streamline the application development process, accelerate application delivery, and enable maintaining a high bar of software quality. Prior to AWS, Nikunj has worked in software engineering roles, leading transformation projects, driving releases and improvements in the software quality and customer experience.
In this blog post, I demonstrate how to implement service-to-service authorization using OAuth 2.0 access tokens for microservice APIs hosted on Amazon Elastic Kubernetes Service (Amazon EKS). A common use case for OAuth 2.0 access tokens is to facilitate user authorization to a public facing application. Access tokens can also be used to identify and authorize programmatic access to services with a system identity instead of a user identity. In service-to-service authorization, OAuth 2.0 access tokens can be used to help protect your microservice API for the entire development lifecycle and for every application layer. AWS Well Architected recommends that you validate security at all layers, and by incorporating access tokens validated by the microservice, you can minimize the potential impact if your application gateway allows unintended access. The solution sample application in this post includes access token security at the outset. Access tokens are validated in unit tests, local deployment, and remote cluster deployment on Amazon EKS. Amazon Cognito is used as the OAuth 2.0 token issuer.
Benefits of using access token security with microservice APIs
Some of the reasons you should consider using access token security with microservices include the following:
Access tokens provide production-grade security for microservices in non-production environments, and are designed to ensure consistent authentication and authorization and to protect the application developer from changes to security controls at a cluster level.
They enable service-to-service applications to identify the caller and their permissions.
Access tokens are short-lived credentials that expire, which makes them preferable to traditional API gateway long-lived API keys.
You get better system integration with a web or mobile interface, or application gateway, when you include token validation in the microservice at the outset.
Overview of solution
In the solution described in this post, the sample microservice API is deployed to Amazon EKS, with an Application Load Balancer (ALB) for incoming traffic. Figure 1 shows the application architecture on Amazon Web Services (AWS).
Figure 1: Application architecture
The application client shown in Figure 1 represents a service-to-service workflow on Amazon EKS, and shows the following three steps:
The application client requests an access token from the Amazon Cognito user pool token endpoint.
The access token is forwarded to the ALB endpoint over HTTPS when requesting the microservice API, in the bearer token authorization header. The ALB is configured to use IP Classless Inter-Domain Routing (CIDR) range filtering.
The microservice deployed to Amazon EKS validates the access token using JSON Web Key Sets (JWKS), and enforces the authorization claims.
Walkthrough
The walkthrough in this post has the following steps:
Amazon EKS cluster setup
Amazon Cognito configuration
Microservice OAuth 2.0 integration
Unit test the access token claims
Deployment of microservice on Amazon EKS
Integration tests for local and remote deployments
Prerequisites
For this walkthrough, you should have the following prerequisites in place:
A basic familiarity with Kotlin or Java, and frameworks such as Quarkus or Spring
A Unix terminal
Set up
Amazon EKS is the target for your microservices deployment in the sample application. Use the following steps to create an EKS cluster. If you already have an EKS cluster, you can skip to the next section: To set up the AWS Load Balancer Controller. The following example creates an EKS cluster in the Asia Pacific (Singapore) ap-southeast-1 AWS Region. Be sure to update the Region to use your value.
To create an EKS cluster with eksctl
In your Unix editor, create a file named eks-cluster-config.yaml, with the following cluster configuration:
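A minimal sketch of that configuration follows; the cluster name, node group name, instance type, and node count are illustrative:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: demo-eks-cluster
  region: ap-southeast-1

managedNodeGroups:
  - name: demo-nodes
    instanceType: t3.medium
    desiredCapacity: 2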
Create the cluster by using the following eksctl command:
eksctl create cluster -f eks-cluster-config.yaml
Allow 10–15 minutes for the EKS control plane and managed nodes creation. eksctl will automatically add the cluster details in your kubeconfig for use with kubectl.
Validate your cluster node status as “ready” with the following command:
kubectl get nodes
Create the demo namespace to host the sample application by using the following command:
kubectl create namespace demo
With the EKS cluster now up and running, there is one final setup step. The ALB for inbound HTTPS traffic is created by the AWS Load Balancer Controller directly from the EKS cluster using a Kubernetes Ingress resource.
To set up the AWS Load Balancer Controller
Follow the installation steps to deploy the AWS Load Balancer Controller to Amazon EKS.
An Ingress resource defines the ALB configuration. You customize the ALB by using annotations. Create a file named alb.yml, and add a resource definition as follows, replacing the inbound IP CIDR with your values:
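A sketch of such an Ingress follows; the certificate ARN, inbound CIDR, and backend service name and port are hypothetical placeholders for your own values:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alb-ingress
  namespace: demo
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/inbound-cidrs: 203.0.113.0/24
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:ap-southeast-1:111122223333:certificate/EXAMPLE-CERT-ID
spec:
  rules:
    - http:
        paths:
          - path: /api/demo
            pathType: Prefix
            backend:
              service:
                name: demo-service
                port:
                  number: 8080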
Deploy the Ingress resource with kubectl to create the ALB by using the following command:
kubectl apply -f alb.yml
After a few moments, you should see the ALB move from status provisioning to active, with an auto-generated public DNS name.
Validate the ALB DNS name and the ALB is in active status by using the following command:
kubectl -n demo describe ingress alb-ingress
To alias your host (in this case, gateway.example.com) with the ALB, create a Route 53 alias record. The remote API is now accessible using your Route 53 alias, for example: https://gateway.example.com/api/demo/*
The ALB that you created will only allow incoming HTTPS traffic on port 443, and restricts incoming traffic to known source IP addresses. If you want to share the ALB across multiple microservices, you can add the alb.ingress.kubernetes.io/group.name annotation. To help protect the application from common exploits, you should add an annotation to bind AWS WAF (WAFv2) web ACLs, including rate-limiting rules, for the microservice.
Configure the Amazon Cognito user pool
To manage the OAuth 2.0 client credential flow, you create an Amazon Cognito user pool. Use the following procedure to create the Amazon Cognito user pool in the console.
In the top-right corner of the page, choose Create a user pool.
Provide a name for your user pool, and choose Review defaults to save the name.
Review the user pool information and make any necessary changes. Scroll down and choose Create pool.
Note down your created Pool Id, because you will need this for the microservice configuration.
Next, to simulate the client in subsequent tests, you will create three app clients: one for read permission, one for write permission, and one for the microservice.
To create Amazon Cognito app clients
In the left navigation pane, under General settings, choose App clients.
On the right pane, choose Add an app client.
Enter the App client name as readClient.
Leave all other options unchanged.
Choose Create app client to save.
Choose Add another app client, and add an app client with the name writeClient, then repeat step 5 to save.
Choose Add another app client, and add an app client with the name microService. Clear Generate Client Secret, as this isn’t required for the microservice. Leave all other options unchanged. Repeat step 5 to save.
Note down the App client id created for the microService app client, because you will need it to configure the microservice.
You now have three app clients: readClient, writeClient, and microService.
With the read and write clients created, the next step is to create the permission scope (role), which will be subsequently assigned.
To create read and write permission scopes (roles) for use with the app clients
In the left navigation pane, under App integration, choose Resource servers.
On the right pane, choose Add a resource server.
Enter the name Gateway for the resource server.
For the Identifier, enter your host name, in this case https://gateway.example.com. Figure 2 shows the resource identifier and custom scopes for read and write role permission.
Figure 2: Resource identifier and custom scopes
In the first row under Scopes, for Name enter demo.read, and for Description enter Demo Read role.
In the second row under Scopes, for Name enter demo.write, and for Description enter Demo Write role.
Choose Save changes.
You have now completed configuring the custom role scopes that will be bound to the app clients. To complete the app client configuration, you will now bind the role scopes and configure the OAuth2.0 flow.
To configure app clients for client credential flow
In the left navigation pane, under App Integration, select App client settings.
On the right pane, the first of three app clients will be visible.
Scroll to the readClient app client and make the following selections:
For Enabled Identity Providers, select Cognito User Pool.
Under OAuth 2.0, for Allowed OAuth Flows, select Client credentials.
Under OAuth 2.0, under Allowed Custom Scopes, select the demo.read scope.
Leave all other options blank.
Scroll to the writeClient app client and make the following selections:
For Enabled Identity Providers, select Cognito User Pool.
Under OAuth 2.0, for Allowed OAuth Flows, select Client credentials.
Under OAuth 2.0, under Allowed Custom Scopes, select the demo.write scope.
Leave all other options blank.
Scroll to the microService app client and make the following selections:
For Enabled Identity Providers, select Cognito User Pool.
Under OAuth 2.0, for Allowed OAuth Flows, select Client credentials.
Under OAuth 2.0, under Allowed Custom Scopes, select the demo.read scope.
Leave all other options blank.
Figure 3 shows the app client configured with the client credentials flow and custom scope—all other options remain blank
Figure 3: App client configuration
Your Amazon Cognito configuration is now complete. Next you will integrate the microservice with OAuth 2.0.
Microservice OAuth 2.0 integration
For the server-side microservice, you will use Quarkus with Kotlin. Quarkus is a cloud-native microservice framework with strong Kubernetes and AWS integration, for the Java Virtual Machine (JVM) and GraalVM. GraalVM native-image can be used to create native executables, for fast startup and low memory usage, which is important for microservice applications.
To generate the project scaffold, use the Quarkus project generator at code.quarkus.io. On the top left, you can modify the Group, Artifact, and Build Tool to your preference, or accept the defaults.
In the Pick your extensions search box, select each of the following extensions:
RESTEasy JAX-RS
RESTEasy Jackson
Kubernetes
Container Image Jib
OpenID Connect
Choose Generate your application to download your application as a .zip file.
Quarkus permits low-code integration with an identity provider such as Amazon Cognito, and is configured through the project’s application.properties file.
To configure application properties to use the Amazon Cognito IDP
Edit the application.properties file in your quick start project:
src/main/resources/application.properties
Add the following properties, replacing the variables with your values. Use the cognito-pool-id and microservice App client id that you noted down when creating these Amazon Cognito resources in the previous sections, along with your Region.
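The original property values are not reproduced here. As a minimal sketch, assuming the standard quarkus-oidc configuration keys (quarkus.oidc.auth-server-url, quarkus.oidc.client-id, and quarkus.oidc.roles.role-claim-path) and placeholder values for your Region, pool ID, and app client ID, the configuration looks like the following:

cat <<'EOF' >> src/main/resources/application.properties
# Amazon Cognito user pool acts as the OIDC issuer (replace <region> and <cognito-pool-id>)
quarkus.oidc.auth-server-url=https://cognito-idp.<region>.amazonaws.com/<cognito-pool-id>
# App client ID of the microService app client noted earlier
quarkus.oidc.client-id=<microservice-app-client-id>
# Map the access token's scope claim to roles so @RolesAllowed can match the custom scopes
quarkus.oidc.roles.role-claim-path=scope
EOF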
The Kotlin code sample that follows verifies the authenticated principal by using the @Authenticated annotation filter, which performs JSON Web Key Set (JWKS) token validation. The JWKS details are cached, so validation adds only nominal latency to the application.
The access token claims are filtered automatically by the @RolesAllowed annotation against the read and write custom scopes. The protected methods illustrate a microservice API and show how each method can be secured with one or two lines of code.
import io.quarkus.security.Authenticated
import javax.annotation.security.RolesAllowed
import javax.enterprise.context.RequestScoped
import javax.ws.rs.*

// Require an authenticated caller (validated JWT) for every method in this resource
@Authenticated
@RequestScoped
@Path("/api/demo")
class DemoResource {

    // Read method: requires the demo.read custom scope in the access token
    @GET
    @Path("protectedRole/{name}")
    @RolesAllowed("https://gateway.example.com/demo.read")
    fun protectedRole(@PathParam(value = "name") name: String) =
        mapOf("protectedAPI" to "true", "paramName" to name)

    // Write method: requires the demo.write custom scope in the access token
    @POST
    @Path("protectedUpload")
    @RolesAllowed("https://gateway.example.com/demo.write")
    fun protectedDataUpload(values: Map<String, String>) = "Received: $values"
}
Unit test the access token claims
For the unit tests, you will test three scenarios: unauthorized, forbidden, and ok. The @TestSecurity annotation from the Quarkus test security library injects an access token with the specified role claim. Including access token security in a unit test requires only one line of code, the @TestSecurity annotation, which is a strong reason to include access token validation up front in your development. The unit test code maps to the protectedRole method of the microservice via the URI /api/demo/protectedRole, with an additional path parameter, sample-username, that the method returns for confirmation.
Deploying the microservice to Amazon EKS is the same as deploying to any upstream Kubernetes-compliant installation: you declare your application resources in a manifest file, and you deploy a container image of your application to your container registry. You can do this in a similarly low-code manner with the Quarkus Kubernetes extension, which automatically generates the Kubernetes Deployment and Service resources at build time. The Quarkus Container Image Jib extension automatically builds the container image and pushes it to Amazon Elastic Container Registry (Amazon ECR), without the need for a Dockerfile.
Amazon ECR setup
Your microservice container image, created during the build process, will be published to Amazon ECR in the same Region as the target Amazon EKS cluster. Container images are stored in a repository in Amazon ECR, and the following example uses a repository naming convention of project name plus microservice name. The first command that follows creates the Amazon ECR repository that hosts the microservice container image, and the second command obtains login credentials to publish the container image to Amazon ECR.
To set up the application for Amazon ECR integration
In the AWS CLI, create an Amazon ECR repository by using the following command. Replace the project name variable with your parent project name, and the microservice name variable with your microservice name. Both commands are shown together in the sketch that follows these steps.
Obtain an ECR authorization token, by using your IAM principal with the following command. Replace the variables with your values for the AWS account ID and Region.
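The original commands are not reproduced here. The following is a sketch of both steps, together with the standard Quarkus container-image properties and a rebuild that pushes the image; the repository name, account ID, and Region are placeholders for your own values.

# 1. Create the Amazon ECR repository (convention: [project-name]/[microservice-name])
aws ecr create-repository --repository-name <project-name>/<microservice-name> --region <region>

# 2. Obtain an authorization token and log in to your private registry
aws ecr get-login-password --region <region> | \
  docker login --username AWS --password-stdin <aws-account-id>.dkr.ecr.<region>.amazonaws.com

# Point the Quarkus Container Image Jib extension at the registry and enable build/push
cat <<'EOF' >> src/main/resources/application.properties
quarkus.container-image.registry=<aws-account-id>.dkr.ecr.<region>.amazonaws.com
quarkus.container-image.group=<project-name>
quarkus.container-image.name=<microservice-name>
quarkus.container-image.build=true
quarkus.container-image.push=true
EOF

# Rebuild the application; Jib builds and pushes the image without a Dockerfile
./mvnw clean package    # or ./gradlew build, if you chose Gradle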
After rebuilding the application, you should have a container image published to Amazon ECR in your Region with the name [project-group]/[project-name]. The Quarkus build will report an error if the push to Amazon ECR failed.
Now, you can deploy your application to Amazon EKS, with kubectl from the following build path:
kubectl apply -f build/kubernetes/kubernetes.yml
Integration tests for local and remote deployments
The following steps assume a Unix shell: macOS, Linux, or Windows Subsystem for Linux (WSL 2).
How to obtain the access token from the token endpoint
Obtain the access token for the application client by using the Amazon Cognito OAuth 2.0 token endpoint, and export it as an environment variable for re-use. Replace the variables with your Amazon Cognito pool name and AWS Region respectively.
To generate the client credentials in the required format, you need the Base64 representation of the app client client-id:client-secret. Export the following environment variables to avoid hard-coding the values in configuration or scripts.
You can use curl to post to the token endpoint and obtain an access token for the read and write app clients respectively. Pass grant_type=client_credentials and the custom scopes as appropriate; if you pass an incorrect scope, you will receive an invalid_grant error. The Unix jq tool extracts the access token from the JSON response. If you don’t have jq installed, you can install it with your package manager, for example sudo apt-get install jq, sudo yum install jq, or brew install jq.
The following shell commands obtain the access token associated with the read or write scope. The client credentials are used to authorize the generation of the access token. An environment variable stores the read or write access token for future use. Update the scope URL to your host, in this case gateway.example.com.
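The original commands are not reproduced here. The following sketch assumes that you have configured an Amazon Cognito domain prefix for the user pool (which provides the oauth2/token endpoint); the domain prefix, Region, and client credentials are placeholders for your own values.

# Base64-encode each app client's client_id:client_secret (no trailing newline)
export read_client_credentials=$(echo -n '<readClient-id>:<readClient-secret>' | base64 | tr -d '\n')
export write_client_credentials=$(echo -n '<writeClient-id>:<writeClient-secret>' | base64 | tr -d '\n')

# Amazon Cognito OAuth 2.0 token endpoint for your user pool domain
export token_endpoint=https://<your-cognito-domain>.auth.<region>.amazoncognito.com/oauth2/token

# Request an access token with the demo.read custom scope
export access_token_read=$(curl -s -X POST "$token_endpoint" \
  -H "Authorization: Basic $read_client_credentials" \
  --data-urlencode "grant_type=client_credentials" \
  --data-urlencode "scope=https://gateway.example.com/demo.read" | jq -r '.access_token')

# Request an access token with the demo.write custom scope
export access_token_write=$(curl -s -X POST "$token_endpoint" \
  -H "Authorization: Basic $write_client_credentials" \
  --data-urlencode "grant_type=client_credentials" \
  --data-urlencode "scope=https://gateway.example.com/demo.write" | jq -r '.access_token')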
If the curl commands are successful, you should see the access tokens in the environment variables by using the following echo commands:
echo $access_token_read
echo $access_token_write
For more information or troubleshooting, see TOKEN Endpoint in the Amazon Cognito Developer Guide.
Test scope with automation script
Now that you have saved the read and write access tokens, you can test the API. The endpoint can be local or on a remote cluster; the process is the same, and all that changes is the target URL. The simplicity of toggling the target URL between local and remote is one of the reasons why access token security can be integrated into the full development lifecycle.
To perform integration tests in bulk, use a shell script that validates the response code. The example script that follows validates the API call under three test scenarios, the same as the unit tests:
If no valid access token is passed: 401 (unauthorized) response is expected.
A valid access token is passed, but with an incorrect role claim: 403 (forbidden) response is expected.
A valid access token and valid role-claim is passed: 200 (ok) response with content-type of application/json expected.
Name the following script demo-api.sh. For each API method in the microservice, you duplicate these three tests; for brevity, this post shows only one API method, protectedRole.
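The original script is not reproduced here. The following is a sketch with the behavior described above: it defaults to the local host, reads the access tokens from the environment variables exported earlier, prints one line per test, and exits with a non-zero code on failure. The exact test wording and helper names are assumptions.

#!/bin/bash
# demo-api.sh - integration tests for the protectedRole API method
# Usage: ./demo-api.sh [host-url]   (default: http://localhost:8080)
set -e

HOST="${1:-http://localhost:8080}"
URI="/api/demo/protectedRole/sample-username"

# Test 1: no access token, expect 401 (unauthorized)
echo "Test 401-unauthorized: No access token for /api/demo/protectedRole/{name}"
status=$(curl -s -o /dev/null -w '%{http_code}' "$HOST$URI")
[ "$status" = "401" ] || { echo "FAILED (status $status)"; exit 1; }

# Test 2: valid token but wrong role claim (write token on a read method), expect 403 (forbidden)
echo "Test 403-forbidden: Incorrect role/custom-scope for /api/demo/protectedRole/{name}"
status=$(curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $access_token_write" "$HOST$URI")
[ "$status" = "403" ] || { echo "FAILED (status $status)"; exit 1; }

# Test 3: valid token and correct role claim, expect 200 (ok) with application/json content
echo "Test 200-ok: Correct role for /api/demo/protectedRole/{name}"
output=$(curl -s -H "Authorization: Bearer $access_token_read" -w '!%{http_code}!%{content_type}' "$HOST$URI")
echo "Output: $output"
case "$output" in
  *'!200!application/json'*) ;;
  *) echo "FAILED"; exit 1 ;;
esac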
Test the microservice API against the access token claims
Run the script for a local host deployment on http://localhost:8080, and on the remote EKS cluster, in this case https://gateway.example.com.
If everything works as expected, you will have demonstrated the same test process for local and remote deployments of your microservice. Another advantage of creating a security test automation process like this one is that you can include it in your continuous integration/continuous delivery (CI/CD) test automation.
The test automation script accepts the microservice host URL as a parameter (the default is local), referencing the stored access tokens from the environment variables. Upon error, the script will exit with the error code. To test the remote EKS cluster, use the following command, with your host URL, in this case gateway.example.com.
./demo-api.sh https://<gateway.example.com>
Expected output:
Test 401-unauthorized: No access token for /api/demo/protectedRole/{name}
Test 403-forbidden: Incorrect role/custom-scope for /api/demo/protectedRole/{name}
Test 200-ok: Correct role for /api/demo/protectedRole/{name}
Output: {"protectedAPI":"true","paramName":"sample-username"}!200!application/json
Best practices for a well-architected production service-to-service client
For elevated security in alignment with the AWS Well-Architected Framework, it is recommended to use AWS Secrets Manager to hold the client credentials. Separating your credentials from the application permits credential rotation without the need to release a new version of the application or modify environment variables used by the service. Access to secrets must be tightly controlled, because the secrets contain extremely sensitive information. Secrets Manager uses AWS Identity and Access Management (IAM) to secure access to the secrets: with IAM permissions policies, you can control which users or services have access to your secrets. Secrets Manager uses envelope encryption with AWS KMS customer master keys (CMKs) and data keys to protect each secret value. When you create a secret, you can choose any symmetric customer managed CMK in the AWS account and Region, or you can use the AWS managed CMK for Secrets Manager (aws/secretsmanager).
Access tokens can be configured on Amazon Cognito to expire in as little as 5 minutes or as long as 24 hours. To avoid unnecessary calls to the token endpoint, the application client should cache the access token and refresh close to expiry. In the Quarkus framework used for the microservice, this can be automatically performed for a client service by adding the quarkus-oidc-client extension to the application.
Cleaning up
To avoid incurring future charges, delete all the resources created.
To delete the EKS cluster, and the associated Application Load Balancer, follow the steps described in Deleting a cluster in the Amazon EKS User Guide.
This post has focused on the last line of defense, the microservice, and the importance of a layered security approach throughout the development lifecycle. Access token security should be validated both at the application gateway and microservice for end-to-end API protection.
As an additional layer of security at the application gateway, consider using Amazon API Gateway and its built-in JWT authorizer to perform the same access token validation for public-facing APIs. For more advanced business-to-business solutions, Amazon API Gateway also provides integrated mutual TLS authentication.
In this blog post, we show you how to set up end-to-end encryption on Amazon Elastic Kubernetes Service (Amazon EKS) with AWS Certificate Manager Private Certificate Authority (ACM Private CA). In this example of end-to-end encryption, traffic originates from your client and terminates at the NGINX ingress controller that serves a sample app. By following the instructions in this post, you can set up an NGINX ingress controller on Amazon EKS. As part of the example, we show you how to configure an AWS Network Load Balancer (NLB) in front of the ingress controller, with HTTPS served using certificates issued via ACM Private CA.
AWS Private CA supports an open source plugin for cert-manager that offers a more secure certificate authority solution for Kubernetes containers. cert-manager is a widely-adopted solution for TLS certificate management in Kubernetes. Customers who use cert-manager for application certificate lifecycle management can now use this solution to improve security over the default cert-manager CA, which stores keys in plaintext in server memory. Customers with regulatory requirements for controlling access to and auditing their CA operations can use this solution to improve auditability and support compliance.
Solution components
Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
Amazon EKS is a managed service that you can use to run Kubernetes on Amazon Web Services (AWS) without needing to install, operate, and maintain your own Kubernetes control plane or nodes.
cert-manager is an add-on to Kubernetes that provides TLS certificate management. cert-manager requests certificates, distributes them to Kubernetes containers, and automates certificate renewal. cert-manager makes sure that certificates are valid and up to date, and attempts to renew certificates at an appropriate time before they expire.
ACM Private CA enables the creation of private CA hierarchies, including root and subordinate CAs, without the investment and maintenance costs of operating an on-premises CA. With ACM Private CA, you can issue certificates for authenticating internal users, computers, applications, services, servers, and other devices, and for signing computer code. The private keys for private CAs are stored in AWS managed hardware security modules (HSMs), which are FIPS 140-2 certified, providing a better security profile compared to the default CAs in Kubernetes. Private certificates help identify and secure communication between connected resources on private networks such as servers, mobile and IoT devices, and applications.
AWS Private CA Issuer plugin: Kubernetes containers and applications use digital certificates to provide secure authentication and encryption over TLS. With this plugin, cert-manager requests TLS certificates from ACM Private CA. The integration supports certificate automation for TLS in a range of configurations, including at the ingress, on the pod, and mutual TLS between pods. You can use the AWS Private CA Issuer plugin with Amazon Elastic Kubernetes Service, self-managed Kubernetes on AWS, and Kubernetes on-premises.
The solution also uses the elastic load balancers that are provisioned for the cluster: an AWS Application Load Balancer (ALB) when you create a Kubernetes Ingress, and an AWS Network Load Balancer (NLB) when you create a Kubernetes Service of type LoadBalancer.
Different points for terminating TLS in Kubernetes
How and where you terminate your TLS connection depends on your use case, security policies, and need to comply with regulatory requirements. This section talks about four different use cases that are regularly used for terminating TLS. The use cases are illustrated in Figure 1 and described in the text that follows.
Figure 1: Terminating TLS at different points
At the load balancer: The most common use case for terminating TLS at the load balancer level is to use publicly trusted certificates. This use case is simple to deploy and the certificate is bound to the load balancer itself. For example, you can use ACM to issue a public certificate and bind it with AWS NLB. You can learn more from How do I terminate HTTPS traffic on Amazon EKS workloads with ACM?
At the ingress: If there is no strict requirement for end-to-end encryption, you can offload this processing to the ingress controller or the NLB. This helps you to optimize the performance of your workloads and make them easier to configure and manage. We examine this use case in this blog post.
On the pod: In Kubernetes, a pod is the smallest deployable unit of computing and it encapsulates one or more applications. End-to-end encryption of the traffic from the client all the way to a Kubernetes pod provides a secure communication model where the TLS is terminated at the pod inside the Kubernetes cluster. This could be useful for meeting certain security requirements. You can learn more from the blog post Setting up end-to-end TLS encryption on Amazon EKS with the new AWS Load Balancer Controller.
Mutual TLS between pods: This use case focuses on encryption in transit for data flowing inside the Kubernetes cluster. For more details on how this can be achieved with cert-manager using an Istio service mesh, see the Securing Istio workloads with mTLS using cert-manager blog post. You can use the AWS Private CA Issuer plugin in conjunction with cert-manager to use ACM Private CA to issue certificates for securing communication between the pods.
In this blog post, we use a scenario where there is a requirement to terminate TLS at the ingress controller level, demonstrating the second example above.
Figure 2 provides an overall picture of the solution described in this blog post. The components and steps illustrated in Figure 2 are described fully in the sections that follow.
Verify that you have the latest versions of these tools installed before you begin.
Provision an Amazon EKS cluster
If you already have a running Amazon EKS cluster, you can skip this step and move on to install NGINX Ingress.
You can use the AWS Management Console or AWS CLI, but this example uses eksctl to provision the cluster. eksctl is a tool that makes it easier to deploy and manage an Amazon EKS cluster.
This example uses the us-east-2 Region and the t2 node type. Select the node type and Region that are appropriate for your environment. Cluster provisioning takes approximately 15 minutes.
To provision an Amazon EKS cluster
Run the following eksctl command to create an Amazon EKS cluster in the us-east-2 Region with Kubernetes version 1.19 and two nodes. You can change the Region to the one that best fits your use case.
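The full command is not reproduced here. A sketch that is consistent with the cluster and node group names used in the cleanup section (acm-pca-lab and acm-pca-nlb-lab-workers), the stated Kubernetes version, Region, and node count follows; the t2.medium node type is an assumption based on the T2 family mentioned above.

# Create a two-node Amazon EKS cluster running Kubernetes 1.19 in us-east-2
eksctl create cluster \
  --name acm-pca-lab \
  --version 1.19 \
  --region us-east-2 \
  --nodegroup-name acm-pca-nlb-lab-workers \
  --node-type t2.medium \
  --nodes 2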
Run the following command to determine the address that AWS has assigned to your NLB:
$ kubectl get service -n ingress-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.100.214.10 a3ebe22e7ca0522d1123456fbc92605c-8ac7f1d49be2fc42.elb.us-east-2.amazonaws.com 80:32598/TCP,443:30624/TCP 14s
ingress-nginx-controller-admission ClusterIP 10.100.118.1 <none> 443/TCP 14s
It can take up to 5 minutes for the load balancer to be ready. Once the external IP is created, run the following command to verify that traffic is being correctly routed to ingress-nginx:
curl http://a3ebe22e7ca0522d1123456fbc92605c-8ac7f1d49be2fc42.elb.us-east-2.amazonaws.com
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
Note: Even though curl is returning an HTTP 404 error code in this case, it is still reaching the ingress controller and getting the expected HTTP response back.
Configure your DNS records
Once your load balancer is provisioned, the next step is to point the application’s DNS record to the URL of the NLB.
You can use your DNS provider’s console, for example Amazon Route 53, and set a CNAME record pointing to your NLB (a Route 53 CLI sketch follows below). See CNAME record type for more details on how to set up a CNAME record using Route 53.
This scenario uses the sample domain rsa-2048.example.com.
As you go through the scenario, replace rsa-2048.example.com with your registered domain.
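For example, with Amazon Route 53 you can upsert the CNAME record from the CLI. This is a sketch only; the hosted zone ID and the NLB DNS name are placeholders for your own values.

# Point rsa-2048.example.com at the NLB created for the ingress-nginx controller
aws route53 change-resource-record-sets \
  --hosted-zone-id <your-hosted-zone-id> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "rsa-2048.example.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "<your-nlb-dns-name>.elb.us-east-2.amazonaws.com" }]
      }
    }]
  }'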
Install cert-manager
cert-manager is a Kubernetes add-on that you can use to automate the management and issuance of TLS certificates from various issuing sources. It runs within your Kubernetes cluster and will ensure that certificates are valid and attempt to renew certificates at an appropriate time before they expire.
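The installation command is not shown here. A typical installation applies the upstream manifest for a specific release; the version below is a placeholder, so check the cert-manager documentation for the release you want to use.

# Install the cert-manager CRDs and components into the cert-manager namespace
kubectl apply --validate=false \
  -f https://github.com/cert-manager/cert-manager/releases/download/v<version>/cert-manager.yaml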
After you’ve deployed cert-manager, you can verify the installation by following the verification steps in the cert-manager documentation. If all of the steps have completed without error, you’re good to go!
Note: If you’re planning to use Amazon EKS with Kubernetes pods running on AWS Fargate, please follow the cert-manager Fargate instructions to make sure cert-manager installation works as expected. AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers.
Install aws-privateca-issuer
The AWS Private CA Issuer plugin acts as an add-on to cert-manager (configured as an external issuer) that signs certificate requests using ACM Private CA.
To install aws-privateca-issuer
For installation, use the following helm commands:
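The exact commands are not reproduced here. The following sketch assumes the plugin’s public Helm repository and chart name (awspca/aws-pca-issuer); confirm both against the aws-privateca-issuer project documentation before running.

# Add the AWS Private CA Issuer Helm repository and install the chart into its own namespace
helm repo add awspca https://cert-manager.github.io/aws-privateca-issuer
helm repo update
helm install awspca/aws-pca-issuer \
  --generate-name \
  --namespace aws-pca-issuer \
  --create-namespace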
Verify that the AWS Private CA Issuer is configured correctly by running the following command, and confirm that the pod is READY with a status of Running:
$ kubectl get pods --namespace aws-pca-issuer
NAME READY STATUS RESTARTS AGE
aws-pca-issuer-1622570742-56474c464b-j6k8s 1/1 Running 0 21s
You can check the chart configuration in the default values file.
Create an ACM Private CA
In this scenario, you create a private certificate authority in ACM Private CA with RSA 2048 selected as the key algorithm. You can create a CA using the AWS console, AWS CLI, or AWS CloudFormation.
To create an ACM Private CA
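The console steps are not reproduced here. As a hedged CLI sketch, you can create the CA with the following command (the subject common name is an example). Note that a ROOT CA must still be activated by installing its CA certificate, which you can do from the ACM Private CA console or with the get-certificate-authority-csr, issue-certificate, and import-certificate-authority-certificate commands.

# Create a root CA with an RSA 2048 key (the command returns the CertificateAuthorityArn)
aws acm-pca create-certificate-authority \
  --certificate-authority-type ROOT \
  --certificate-authority-configuration '{
    "KeyAlgorithm": "RSA_2048",
    "SigningAlgorithm": "SHA256WITHRSA",
    "Subject": { "CommonName": "example.com" }
  }' \
  --region us-east-2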
Download the CA certificate using the following command. Replace the <CA_ARN> and <Region> values with the values from the CA you created earlier and save it to a file named cacert.pem:
aws acm-pca get-certificate-authority-certificate --certificate-authority-arn <CA_ARN> --region <Region> --output text > cacert.pem
Once your private CA is active, you can proceed to the next step. Your private CA will look similar to the CA in Figure 3.
Figure 3: Sample ACM Private CA
Set EKS node permission for ACM Private CA
In order to issue a certificate from ACM Private CA, add the IAM policy from the prerequisites to your EKS NodeInstanceRole. Replace the <CA_ARN> value with the value from the CA you created earlier:
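The policy from the prerequisites is not reproduced here. A sketch of the minimum ACM Private CA permissions the issuer plugin needs, attached as an inline policy, might look like the following; the role name and policy name are placeholders.

# Inline policy granting the node role permission to issue and retrieve certificates from the CA
cat <<'EOF' > pca-issuer-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm-pca:DescribeCertificateAuthority",
        "acm-pca:GetCertificate",
        "acm-pca:IssueCertificate"
      ],
      "Resource": "<CA_ARN>"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name <NodeInstanceRole-name> \
  --policy-name AWSPCAIssuerPolicy \
  --policy-document file://pca-issuer-policy.json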
Now that the ACM Private CA is active, you can begin requesting private certificates that can be used by Kubernetes applications. Use the aws-privateca-issuer plugin to create a ClusterIssuer, which will be used with ACM Private CA to issue certificates.
Issuers (and ClusterIssuers) represent a certificate authority from which signed x509 certificates can be obtained, such as ACM Private CA. You need at least one Issuer or ClusterIssuer before you can start requesting certificates in your cluster. There are two custom resources that can be used to create an Issuer inside Kubernetes using the aws-privateca-issuer add-on:
AWSPCAIssuer is a regular namespaced issuer that can be used as a reference in your Certificate custom resources.
AWSPCAClusterIssuer is specified in exactly the same way, but it doesn’t belong to a single namespace and can be referenced by certificate resources from multiple different namespaces.
To create an Issuer in Amazon EKS
For this scenario, you create an AWSPCAClusterIssuer. Start by creating a file named cluster-issuer.yaml and save the following text in it, replacing <CA_ARN> and <Region> information with your own.
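The manifest is not reproduced here. The following is a sketch, assuming the AWSPCAClusterIssuer custom resource exposed by the plugin (API group awspca.cert-manager.io) and the issuer name shown in the verification output below.

# cluster-issuer.yaml - cluster-wide issuer backed by your ACM Private CA
cat <<'EOF' > cluster-issuer.yaml
apiVersion: awspca.cert-manager.io/v1beta1
kind: AWSPCAClusterIssuer
metadata:
  name: demo-test-root-ca
spec:
  arn: <CA_ARN>
  region: <Region>
EOF

kubectl apply -f cluster-issuer.yaml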
Verify the installation and make sure that the following command returns a Kubernetes service of kind AWSPCAClusterIssuer:
$ kubectl get AWSPCAClusterIssuer
NAME AGE
demo-test-root-ca 51s
Request the certificate
Now, you can begin requesting certificates from the provisioned issuer for use by Kubernetes applications. For more details on how to specify and request Certificate resources, see the cert-manager Certificate resources guide.
To request the certificate
As a first step, create a new namespace that contains your application and secret:
$ kubectl create namespace acm-pca-lab-demo
namespace/acm-pca-lab-demo created
Next, create a basic X509 private certificate for your domain. Create a file named rsa-2048.yaml and save the following text in it. Replace rsa-2048.example.com with your domain.
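The manifest is not reproduced here. The following sketch assumes the standard cert-manager Certificate resource and a TLS secret name of rsa-example-cert-2048 (an assumption that is reused in the Ingress sketch later in this post).

# rsa-2048.yaml - request a private certificate for the demo domain from the cluster issuer
cat <<'EOF' > rsa-2048.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: rsa-cert-2048
  namespace: acm-pca-lab-demo
spec:
  commonName: rsa-2048.example.com
  dnsNames:
    - rsa-2048.example.com
  duration: 2160h0m0s
  renewBefore: 360h0m0s
  secretName: rsa-example-cert-2048
  usages:
    - server auth
    - client auth
  privateKey:
    algorithm: "RSA"
    size: 2048
  issuerRef:
    group: awspca.cert-manager.io
    kind: AWSPCAClusterIssuer
    name: demo-test-root-ca
EOF

kubectl apply -f rsa-2048.yaml

# Confirm that the certificate has been issued and the TLS secret created
kubectl get certificate -n acm-pca-lab-demo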
For the purpose of this scenario, you can create a new service: a simple “hello world” website that uses echoheaders, which responds with the HTTP request headers along with some cluster details.
To deploy a demo application
Create a new file named hello-world.yaml with the following content:
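The original manifest is not reproduced in this excerpt. The following sketch assumes the public echoserver image (which produces output like that shown below) and an NGINX Ingress that terminates TLS with the rsa-example-cert-2048 secret created in the previous step; after applying it, call the service over HTTPS while trusting the private CA certificate you downloaded earlier.

cat <<'EOF' > hello-world.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
  namespace: acm-pca-lab-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
        - name: echoheaders
          image: k8s.gcr.io/echoserver:1.10
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: hello-world
  namespace: acm-pca-lab-demo
spec:
  selector:
    app: hello-world
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world
  namespace: acm-pca-lab-demo
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  tls:
    - hosts:
        - rsa-2048.example.com
      secretName: rsa-example-cert-2048
  rules:
    - host: rsa-2048.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hello-world
                port:
                  number: 80
EOF

kubectl apply -f hello-world.yaml

# Call the service over HTTPS, trusting the private CA certificate downloaded earlier
curl https://rsa-2048.example.com --cacert cacert.pem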
You should see an output similar to the following:
Hostname: hello-world-d8fbd49c6-9bczb
Pod Information:
-no pod information available-
Server values:
server_version=nginx: 1.13.3 - lua: 10008
Request Information:
client_address=192.162.32.64
method=GET
real path=/
query=
request_version=1.1
request_scheme=http
request_uri=http://rsa-2048.example.com:8080/
Request Headers:
accept=*/*
host=rsa-2048.example.com
user-agent=curl/7.62.0
x-forwarded-for=52.94.2.2
x-forwarded-host=rsa-2048.example.com
x-forwarded-port=443
x-forwarded-proto=https
x-real-ip=52.94.2.2
x-request-id=371b6fc15a45d189430281693cccb76e
x-scheme=https
Request Body:
-no body in request-…
This response is returned from the service running behind the Kubernetes Ingress controller and demonstrates that a successful TLS handshake happened at port 443 with https protocol.
You can use the following command to verify that the certificate issued earlier is being used for the SSL handshake:
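The command is not reproduced here; a sketch using openssl, which prints the subject and issuer of the certificate presented during the TLS handshake, is:

# Inspect the certificate presented on port 443 and check that it was issued by your private CA
openssl s_client -connect rsa-2048.example.com:443 -servername rsa-2048.example.com -showcerts </dev/null \
  | openssl x509 -noout -subject -issuer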
You can also clean up from your AWS CloudFormation console by deleting the stacks named eksctl-acm-pca-lab-nodegroup-acm-pca-nlb-lab-workers and eksctl-acm-pca-lab-cluster.