Tag Archives: Amazon Elastic Container Service

Deploying an ASP.NET Core web application to Amazon ECS using an Azure DevOps pipeline

Post Syndicated from John Formento original https://aws.amazon.com/blogs/devops/deploying-a-asp-net-core-web-application-to-amazon-ecs-using-an-azure-devops-pipeline/

For .NET developers, leveraging Team Foundation Server (TFS) has been the cornerstone for CI/CD over the years. As more and more .NET developers start to deploy onto AWS, they have been asking questions about using the same tools to deploy to the AWS cloud. By configuring a pipeline in Azure DevOps to deploy to the AWS cloud, you can easily use familiar Microsoft development tools to build great applications.

Solution overview

This blog post demonstrates how to create a simple Azure DevOps project, repository, and pipeline to deploy an ASP.NET Core web application to Amazon ECS using Azure DevOps. The following diagram shows the high-level architecture of the pipeline:

 

Solution Architecture Diagram

In this example, you perform the following steps:

  1. Create an Azure DevOps Project, clone project repo, and push ASP.NET Core web application.
  2. Create a pipeline in Azure DevOps.
  3. Build an Amazon ECS Cluster, Task, and Service.
  4. Kick off deployment of the ASP.NET Core web application using the newly created Azure DevOps pipeline.

 

Prerequisites

Ensure you have the following prerequisites set up:

  • An Amazon ECR repository
  • An IAM user with permissions for Amazon ECR and Amazon ECS (the user will need an access key and secret access key)

 

Create an Azure DevOps Project, clone project repo, and push ASP.NET Core web application

Follow these steps to deploy a .NET Core app onto your Amazon ECS cluster using the Azure DevOps (ADO) repository and pipeline:

 

  1. Log in to dev.azure.com and navigate to the marketplace.
  2. Go to Visual Studio, search for “AWS”, and add the AWS Tools for Microsoft Visual Studio Team Services.
  3. Create a project in ADO: Provide a project name and choose Create.
  4. On the Project Summary page, choose Project Settings.
  5. In the Project Settings pane, navigate to the Service Connections page.
  6. Choose Create service connection, select AWS, and choose Next.
  7. Input an Access Key ID and Secret Access Key. (You’ll need an IAM user with permissions for Amazon ECR and Amazon ECS in order to deploy via the Azure DevOps pipeline.) Choose Save.
  8. Choose Repos in the left pane, then Clone in Visual Studio under Clone to your computer.
  9. Create an ASP.NET Core web application in Visual Studio, set the location to the locally cloned repository, and check Enable Docker support.
  10. Once you’ve created the new project, perform an initial commit and push to the repository in Azure DevOps.

 

Creating a pipeline in Azure DevOps

Now that you have synced the repository, create a pipeline in Azure DevOps.

  1. Go to the pipeline page within Azure DevOps and choose Create Pipeline.
  2. Choose Use the classic editor.
  3. Select Azure Repos Git for the location of your code and select the repository you created earlier.
  4. On the Choose a Template page, select Docker Container and choose Apply.
  5. Remove the Push an image step.
  6. Add an Amazon ECR Push task by choosing the + symbol next to Agent job 1. You can search for “AWS” in the Add tasks pane to filter for all AWS tasks.

 

Now, configure each task:

  1. Choose the Build an image task and ensure that the action is set to Build an image. Additionally, you can modify the Image Name to your standards.
  2. Choose the Push Image task and provide the following:
    • Enter a name under Display Name.
    • Select the AWS Credentials that you created in Service Connections.
    • Select the AWS Region.
    • Provide the source image name, which you can find in the settings for the Build an image task.
    • Enter the name of the repository in Amazon ECR to which the image is pushed.
  3. Choose Save and queue.

Build Amazon ECS Cluster, Task, and Service

The goal at this point is to test the pipeline up to building the Docker image and pushing it to Amazon ECR. Once the Docker image is in Amazon ECR, you can create the Amazon ECS cluster, task definition, and service that use the newly created Docker image (a minimal boto3 sketch of these three steps follows the list below).

  1. Create an Amazon ECS cluster.
  2. Create an Amazon ECS task definition. When you create the task definition and configure the container, use the Amazon ECR URI for the Docker image that was just pushed to Amazon ECR.
  3. Create an Amazon ECS service.
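
If you prefer to script these three steps rather than use the console, the following boto3 sketch shows one way to do it. It assumes a Fargate launch type, and the cluster name, image URI, execution role, subnet, and security group values are placeholders you would replace with your own.

import boto3

ecs = boto3.client("ecs")

# 1. Create the cluster.
ecs.create_cluster(clusterName="sample-cluster")

# 2. Register a task definition that points at the image pushed to Amazon ECR.
task_def = ecs.register_task_definition(
    family="sample-webapp",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[{
        "name": "webapp",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sample-webapp:latest",  # your ECR URI
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "essential": True,
    }],
)

# 3. Create the service from that task definition.
ecs.create_service(
    cluster="sample-cluster",
    serviceName="sample-webapp-service",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    desiredCount=1,
    launchType="FARGATE",
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-0123456789abcdef0"],     # placeholder
        "securityGroups": ["sg-0123456789abcdef0"],  # placeholder
        "assignPublicIp": "ENABLED",
    }},
)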

Go back and edit the pipeline:

  1. Add the last step by choosing the + symbol next to Agent job 1.
  2. Search for “AWS CLI” in the search bar and add the task.
  3. Choose AWS CLI and configure the task.
  4. Enter a name under Display Name, such as Update ECS Service.
  5. Select the AWS Credentials that you created in Service Connections.
  6. Select the AWS Region.
  7. Input the following command, which updates the Amazon ECS service after a new image is pushed to Amazon ECR. Replace <clustername> and <servicename> with your Amazon ECS cluster and service names. (An equivalent boto3 call is shown after this list.)
    • Command: ecs
    • Subcommand: update-service
    • Options and parameters: --cluster <clustername> --service <servicename> --force-new-deployment
  8. Now choose the Triggers tab and select Enable continuous integration with the repository you created.
  9. Choose Save and queue.
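
For reference, the same deployment step can also be scripted with boto3 outside the pipeline. The following is only a sketch, with the same placeholder cluster and service names as the AWS CLI task above.

import boto3

# Force a new deployment so the service picks up the image that was just pushed to Amazon ECR.
ecs = boto3.client("ecs")
ecs.update_service(
    cluster="<clustername>",
    service="<servicename>",
    forceNewDeployment=True,
)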

 

At this point, your build pipeline kicks off and builds a Docker image from the source code in the repository you created, pushes the image to Amazon ECR, and updates the Amazon ECS service with the new image.

You can verify this by viewing the build: choose Pipelines in Azure DevOps, select the entry for the latest run, and then choose the icon under the status column. Once it completes successfully, you can log in to the AWS console and view the updated image in Amazon ECR and the updated service in Amazon ECS.

Every time you commit and push your code through Visual Studio, this pipeline kicks off and builds and deploys your application to Amazon ECS.

Cleanup

At the end of this example, once you’ve completed all steps and are finished testing, follow these steps to disable or delete resources to avoid incurring costs:

  1. Go to the Amazon ECS console within the AWS Console.
  2. Navigate to the cluster you created, then choose the Tasks tab.
  3. Choose Stop all to turn off the tasks.

Conclusion

This blog post reviewed how to create a CI/CD pipeline in Azure DevOps that deploys a Docker image to Amazon ECR and a container to Amazon ECS. It provided detailed steps to set up a basic CI/CD pipeline, leveraging tools with which .NET developers are familiar, along with the steps needed to integrate with Amazon ECR and Amazon ECS.

I hope this post was informative and has helped you learn the basics of how to integrate Amazon ECR and Amazon ECS with Azure DevOps to create a robust CI/CD pipeline.

About the Authors

John Formento

 

 

John Formento is a Solution Architect at Amazon Web Services. He helps large enterprises achieve their goals by architecting secure and scalable solutions on the AWS Cloud.

Amazon Elastic Container Service now supports Amazon EFS file systems

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/amazon-ecs-supports-efs/

It has only been five years since Jeff wrote on this blog about the launch of the Amazon Elastic Container Service. I remember reading that post and thinking how exotic and unusual containers sounded. Fast forward just five years, and containers are an everyday part of most developers’ lives, but whilst customers are increasingly adopting container orchestrators such as ECS, there are still some types of applications that have been hard to move into this containerized world.

Customers building applications that require data persistence or shared storage have faced a challenge since containers are temporary in nature. As containers are scaled in and out dynamically, any local data is lost as containers are terminated. Today we are changing that for ECS by launching support for Amazon Elastic File System (EFS) file systems. Containers running on both ECS and AWS Fargate will be able to use Amazon Elastic File System (EFS).

This new capability will help customers containerize applications that require shared storage such as content management systems, internal DevOps tools, and machine learning frameworks. A whole new set of workloads will now enjoy the benefits containers bring, enabling customers to speed up their deployment process, optimize their infrastructure utilization, and build more resilient systems.

Amazon Elastic File System (EFS) provides a fully managed, highly available, scalable shared file system, which means that you can store your data separately from your compute. It is also a regional service, storing data within and across three Availability Zones for high availability and durability.

Until now, it was possible to get EFS working with ECS if you were running your containers on a cluster of EC2 instances. However, if you wanted to use AWS Fargate as your container data plane then, prior to this announcement, you couldn’t mount an EFS file system. Fargate does not give you access to the managed instances inside the Fargate fleet, so you are unable to make the modifications to those instances required to set up EFS.

I’m sure many of our customers will be delighted that they now have a way of connecting EFS file systems easily to ECS. Personally, I’m ecstatic that we can use this new feature in combination with Fargate; it will be perfect for a little side project that I am currently building and finally gives us a way of having persistent storage work in combination with serverless containers.

There is a good reason both ECS and EFS include the word Elastic in their name as both of these services can scale up and down as your application requires. EFS scales on demand from zero to petabytes with no disruptions, growing and shrinking automatically as you add and remove files. With ECS there are options to use either Cluster Auto Scaling or Fargate to ensure that your capacity grows and shrinks to meet demand. For you, our customer, this means that you are only ever paying for the storage and compute that you are actually going to use.

So, enough talking, let’s get to the fun bit and see how we can get a containerized application working with Amazon Elastic File System (EFS).

A Simple Shared File System Example
For this example, I have built some basic infrastructure, so I can show you the before and after effect of adding EFS to a Fargate cluster. Firstly I have created a VPC that spans two Availability Zones. Secondly, I have created an ECS cluster. On the ECS Cluster, I plan to run two containers using Fargate, and this means that I don’t have to set up any EC2 instances as my containers will run on the Fargate fleet that is managed by AWS.

To deploy my application, I create a Task Definition that uses a Docker image of an open source application called Cloud Commander, which is a simple drag-and-drop file manager.

In the ECS console, I create a service and use the Task Definition that I created to deploy my application. Once the service is deployed, and the containers are provisioned, I head over to the Application Load Balancer URL which was created as part of my service, and I can see that my application appears to be working. I can drag a file to upload it to the application.

However, there is a problem. If I refresh the page, occasionally the file that I uploaded disappears. This happens because I have two containers running my application, and they are both using their local file systems. As I refresh my browser, the load balancer sends me to one of the two containers, and only one of them stored the file on its local volume.

What I need is a shared file system that both the containers can mount and to which they can both write files.

Next, I create a new file system inside the EFS console. In the wizard, I choose the same VPC that I used when I created my ECS cluster and select all of the availability zones that the VPC spans and ask the service to create mount targets in each one. These mount targets will mean that containers that are in different Availability Zones will still be able to connect to the file system.

I select the defaults for all the other options in the wizard. In Step 3, I click the button to Add an access point. An access point is a way of giving a particular application access to the file system, and it gives me incredibly granular control over what data my application is allowed to access. You can add multiple access points to your EFS file system and give different applications different levels of access to the same file system.

The application I am deploying will handle user uploads for my web site, so I will create an EFS Access Point that gives this application full access to the /uploads directory, but nothing else. To do this, I will create an access point with a new User ID (1000) and Group ID (1000), and a home directory of /uploads. The directory will be created with this user and group as the owner with full permissions, giving read permissions to all other users.
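
For readers who prefer the API to the console wizard, the access point described above could be created with a boto3 call along these lines; the file system ID is a placeholder, and the 755 permission string is my assumption for giving the owner full access and read/execute access to everyone else.

import boto3

efs = boto3.client("efs")

# Create an access point rooted at /uploads, owned by UID/GID 1000.
# The file system ID below is a placeholder.
response = efs.create_access_point(
    FileSystemId="fs-0123456789abcdef0",
    PosixUser={"Uid": 1000, "Gid": 1000},
    RootDirectory={
        "Path": "/uploads",
        "CreationInfo": {
            "OwnerUid": 1000,
            "OwnerGid": 1000,
            "Permissions": "755",  # owner: full access; others: read/execute
        },
    },
)
print(response["AccessPointId"])  # needed later in the task definition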

Security is the number one priority at AWS and the team have worked hard to ensure ECS integrates with EFS to provide multiple layers of security for protecting EFS filesystems from unauthorized access, including IAM role-based access control, VPC security groups, and encryption of data in transit.

After working through the wizard, my file system is created, and I’m given a File system ID and an Access point ID. I will need these IDs to configure the task definitions in Fargate.

I go back to the Task Definition inside my ECS Cluster and create a new revision of the Task Definition. I scroll down to the Volumes section of the definition and click Add volume.

I can then add my EFS file system details: I select the correct File system ID and the correct Access point ID that I created earlier.

I have opted to Enable Encryption in transit, but for this example I have not enabled EFS IAM authorization, which would be helpful in a larger application with many clients requiring different levels of access to different portions of a file system. That feature can simplify management by using IAM authorization; if you want more details, check out the blog we wrote when the feature launched earlier in the year.
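
Scripted with boto3, the relevant part of that task definition revision would look roughly like the following. The file system ID, access point ID, container image, and sizing values are placeholders; the volume block is the part that matters here.

import boto3

ecs = boto3.client("ecs")

# Register a task definition revision whose "uploads" volume is backed by EFS,
# with encryption in transit enabled and IAM authorization left disabled.
ecs.register_task_definition(
    family="cloudcommander",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[{
        "name": "cloudcommander",
        "image": "<your-cloud-commander-image>",  # placeholder image reference
        "essential": True,
        "mountPoints": [{"sourceVolume": "uploads", "containerPath": "/uploads"}],
    }],
    volumes=[{
        "name": "uploads",
        "efsVolumeConfiguration": {
            "fileSystemId": "fs-0123456789abcdef0",         # placeholder
            "transitEncryption": "ENABLED",
            "authorizationConfig": {
                "accessPointId": "fsap-0123456789abcdef0",  # placeholder
                "iam": "DISABLED",
            },
        },
    }],
)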

Now that I have updated my task definition, I can update my ECS service to use this new definition. It’s also essential here to make sure that I set the platform version to 1.4.0.

The service deploys my two new containers and decommissions the old two. The new containers will now be using the shared EFS file system, and so my application now works as expected.

If I upload files and then revisit the application, my files will still be there. If my containers are replaced or are scaled up or down, the file system will persist.

Looking to the Future
I am loving the innovation that has been coming out of the containers teams recently, and looking at their public roadmap, they have some really exciting plans for the future. If you have ideas or feature requests, make sure you add your voice to the many customers that are already guiding their roadmap.

The new feature is available in all regions where ECS and EFS are available and comes at no additional cost. So, go check it out in the AWS console and let us know what you think.

Happy Containerizing

— Martin

 

Identifying and resolving security code vulnerabilities using Snyk in AWS CI/CD Pipeline

Post Syndicated from Jay Yeras original https://aws.amazon.com/blogs/devops/identifying-and-resolving-vulnerabilities-in-your-code/

The majority of companies have embraced open-source software (OSS) at an accelerated rate even when building proprietary applications. Some of the obvious benefits for this shift include transparency, cost, flexibility, and a faster time to market. Snyk’s unique combination of developer-first tooling and best in class security depth enables businesses to easily build security into their continuous development process.

Even for teams building proprietary code, use of open-source packages and libraries is a necessity. In reality, a developer’s own code is often a small core within the app, and the rest is open-source software. While relying on third-party elements has obvious benefits, it also presents numerous complexities. Inadvertently introducing vulnerabilities into your codebase through repositories that are maintained in a distributed fashion and with widely varying levels of security expertise can be common, and opens up applications to effective attacks downstream.

There are three common barriers to truly effective open-source security:

  1. The security task remains in the realm of security and compliance, often perpetuating the siloed structure that DevOps strives to eliminate and slowing down release pace.
  2. Current practice may offer automated scanning of repositories, but the remediation advice it provides is manual and often un-actionable.
  3. The data generated often focuses solely on public sources, without unique and timely insights.

Developer-led application security

This blog post demonstrates techniques to improve your application security posture using Snyk tools to seamlessly integrate within the developer workflow using AWS services such as Amazon ECR, AWS Lambda, AWS CodePipeline, and AWS CodeBuild. Snyk is a SaaS offering that organizations use to find, fix, prevent, and monitor open source dependencies. Snyk is a developer-first platform that can be easily integrated into the Software Development Lifecycle (SDLC). The examples presented in this post enable you to actively scan code checked into source code management, container images, and serverless, creating a highly efficient and effective method of managing the risk inherent to open source dependencies.

Prerequisites

The examples provided in this post assume that you already have an AWS account and that your account has the ability to create new IAM roles and scope other IAM permissions. You can use your integrated development environment (IDE) of choice. The examples reference AWS Cloud9 cloud-based IDE. An AWS Quick Start for Cloud9 is available to quickly deploy to either a new or existing Amazon VPC and offers expandable Amazon EBS volume size.

Sample code and AWS CloudFormation templates are available to simplify provisioning the various services you need to configure this integration. You can fork or clone those resources. You also need a working knowledge of git and how to fork or clone within your source provider to complete these tasks.

cd ~/environment && \ 
git clone https://github.com/aws-samples/aws-modernization-with-snyk.git modernization-workshop 
cd modernization-workshop 
git submodule init 
git submodule update

Configure your CI/CD pipeline

The workflow for this example consists of a continuous integration and continuous delivery pipeline leveraging AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, Amazon ECR, and AWS Fargate, as shown in the following screenshot.

CI/CD Pipeline

For simplicity, AWS CloudFormation templates are available in the sample repo for services.yaml, pipeline.yaml, and ecs-fargate.yaml, which deploy all services necessary for this example.

Launch AWS CloudFormation templates

A detailed step-by-step guide can be found in the self-paced workshop, but if you are familiar with AWS CloudFormation, you can launch the templates in three steps. From your Cloud9 IDE terminal, change directory to the location of the sample templates and complete the following three steps.

1) Launch basic services

aws cloudformation create-stack --stack-name WorkshopServices --template-body file://services.yaml \
--capabilities CAPABILITY_NAMED_IAM
until [[ `aws cloudformation describe-stacks \
--stack-name "WorkshopServices" --query "Stacks[0].[StackStatus]" \
--output text` == "CREATE_COMPLETE" ]]; do echo "The stack is NOT in a state of CREATE_COMPLETE at `date`"; sleep 30; done && echo "The Stack is built at `date` - Please proceed"

2) Launch Fargate:

aws cloudformation create-stack --stack-name WorkshopECS --template-body file://ecs-fargate.yaml \
--capabilities CAPABILITY_NAMED_IAM
until [[ `aws cloudformation describe-stacks \
--stack-name "WorkshopECS" --query "Stacks[0].[StackStatus]" \
--output text` == "CREATE_COMPLETE" ]]; do echo "The stack is NOT in a state of CREATE_COMPLETE at `date`"; sleep 30; done && echo "The Stack is built at `date` - Please proceed"

3) From your Cloud9 IDE terminal, change directory to the location of the sample templates and run the following command:

aws cloudformation create-stack --stack-name WorkshopPipeline --template-body file://pipeline.yaml \
--capabilities CAPABILITY_NAMED_IAM
until [[ `aws cloudformation describe-stacks \
--stack-name "WorkshopPipeline" --query "Stacks[0].[StackStatus]" \
--output text` == "CREATE_COMPLETE" ]]; do echo "The stack is NOT in a state of CREATE_COMPLETE at `date`"; sleep 30; done && echo "The Stack is built at `date` - Please proceed"

Improving your security posture

You need to sign up for a free account with Snyk. You may use your Google, Bitbucket, or GitHub credentials to sign up; Snyk utilizes these services for authentication and does not store your password. Once signed up, navigate to your name and select Account Settings. Under API Token, choose Show to reveal the token, and copy this value. It is unique for each user.

Save your token to the Parameter Store

Run the following command, replacing abc123 with your unique token. This stores the token as a SecureString in AWS Systems Manager Parameter Store.

aws ssm put-parameter --name "snykAuthToken" --value "abc123" --type SecureString

Set up application scanning

Next, you need to insert testing with Snyk after Maven builds the application. The simplest method is to insert commands to download, authorize, and run Snyk after Maven has built the application and its dependency tree.

The sample Dockerfile takes a build argument passed to the docker build command, which contains the token for Snyk, and sets it as an environment variable. Because the token is available as an environment variable, Snyk detects it automatically.

#~~~~~~~SNYK Variable~~~~~~~~~~~~
# Declare the Snyk token as a build-arg
ARG snyk_auth_token
# Set the SNYK_TOKEN environment variable
ENV SNYK_TOKEN=${snyk_auth_token}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download Snyk, and run a test, looking for medium to high severity issues. If the build succeeds, post the results to Snyk for monitoring and reporting. If a new vulnerability is found, you are notified.

# package the application
RUN mvn package -Dmaven.test.skip=true

#~~~~~~~SNYK test~~~~~~~~~~~~
# download, configure and run snyk. Break build if vulns present, post results to `https://snyk.io/`
RUN curl -Lo ./snyk "https://github.com/snyk/snyk/releases/download/v1.210.0/snyk-linux"
RUN chmod -R +x ./snyk
#Auth set through environment variable
RUN ./snyk test --severity-threshold=medium
RUN ./snyk monitor

Set up docker scanning

Later in the build process, a docker image is created. Analyze it for vulnerabilities in buildspec.yml. First, pull the Snyk token snykAuthToken from the parameter store.

env:
  parameter-store:
    SNYK_AUTH_TOKEN: "snykAuthToken"

Next, in the prebuild phase, install Snyk.

phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws --version
      - $(aws ecr get-login --region $AWS_DEFAULT_REGION --no-include-email)
      - REPOSITORY_URI=$(aws ecr describe-repositories --repository-name petstore_frontend --query=repositories[0].repositoryUri --output=text)
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=${COMMIT_HASH:=latest}
      - PWD=$(pwd)
      - PWDUTILS=$(pwd)
      - curl -Lo ./snyk "https://github.com/snyk/snyk/releases/download/v1.210.0/snyk-linux"
      - chmod -R +x ./snyk

Next, in the build phase, pass the token to the docker build command, where it is picked up by the Dockerfile code you set up to test the application.

build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - cd modules/containerize-application
      - docker build --build-arg snyk_auth_token=$SNYK_AUTH_TOKEN -t $REPOSITORY_URI:latest .

You can further extend the build phase to authorize the Snyk instance for testing the Docker image that’s produced. If it passes, you can pass the results to Snyk for monitoring and reporting.

build:
    commands:
      - $PWDUTILS/snyk auth $SNYK_AUTH_TOKEN
      - $PWDUTILS/snyk test --docker $REPOSITORY_URI:latest
      - $PWDUTILS/snyk monitor --docker $REPOSITORY_URI:latest
      - docker tag $REPOSITORY_URI:latest $REPOSITORY_URI:$IMAGE_TAG

For reference, a sample buildspec.yaml configured with Snyk is available in the sample repo. You can either copy this file and overwrite your existing buildspec.yaml or open an editor and replace the contents.

Testing the application

Now that services have been provisioned and Snyk tools have been integrated into your CI/CD pipeline, any new git commit triggers a fresh build and application scanning with Snyk detects vulnerabilities in your code.

In the CodeBuild console, you can look at your build history to see why your build failed, identify security vulnerabilities, and pinpoint how to fix them.

Testing /usr/src/app...
✗ Medium severity vulnerability found in org.primefaces:primefaces
Description: Cross-site Scripting (XSS)
Info: https://snyk.io/vuln/SNYK-JAVA-ORGPRIMEFACES-31642
Introduced through: org.primefaces:[email protected]
From: org.primefaces:[email protected]
Remediation:
Upgrade direct dependency org.primefaces:[email protected] to org.primefaces:[email protected] (triggers upgrades to org.primefaces:[email protected])
✗ Medium severity vulnerability found in org.primefaces:primefaces
Description: Cross-site Scripting (XSS)
Info: https://snyk.io/vuln/SNYK-JAVA-ORGPRIMEFACES-31643
Introduced through: org.primefaces:[email protected]
From: org.primefaces:[email protected]
Remediation:
Upgrade direct dependency org.primefaces:[email protected] to org.primefaces:[email protected] (triggers upgrades to org.primefaces:[email protected])
Organisation: sample-integrations
Package manager: maven
Target file: pom.xml
Open source: no
Project path: /usr/src/app
Tested 37 dependencies for known vulnerabilities, found 2 vulnerabilities, 2 vulnerable paths.
The command '/bin/sh -c ./snyk test' returned a non-zero code: 1
[Container] 2020/02/14 03:46:22 Command did not exit successfully docker build --build-arg snyk_auth_token=$SNYK_AUTH_TOKEN -t $REPOSITORY_URI:latest . exit status 1
[Container] 2020/02/14 03:46:22 Phase complete: BUILD Success: false
[Container] 2020/02/14 03:46:22 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: docker build --build-arg snyk_auth_token=$SNYK_AUTH_TOKEN -t $REPOSITORY_URI:latest .. Reason: exit status 1

Remediation

Once you remediate your vulnerabilities and check in your code, another build is triggered and an additional scan is performed by Snyk. This time, you should see the build pass with a status of Succeeded.

You can also drill down into the CodeBuild logs and see that Snyk successfully scanned the Docker Image and found no package dependency issues with your Docker container!

[Container] 2020/02/14 03:54:14 Running command $PWDUTILS/snyk test --docker $REPOSITORY_URI:latest
Testing 300326902600.dkr.ecr.us-west-2.amazonaws.com/petstore_frontend:latest...
Organisation: sample-integrations
Package manager: rpm
Docker image: 300326902600.dkr.ecr.us-west-2.amazonaws.com/petstore_frontend:latest
✓ Tested 190 dependencies for known vulnerabilities, no vulnerable paths found.

Reporting

Snyk provides detailed reports for your imported projects. You can navigate to Projects and choose View Report to set the frequency with which the project is checked for vulnerabilities. You can also choose View Report and then the Dependencies tab to see which libraries were used. Snyk offers a comprehensive database and remediation guidance for known vulnerabilities in their Vulnerability DB. Specifics on potential vulnerabilities that may exist in your code would be contingent on the particular open source dependencies used with your application.

Cleaning up

Remember to delete any resources you may have created in order to avoid additional costs. If you used the AWS CloudFormation templates provided here, you can safely remove them by deleting those stacks from the AWS CloudFormation Console.

Conclusion

In this post, you learned how to leverage various AWS services to build a fully automated CI/CD pipeline and cloud IDE development environment. You also learned how to utilize Snyk to seamlessly integrate with AWS and secure your open-source dependencies and container images. If you are interested in learning more about DevSecOps with Snyk and AWS, then I invite you to check out this workshop and watch this video.

 

About the Author

Author Photo

 

Jay is a Senior Partner Solutions Architect at AWS bringing over 20 years of experience in various technical roles. He holds a Master of Science degree in Computer Information Systems and is a subject matter expert and thought leader for strategic initiatives that help customers embrace a DevOps culture.

 

 

Collect and distribute high-resolution crypto market data with ECS, S3, Athena, Lambda, and AWS Data Exchange

Post Syndicated from Jared Katz original https://aws.amazon.com/blogs/big-data/collect-and-distribute-high-resolution-crypto-market-data-with-ecs-s3-athena-lambda-and-aws-data-exchange/

This is a guest post by Floating Point Group. In their own words, “Floating Point Group is on a mission to bring institutional-grade trading services to the world of cryptocurrency.”

The need and demand for financial infrastructure designed specifically for trading digital assets may not be obvious. There’s a rather pervasive narrative that these coins and tokens are effectively natively digital counterparts to traditional assets such as currencies, commodities, equities, and fixed income. This narrative often manifests in the form of pithy one-liners recycled by pundits attempting to communicate the value proposition of various projects in the space (such as, “Bitcoin is just a currency with an algorithmically controlled, tamper-proof monetary policy,” or, “Ether is just a commodity like gasoline that you can use to pay for computational work on a global computer.”). Unsurprisingly, we at FPG often hear the question, “What’s so special about cryptocurrencies that they warrant dedicated financial services? Why do we need solutions for problems that have already been solved?”

The truth is that these assets and the widespread public interest surrounding them are entirely unprecedented. The decentralized ledger technology that serves as an immutable record of network transactions, the clever use of proof-of-work algorithms to economically incentivize rational actors to help uphold the security of the network (the proof-of-work concept dates back at least as far as 1993, but it was not until bitcoin that the technology showed potential for widespread adoption), the irreversible nature of transactions that poses unique legal challenges in cases such as human error or extortion, the precariousness of self-custody (third-party custody solutions don’t exactly have track records that inspire trust), the regulatory uncertainties that come with the difficulty of both classifying these assets as well as arbitrating their exchange which must ultimately be reconciled by entities like the IRS, SEC, and CFTC—it is all very new, and very weird. With 24-hour market volume regularly exceeding $100 billion, we decided to direct our focus towards problems related specifically to trading these assets. Granted, crypto trading has undoubtedly matured since the days of bartering for bitcoin in web forums and witnessing 10% price spreads between international exchanges. But there is still a long path ahead.

One major pain point we are aiming to address for institutional traders involves liquidity (or, more precisely, the lack thereof). Simply put, the buying and selling of cryptocurrencies occurs across many different trading venues (exchanges), and liquidity (the offers to buy or sell a certain quantity of an asset at a certain price) continues to become more fragmented as new exchanges emerge. So say you’re trying to buy 100 bitcoins. You must buy from people who are willing to sell. As you take the best (cheapest) offers, you’re left with increasingly expensive offers. By the time you fill your order (in this example, buy all 100 bitcoins), you may have paid a much higher average price than, say, the price you paid for the first bitcoin of your order. This phenomenon is referred to as slippage. One easy way to minimize slippage is by expanding your search for offers. So rather than looking at the offers on just one exchange, look at the offers across hundreds of exchanges. This process, traditionally referred to as smart order routing (SOR), is one of the core services we provide. Our SOR service allows traders to easily submit orders that our system can match against the best offers available across multiple trading venues by actively monitoring liquidity across dozens of exchanges.

Fanning out large orders in search of the best prices is a rather intuitive and widely applicable concept—roughly 75% of equities are purchased and sold via SOR. But the value of such a service for crypto markets is particularly salient: a perpetual cycle of new exchanges surging in popularity while incumbents falter has resulted in a seemingly incessant fragmentation of liquidity across trading venues—yet traders tend to assume an exchange-agnostic mindset, concerned exclusively with finding the best price for a given quantity of an asset.

Access to both real-time and historical market data is essential to the functionality of our SOR service. The highest resolution data we could hope to obtain for a given market would include every trade and every change applied to the order book, effectively allowing us to recreate the state of a market at any given point in time. The updates provided through the WebSocket streams are not sufficient for reconstructing order books. We also need to periodically fetch snapshots of the order books and store those, which we can do using an exchange’s REST API. We can fetch a snapshot and apply the corresponding updates from the streams to “replay” the order book.
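
As a rough illustration of the replay idea (not any particular exchange’s format), assume each update carries arrays of (price, quantity) pairs for bids and asks, where a quantity of zero means the price level was removed:

# Sketch: rebuild an order book from a REST snapshot plus streamed updates.
# The update shape here is an assumption; real exchange payloads differ.
def apply_level(levels, price, quantity):
    if quantity == 0:
        levels.pop(price, None)   # level removed from the book
    else:
        levels[price] = quantity  # level added or changed

def replay(snapshot, updates):
    book = {"bids": dict(snapshot["bids"]), "asks": dict(snapshot["asks"])}
    for update in updates:        # updates must be applied in sequence order
        for price, quantity in update["bids"]:
            apply_level(book["bids"], price, quantity)
        for price, quantity in update["asks"]:
            apply_level(book["asks"], price, quantity)
    return book                   # the book state after the last update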

Fortunately, this data is freely available, because many exchanges offer real-time feeds of market data via WebSocket APIs. We found several third-party vendors selling subscriptions to these data sets, typically in the form of CSV dumps delivered at a weekly or monthly cadence. This presented the question of build vs. buy. Given that we felt capable of building a robust and reliable system for ingesting real-time market data in a relatively short amount of time and at a fraction of the cost of purchasing the data from a vendor, we were already leaning in favor of building. Further investigation made buying look like an increasingly unattractive option. Disclaimers that multiple vendors issued about their inability to guarantee data quality and consistency did not inspire confidence. Inspecting sample data sets revealed that some essential fields provided in the original data streams were missing—fields necessary for achieving our goal of recreating the state of a market at an arbitrary point in time. We also recognized that a weekly or monthly delivery schedule would restrict our ability to explore relatively recent market data.

This post provides a high-level overview of how we ingest and store real-time market data and how we use the AWS Data Exchange API to organize and publish our data sets programmatically. Our system’s functionality extends well beyond data ingestion, normalization, and persistence; we run dedicated services for data validation, caching the most recent trade and order book for every market, computing and storing derivative metrics, and other services that help safeguard data accuracy and minimize the latency of our trading systems.

Data ingestion

The WebSocket streams we connect to for data consumption are often the same APIs responsible for providing real-time updates to an exchange’s trading dashboard.

WebSocket connections transmit data as discrete messages. We can inspect the content of individual messages as they stream into the browser. For example, the following screenshot shows a batch of order book updates.

The updates are expressed as arrays of bids and asks that were either added to the book or removed from it. Client-side code processes each update, resulting in a real-time rendering of the market’s order book. In practice, our data ingestion service (Ingester) does not read a single stream, but rather thousands of different streams, covering various data feeds for all markets across multiple exchanges. All the connections required for such broad coverage and the resulting flood of incoming data raise some obvious concerns about data loss. We’ve taken several measures to mitigate such concerns, including a redundant system design that allows us to spin up an arbitrary number of instances of the Ingester service. Like most of our microservices, Ingester is a Dockerized service run on Amazon ECS and deployed via Terraform.

All these instances consume the same data feeds as each other while a downstream mechanism handles deduplication (this is covered in more detail later in this post). We also set up Amazon CloudWatch alerts to notify us when we detect non-contiguous messages, indicating a gap in the incoming data. The alerts don’t directly mitigate data loss, but they do serve the important function of prompting an investigation.

Ingester builds up separate buffers of incoming messages, split out by data-type/exchange/market. Then, after a fixed time interval, each buffer is flushed into Amazon S3 as a gzipped JSON file. The buffer-flush cycle repeats.

The following screenshot shows a portion of the file content.

This code snippet is a single, pretty-printed JSON record from the file in the screenshot above.

{
   "event_type":"trade",
   "timestamp":1571980320422,
   "ticker_pair":"BTCUSDT",
   "trade_id":194230159,
   "price":"7405.69000000",
   "quantity":"3.20285300",
   "buyer_order_id":730178987,
   "seller_order_id":730178953,
   "trade_timestamp":1571980320417,
   "buyer_market_maker":false,
   "M":true
}

Ingester handles additional functionality, such as applying pre-defined mappings of venue-specific field names to our internal field names. Data normalization is one of many processes necessary to enable our systems to build a holistic understanding of market dynamics.

As with most distributed system designs, our services are written with horizontal scalability as a first-order priority. We took the same approach in designing our data ingestion service, but it has some features that make it a bit different than the archetypical horizontally scalable microservice. The most common motivations for adjusting the number of instances of a given service are load-balancing and throttling throughput. Either your system is experiencing backpressure and a consumer service scales to alleviate that pressure, or the consumer is over-provisioned and you scale down the number of instances for the sake of parsimony. For our data ingestion service, however, our motivation for running multiple instances is to minimize data loss via redundancy. The CPU usage for each instance is independent of instance count, because each instance does identical work.

For example, rather than helping alleviate backpressure by pulling messages from a single queue, each instance of our data ingestion service connects to the same WebSocket streams and performs the same amount of work. Another somewhat unusual and confounding aspect of horizontally scaling our data ingestion service is related to state: we batch records in memory and flush the records to S3 every minute (based on the incoming message’s timestamp, not the system timestamp, because those would be inconsistent). Redundancy is our primary measure for minimizing data loss, but we also need each instance to write the files to S3 in such a way that we don’t end up with duplicate records. Our first thought was that we’d need a mechanism for coordinating activity across the instances, such as maintaining a cache that would allow us to check if a record had already been persisted. But we realized that we could perform this deduplication without any coordination between instances at all. Most of the message streams we consume publish messages with sequence IDs. We can combine the sequence IDs with the incoming message timestamp to achieve our deduplication mechanism: we can deterministically generate the same exact file names containing the exact same data by writing our service code to check that the message added to the batch has the appropriate sequence ID relative to the previous message in the batch and using the timestamp on the incoming message to determine the exact start and end of each batch (we typically get a UNIX timestamp and check when we’ve rolled over to the next clock minute). This allows us to simply rely on a key collision in S3 for deduplication.

AWS suggests a similar solution for a slightly different problem, relating to Amazon Kinesis Data Streams. For more information, see Handling Duplicate Records.

With this scheme, even if records are processed more than one time, the resulting Amazon S3 file has the same name and has the same data. The retries only result in writing the same data to the same file more than one time.
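
Concretely, the scheme boils down to deriving the S3 key purely from the data itself, so every Ingester instance that consumed the same messages writes the same object. The sketch below uses an illustrative bucket name and key layout, not our real ones.

import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def flush_batch(records, exchange, market, data_type, bucket="example-market-data"):
    # Name the object after the clock minute of the first record's exchange
    # timestamp (milliseconds), so duplicate flushes collide on the same key.
    minute = datetime.fromtimestamp(records[0]["timestamp"] / 1000, tz=timezone.utc)
    key = f"{data_type}/{exchange}/{market}/{minute:%Y/%m/%d}/{minute:%H%M}.json.gz"
    body = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))
    s3.put_object(Bucket=bucket, Key=key, Body=body)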

After we store the data, we can perform simple analytics queries on the billions of records we’ve stored in S3 using Amazon Athena, a query service that requires minimal configuration and zero infrastructure overhead. Athena has a concept of partitions (inherited from one of its underlying services, Apache Hive). Partitions are mappings between virtual columns (in our case: pair, year, month, and day) and the S3 directories in which the corresponding data is stored.

S3’s file system is not actually hierarchical. Files are prepended with long key prefixes that are rendered as directories in the AWS console when browsing a bucket’s contents. This has some non-trivial performance consequences when querying or filtering on large data sets.

The following screenshot illustrates a typical directory path.

By pointing Athena directly to a particular subset of data, a well-defined partitioning scheme can drastically reduce query run times and costs. Though the ability to perform ad hoc business analytics queries is primarily a convenience, taking the time to choose a sane multi-level partitioning scheme for Athena based on some of our most common access patterns seemed worthwhile. A poorly designed partition structure can result in Athena unnecessarily scanning huge swaths of data and ultimately render the service unusable.
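
For example, a query that filters on the partition columns only touches the matching S3 prefixes. This boto3 snippet is illustrative; the table, database, and output location names are assumptions, not our real ones.

import boto3

athena = boto3.client("athena")

# Filtering on the partition columns (pair, year, month, day) lets Athena
# prune prefixes instead of scanning the entire data set.
athena.start_query_execution(
    QueryString="""
        SELECT *
        FROM trades
        WHERE pair = 'BTCUSDT' AND year = '2020' AND month = '02' AND day = '14'
    """,
    QueryExecutionContext={"Database": "market_data"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)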

Data publication

Our pipeline for transforming thousands of small gzipped JSON files into clean CSVs and loading them into AWS Data Exchange involves three distinct jobs, each expressed as an AWS Lambda function.

Job 1

Job 1 is initiated shortly after midnight UTC by a cron-scheduled CloudWatch event. As mentioned previously, our data ingestion service’s batching mechanism flushes each batch to S3 at a regular time interval. A timestamp on the incoming message (applied server-side) determines the rollover from one interval to the next, as opposed to the ingestion service’s system timestamp, so in the rare case that a non-trivial amount of time elapses between the consumption of the final message of batch n and the first message of batch n+1, we kick off the first Lambda function 20 minutes after midnight UTC to minimize the likelihood of omitting data pending write.

Job 1 formats values for the date and data source into an Athena query template and outputs the query results as a CSV to a specified prefix path in S3. (Every Athena query produces a .metadata file and a CSV file of the query results, though DDL statements do not output a CSV.) This PUT request to S3 triggers an S3 event notification.

We run a full replica data ingestion system as an additional layer of redundancy. Using the coalesce conditional expression, the Athena query in Job 1 merges data from our primary system with the corresponding data from our replica system, and fills in any gaps while deduplicating redundant records.

We experimented fairly extensively with AWS Glue and PySpark for the ETL-related work performed in Job 1. When we realized that we could merge all the small source files into one, join the primary and replica data sets, and sort the results with a single Athena query, we decided to stick with this seemingly simpler and more elegant approach.

The following code shows one of our Athena query templates.

Job 2

Job 2 is triggered by the S3 event notification from Job 1. Job 2 simply copies the query results CSV file to a different key within the same S3 bucket.

The motivation for this step is twofold. First, we cannot dictate the name of an Athena query results CSV file; it is automatically set to the Athena query ID. Second, when adding an S3 object as an asset to an AWS Data Exchange revision, the asset’s name is automatically set to the S3 object’s key. So to dictate how the CSV file name appears in AWS Data Exchange, we must first rename it, which we accomplish by copying it to a specified S3 key.
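
The copy itself is a one-liner with boto3; the bucket and key names below are illustrative.

import boto3

s3 = boto3.client("s3")

# Copy the Athena results CSV (named after the query ID) to a predictable key,
# which becomes the asset name when it is imported into AWS Data Exchange.
s3.copy_object(
    Bucket="example-results-bucket",
    CopySource={"Bucket": "example-results-bucket",
                "Key": "athena-results/1b2c3d4e-query-id.csv"},
    Key="dataexchange/binance-trades/2020-02-14.csv",
)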

Job 3

Job 3 handles all work related to AWS Data Exchange and AWS Marketplace Catalog via their respective APIs. We use boto3, AWS’s Python SDK, to interface with these APIs. The AWS Marketplace Catalog API is necessary for adding data set revisions to products that have already been published. For more information, see Tutorial: Adding New Data Set Revisions to a Published Data Product.

Our code explicitly defines mappings with the following structure:

data source / DataSet / Product

The following code shows how we configure relationships between data sources, data sets, and products.

Our data sources are typically represented by a trading venue and data type combination (such as Binance trades or CoinbasePro order books). Each new file for a given data source is delivered as a single asset within a single new revision for a particular data set.

An S3 trigger kicks off the Lambda function. The trigger is scoped to a specified prefix that maps to a single data set. The function alias feature of AWS Lambda allows us to define the unique S3 triggers for each data set while reusing the same underlying Lambda function. Job 3 carries out the following steps (note that steps 1 through 5 refer to the AWS Data Exchange API while steps 6 and 7 refer to the AWS Marketplace Catalog API):

  1. Submits a request to create a new revision for the corresponding data set via CreateRevision.
  2. Adds the file that was responsible for triggering the Lambda function to the newly created revision via CreateJob using the IMPORT_ASSETS_FROM_S3 job type. To submit this job, we need to supply a few values: the S3 bucket and key values for the file are pulled from the Lambda event message, while the RevisionID argument comes from the response to the CreateRevision call in the previous step.
  3. Kicks off the job with StartJob, sourcing the JobID argument from the response to the CreateJob call in the previous step.
  4. Polls the job’s status via GetJob (using the job ID from the response to the StartJob call in the previous step) to check that our file (the asset) was successfully added to the revision.
  5. Finalizes the revision via UpdateRevision.
  6. Requests a description of the marketplace entity using DescribeEntity, passing in the product ID stored in our hardcoded mappings as the EntityID.
  7. Kicks off a ChangeSet via StartChangeSet, passing in the entity identifier and entity type from the DescribeEntity response in the previous step, the revision ARN parsed from the response to our earlier call to CreateRevision as RevisionArn, and the data set ARN as DataSetArn, which we fetch at the start of the code’s runtime using the AWS Data Exchange API’s GetDataSet.

Here’s a thin wrapper class we wrote to carry out the steps detailed above:

from time import sleep
import logging
import json

import boto3

from config import (
    DATA_EXCHANGE_REGION,
    MARKETPLACE_CATALOG_REGION,
    LambdaS3TriggerMappings
)

logger = logging.getLogger()


class CustomDataExchangeClient:
    def __init__(self):
        self._de_client = boto3.client('dataexchange', region_name=DATA_EXCHANGE_REGION)
        self._mc_client = boto3.client('marketplace-catalog', region_name=MARKETPLACE_CATALOG_REGION)
    
    def _get_s3_data_source(self, bucket, prefix):
        return LambdaS3TriggerMappings[(bucket, prefix)]

    # Job State can be one of: WAITING | IN_PROGRESS | ERROR | COMPLETED | CANCELLED | TIMED_OUT
    def _wait_for_de_job_completion(self, job_id):
        while True:
            get_job_resp = self._de_client.get_job(JobId=job_id)
            if get_job_resp['State'] == 'COMPLETED':
                logger.info(f"Job '{job_id}' succeeded:\n\t{get_job_resp}")
                break
            elif get_job_resp['State'] in ('ERROR', 'CANCELLED'):
                raise Exception(f"Job '{job_id}' failed:\n\t{get_job_resp}")
            else:
                sleep(5)
                logger.info(f"Still waiting on job {job_id}...")
        return get_job_resp

    # ChangeSet Status can be one of: PREPARING | APPLYING | SUCCEEDED | CANCELLED | FAILED
    def _wait_for_mc_change_set_completion(self, change_set_id):
        while True:
            describe_change_set_resp = self._mc_client.describe_change_set(
                Catalog='AWSMarketplace',
                ChangeSetId=change_set_id
                )
            if describe_change_set_resp['Status'] == 'SUCCEEDED':
                logger.info(
                    f"ChangeSet '{change_set_id}' succeeded:\n\t{describe_change_set_resp}"
                )
                break
            elif describe_change_set_resp['Status'] in ('FAILED', 'CANCELLED'):
                raise Exception(
                    f"ChangeSet '{change_set_id}' failed:\n\t{describe_change_set_resp}"
                )
            else:
                sleep(1)
                logger.info(f"Still waiting on ChangeSet {change_set_id}...")
        return describe_change_set_resp

    def process_s3_event(self, s3_event):
        source_bucket = s3_event['Records'][0]['s3']['bucket']['name']
        source_key = s3_event['Records'][0]['s3']['object']['key']
        source_prefix = '/'.join(source_key.split('/')[0:-1])
        s3_data_source = self._get_s3_data_source(source_bucket, source_prefix)
        obj_name = source_key.split('/')[-1]
        
        s3_data_source.validate_object_name(obj_name)
        
        for data_set in s3_data_source.lambda_s3_trigger_target_data_sets:
            # Create revision
            create_revision_resp = self._de_client.create_revision(
                DataSetId=data_set.id,
                Comment=obj_name
            )
            logger.debug(create_revision_resp)
            revision_id = create_revision_resp['Id']
            revision_arn = create_revision_resp['Arn']

            # Create job
            create_job_resp = self._de_client.create_job(
                Type='IMPORT_ASSETS_FROM_S3',
                Details={
                    'ImportAssetsFromS3': {
                      'AssetSources': [
                          {
                              'Bucket': source_bucket,
                              'Key': source_key
                          },
                      ],
                      'DataSetId': data_set.id,
                      'RevisionId': revision_id
                    }
                }
            )
            logger.debug(create_job_resp)

            # Start job
            job_id = create_job_resp['Id']
            start_job_resp = self._de_client.start_job(JobId=job_id)
            logger.debug(start_job_resp)

            # Wait for Data Exchange job completion
            get_job_resp = self._wait_for_de_job_completion(job_id)
            logger.debug(get_job_resp)

            # Finalize revision
            update_revision_resp = self._de_client.update_revision(
                DataSetId=data_set.id,
                RevisionId=revision_id,
                Finalized=True
            )
            logger.debug(update_revision_resp)

            # Ensure revision finalization succeeded
            finalized_status = update_revision_resp['Finalized']
            if finalized_status is not True:
                raise Exception(f"Failed to finalize revision:\n{update_revision_resp}")

            # Publish the new revision to each product associated with the data set
            for product in data_set.products:
                # Describe the AWS Marketplace entity corresponding to the Data Exchange product
                describe_entity_resp = self._mc_client.describe_entity(
                    Catalog='AWSMarketplace',
                    EntityId=product.id
                )
                logger.debug(describe_entity_resp)

                entity_type = describe_entity_resp['EntityType']
                entity_id = describe_entity_resp['EntityIdentifier']

                # Isolate the target data set in the DescribeEntity response
                describe_entity_resp_data_sets = json.loads(describe_entity_resp['Details'])['DataSets']
                describe_entity_resp_data_set = list(
                    filter(lambda ds: ds['DataSetArn'] == data_set.arn, describe_entity_resp_data_sets)
                )
                # We should get the data set of interest in describe_entity_resp and only that data set
                assert len(describe_entity_resp_data_set) == 1

                # Start a ChangeSet to add the newly finalized revision to an existing product
                start_change_set_resp = self._mc_client.start_change_set(
                    Catalog='AWSMarketplace',
                    ChangeSet=[
                        {
                            "ChangeType": "AddRevisions",
                            "Entity": {
                                "Identifier": entity_id,
                                "Type": entity_type
                            },
                            "Details": json.dumps({
                                "DataSetArn": data_set.arn,
                                "RevisionArns": [revision_arn]
                            })
                        }
                    ]
                )
                logger.debug(start_change_set_resp)

                # Wait for the ChangeSet workflow to complete
                change_set_id = start_change_set_resp['ChangeSetId']
                describe_change_set_resp = self._wait_for_mc_change_set_completion(change_set_id)
                logger.debug(describe_change_set_resp)

The following screenshot shows the S3 trigger for Job 3.

The following screenshot shows an example of CloudWatch logs for Job 3.

The following screenshot shows a CloudWatch alarm for Job 3.

Finally, we can verify that our revisions were successfully added to their corresponding data sets and products through the AWS console.

AWS Data Exchange allows you to create private offers for your AWS account IDs, providing a convenient means of checking that revisions show up in each product as expected.

Conclusion

This post demonstrated how you can integrate AWS Data Exchange into an existing data pipeline frictionlessly. We’re pleased to have been invited to participate in the AWS Data Exchange private preview, and even more pleased with the service itself, which has proven to be a sophisticated yet natural extension of our system.

I want to offer special thanks to both Kyle Patsen and Rafic Melhem of the AWS Data Exchange team for generously fielding my questions (and patiently enduring my ramblings) for the better part of the past year. I also want to thank Lucas Adams for helping me design the system discussed in this post and, more importantly, for his unwavering vote of confidence.

If you are interested in learning more about FPG, don’t hesitate to contact us.

 

AWS ECS Cluster Auto Scaling is Now Generally Available

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/aws-ecs-cluster-auto-scaling-is-now-generally-available/

Today, we have launched AWS ECS Cluster Auto Scaling. This new capability improves your cluster scaling experience by increasing the speed and reliability of cluster scale-out, giving you control over the amount of spare capacity maintained in your cluster, and automatically managing instance termination on cluster scale-in.

To enable ECS Cluster Auto Scaling, you will need to create a new ECS resource type called a Capacity Provider. A Capacity Provider can be associated with an EC2 Auto Scaling Group (ASG). When you associate an ECS Capacity Provider with an ASG and add the Capacity Provider to an ECS cluster, the cluster can now scale your ASG automatically by using two new features of ECS:

  1. Managed scaling, with an automatically-created scaling policy on your ASG, and a new scaling metric (Capacity Provider Reservation) that the scaling policy uses; and
  2. Managed instance termination protection, which enables container-aware termination of instances in the ASG when scale-in happens.

These new features give customers greater control over when and how Amazon ECS clusters scale in and scale out.

Capacity Provider Reservation
The new metric, called capacity provider reservation, measures the total percentage of cluster resources needed by all ECS workloads in the cluster, including existing workloads, new workloads, and changes in workload size. This metric enables the scaling policy to scale out quicker and more reliably than it could when using CPU or memory reservation metrics. Customers can also use this metric to reserve spare capacity in their clusters. Reserving spare capacity allows customers to run more containers immediately if needed, without waiting for new instances to start.
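Conceptually, the metric compares the number of instances the cluster needs in order to run all current and pending tasks with the number of instances it currently has. The following Python sketch is purely illustrative (ECS computes and publishes this metric for you), and the handling of the empty-cluster case is my own assumption:

# Illustrative sketch only: ECS computes and publishes this metric itself.
# instances_needed (M): instances required to run all current and pending tasks.
# instances_running (N): instances currently in the Auto Scaling group.
def capacity_provider_reservation(instances_needed, instances_running):
    if instances_running == 0:
        # Assumption: with tasks waiting and no instances running, the metric
        # reports a value above the target so the scaling policy scales out.
        return 200.0 if instances_needed > 0 else 100.0
    return 100.0 * instances_needed / instances_running

# With a target capacity of 100, a value above 100 drives scale-out; keeping
# the target below 100 reserves spare capacity in the cluster.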

Managed Instance Termination Protection
With instance termination protection, ECS controls which instances the scaling policy is allowed to terminate on scale-in, to minimize disruptions of running containers. These improvements help customers achieve lower operational costs and higher availability of their container workloads running on ECS.

How This Helps Customers
Customers running scalable container workloads on ECS often use metric-based scaling policies to automatically scale their ECS clusters. These scaling policies use generic metrics such as average cluster CPU and memory reservation percentages to determine when the policy should add or remove cluster instances.

Clusters running a single workload, or workloads that scale out slowly, often work well with such policies. However, customers running multiple workloads in the same cluster, or workloads that scale out rapidly, are more likely to experience problems with cluster scaling. Ideally, increases in workload size that cannot be accommodated by the current cluster should trigger the policy to scale the cluster out to a larger size.

Because the existing metrics are not container-specific and account only for resources already in use, this may happen slowly or be unreliable. Furthermore, because the scaling policy does not know where containers are running in the cluster, it can unnecessarily terminate containers when scaling in. These issues can reduce the availability of container workloads. Mitigations such as over-provisioning, custom tooling, or manual intervention often impose high operational costs.

Enough Talk, Let’s Scale
To understand these new features more clearly, I think it’s helpful to work through an example.

Amazon ECS Cluster Auto Scaling can be set up and configured using the AWS Management Console, AWS CLI, or Amazon ECS API. I’m going to open up my terminal and create a cluster.

Firstly, I create two files. The first file is called demo-launchconfig.json and defines the instance configuration for the Amazon Elastic Compute Cloud (EC2) instances that will make up my auto scaling group.

{
    "LaunchConfigurationName": "demo-launchconfig",
    "ImageId": "ami-01f07b3fa86406c96",
    "SecurityGroups": [
        "sg-0fa5be8c3749f3aa0"
    ],
    "InstanceType": "t2.micro",
    "BlockDeviceMappings": [
        {
            "DeviceName": "/dev/xvdcz",
            "Ebs": {
                "VolumeSize": 22,
                "VolumeType": "gp2",
                "DeleteOnTermination": true,
                "Encrypted": true
                }
        }
    ],
    "InstanceMonitoring": {
        "Enabled": false
    },
    "IamInstanceProfile": "arn:aws:iam::365489315573:role/ecsInstanceRole",
    "AssociatePublicIpAddress": true
}

The second file is demo-userdata.txt, and it contains the user data that will be added to each EC2 instance. The ECS_CLUSTER name included in the file must be the same as the name of the cluster we are going to create. In my case, the name is demo-news-blog-scale.

#!/bin/bash
echo ECS_CLUSTER=demo-news-blog-scale >> /etc/ecs/ecs.config

Using the create-launch-configuration command, I pass the two files I created as inputs; this creates the launch configuration that I will use in my Auto Scaling group.

aws autoscaling create-launch-configuration --cli-input-json file://demo-launchconfig.json --user-data file://demo-userdata.txt

Next, I create a file called demo-asgconfig.json and define my requirements.

{
    "LaunchConfigurationName": "demo-launchconfig", 
    "MinSize": 0,
    "MaxSize": 100,
    "DesiredCapacity": 0,
    "DefaultCooldown": 300,
    "AvailabilityZones": [ 
        "ap-southeast-1c" ], 
    "HealthCheckType": "EC2", 
    "HealthCheckGracePeriod": 300, 
    "VPCZoneIdentifier": "subnet-abcd1234", 
    "TerminationPolicies": [ 
        "DEFAULT" 
    ],
    "NewInstancesProtectedFromScaleIn": true, 
    "ServiceLinkedRoleARN": "arn:aws:iam::111122223333:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
} 

I then use the create-auto-scaling-group command to create an auto scaling group called demo-asg using the above file as an input.

aws autoscaling create-auto-scaling-group --auto-scaling-group-name demo-asg --cli-input-json file://demo-asgconfig.json

I am now ready to create a capacity provider. I create a file called demo-capacityprovider.json; importantly, I set the managedTerminationProtection property to ENABLED.

{
    "name": "demo-capacityprovider",
    "autoScalingGroupProvider": {
        "autoScalingGroupArn": "arn:aws:autoscaling:ap-southeast-1:365489315573:autoScalingGroup:e9c2f0c4-9a4c-428e-b81e-b22411a52954:autoScalingGroupName/demo-asg",
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,
            "minimumScalingStepSize": 1,
            "maximumScalingStepSize": 100
        },
        "managedTerminationProtection": "ENABLED"
    }
}

I then use the new create-capacity-provider command to create a provider using the file as an input.

aws ecs create-capacity-provider --cli-input-json file://demo-capacityprovider.json

Now all the components have been created, I can finally create a cluster. I add the capacity provider and set the default capacity provider for the cluster as demo-capacityprovider.

aws ecs create-cluster --cluster-name demo-news-blog-scale --capacity-providers demo-capacityprovider --default-capacity-provider-strategy capacityProvider=demo-capacityprovider,weight=1

I now need to wait until the cluster has moved into the active state. I use the following command to get details about the cluster.

aws ecs describe-clusters --clusters demo-news-blog-scale --include ATTACHMENTS

Now that my cluster is set up, I can register some tasks. First, I need to create a task definition. Below is a file I have created called demo-sleep-taskdef.json. All this definition does is define a container that sleeps indefinitely.

{
    "family": "demo-sleep-taskdef",
    "containerDefinitions": [
        {
            "name": "sleep",
            "image": "amazonlinux:2",
            "memory": 20,
            "essential": true,
            "command": [
                "sh",
                "-c",
                "sleep infinity"] 
        }],
    "requiresCompatibilities": [
        "EC2"] 
} 

I then register the task definition using the register-task-definition command.

aws ecs register-task-definition --cli-input-json file://demo-sleep-taskdef.json

Finally, I can create my tasks. In this case, I have created 5 tasks based on the demo-sleep-taskdef:1 definition that I just registered.

aws ecs run-task --cluster demo-news-blog-scale --count 5 --task-definition demo-sleep-taskdef:1

Because instances are not yet available to run the tasks, the tasks go into a provisioning state, which means they are waiting for capacity to become available. The capacity provider I configured will now scale out the Auto Scaling group so that instances start up and join the cluster, at which point the tasks are placed on the instances. This gives a true "scale from zero" capability, which did not previously exist.
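If you would rather watch this happen from a script than from the console, a small polling loop over the ECS API does the job. The following is only a sketch; it assumes the cluster name used in this demo and locally configured credentials:

import time
import boto3

ecs = boto3.client('ecs')
cluster = 'demo-news-blog-scale'

# Poll until every task in the cluster reaches the RUNNING state.
while True:
    task_arns = ecs.list_tasks(cluster=cluster)['taskArns']
    if task_arns:
        tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns)['tasks']
        statuses = [task['lastStatus'] for task in tasks]
        print(statuses)
        if all(status == 'RUNNING' for status in statuses):
            break
    time.sleep(15)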

Things To Know
AWS ECS Cluster Auto Scaling is now available in all regions where Amazon ECS and AWS Auto Scaling are available – check the region table for the latest list.

Happy Scaling!

— Martin

 

Amazon EKS on AWS Fargate Now Generally Available

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/amazon-eks-on-aws-fargate-now-generally-available/

Starting today, you can use Amazon Elastic Kubernetes Service to run Kubernetes pods on AWS Fargate. EKS and Fargate make it straightforward to run Kubernetes-based applications on AWS by removing the need to provision and manage infrastructure for pods.

With AWS Fargate, customers don’t need to be experts in Kubernetes operations to run a cost-optimized and highly-available cluster. Fargate eliminates the need for customers to create or manage EC2 instances for their Amazon EKS clusters.

Customers no longer have to worry about patching, scaling, or securing a cluster of EC2 instances to run Kubernetes applications in the cloud. Using Fargate, customers define and pay for resources at the pod-level. This makes it easy to right-size resource utilization for each application and allow customers to clearly see the cost of each pod.

I’m now going to use the rest of this blog to explore this new feature further and deploy a simple Kubernetes-based application using Amazon EKS on Fargate.

Let’s Build a Cluster
The simplest way to get a cluster set up is to use eksctl, the official CLI tool for EKS. The command below creates a cluster called demo-newsblog with no worker nodes.

eksctl create cluster --name demo-newsblog --region eu-west-1 --fargate

This single command did quite a lot under the hood. Not only did it create a cluster for me; amongst other things, it also created a Fargate profile.

A Fargate profile lets me specify which Kubernetes pods I want to run on Fargate and which subnets my pods run in, and it provides the IAM execution role used by the Kubernetes agent to download container images to the pod and perform other actions on my behalf.

Understanding Fargate profiles is key to understanding how this feature works. So I am going to delete the Fargate profile that was automatically created for me and recreate it manually.

To create a Fargate profile, I head over to the Amazon Elastic Kubernetes Service console and choose the cluster demo-newsblog. On the details page, under Fargate profiles, I choose Add Fargate profile.

I then need to configure my new Fargate profile. For the name, I enter demo-default.

In the Pod execution role, only IAM roles with the eks-fargate-pods.amazonaws.com service principal are shown. The eksctl tool creates an IAM role called AmazonEKSFargatePodExecutionRole, the documentation shows how this role can be created from scratch.

In the Subnets section, by default, all subnets in my cluster’s VPC are selected. However, only private subnets are supported for Fargate pods, so I deselect the two public subnets.

When I click next, I am taken to the Pod selectors screen. Here it asks me to enter a namespace. I add default, meaning that I want any pods that are created in the default Kubernetes namespace to run on Fargate. It’s important to understand that I don’t have to modify my Kubernetes app to get the pods running on Fargate, I just need a Fargate Profile – if a pod in my Kubernetes app matches the namespace defined in my profile, that pod will run on Fargate.

There is also a Match labels feature here, which I am not using. This allows you to specify the labels of the pods that you want to select, so you can get even more specific with which pods run on this profile.

Finally, I click Next and then Create. It takes a minute for the profile to create and become active.

In this demo, I also want everything to run on Fargate, including the CoreDNS pods that are part of Kubernetes. To get them running on Fargate, I will add a second Fargate profile for everything in the kube-system namespace. This time, to add a bit of variety to the demo, I will use the command line to create my profile.

Technically, I do not need to create a second profile for this. I could have added an additional namespace to the first profile, but this way, I get to explore an alternative way of creating a profile.
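For reference, a single profile covering both namespaces could be created in one call. The following boto3 sketch is illustrative only; the profile name, role ARN, and subnet IDs are placeholders you would substitute with your own values:

import boto3

eks = boto3.client('eks', region_name='eu-west-1')

# One Fargate profile that selects pods in both the default and kube-system namespaces.
response = eks.create_fargate_profile(
    fargateProfileName='demo-all-namespaces',   # hypothetical name
    clusterName='demo-newsblog',
    podExecutionRoleArn='arn:aws:iam::111122223333:role/AmazonEKSFargatePodExecutionRole',
    subnets=['subnet-0123456789abcdef0', 'subnet-0fedcba9876543210'],  # private subnets only
    selectors=[
        {'namespace': 'default'},
        {'namespace': 'kube-system'},
    ],
)
print(response['fargateProfile']['status'])  # typically CREATING at first

In this walkthrough, though, I stick with two separate profiles.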

First, I create the file below and save it as demo-kube-system-profile.json.

{
    "fargateProfileName": "demo-kube-system",
    "clusterName": "demo-news-blog",
    "podExecutionRoleArn": "arn:aws:iam::xxx:role/AmazonEKSFargatePodExecutionRole",
    "subnets": [
        "subnet-0968a124a4e4b0afe",
        "subnet-0723bbe802a360eb9"
    ],
    "selectors": [
        {
            "namespace": "kube-system"
        }
    ]
}

I then navigate to the folder that contains the file above and run the create-fargate-profile command in my terminal.

aws eks create-fargate-profile --cli-input-json file://demo-kube-system-profile.json

I am now ready to deploy a container to my cluster. To keep things simple, I deploy a single instance of nginx using the following kubectl command.

kubectl create deployment demo-app --image=nginx

I then check to see the state of my pods by running the get pods command.

kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
demo-app-6dbfc49497-67dxk   0/1     Pending   0          13s

If I run get nodes, I have three nodes (two for CoreDNS and one for nginx). These nodes represent the compute resources that have been instantiated for me to run my pods.

kubectl get nodes
NAME                                                   STATUS   ROLES    AGE     VERSION
fargate-ip-192-168-218-51.eu-west-1.compute.internal   Ready    <none>   4m45s   v1.14.8-eks
fargate-ip-192-168-221-91.eu-west-1.compute.internal   Ready    <none>   2m20s   v1.14.8-eks
fargate-ip-192-168-243-74.eu-west-1.compute.internal   Ready    <none>   4m40s   v1.14.8-eks

After a short time, I rerun the get pods command, and my demo-app now has a status of Running, meaning my container has been successfully deployed onto Fargate.

kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
demo-app-6dbfc49497-67dxk   1/1     Running   0          3m52s

Pricing and Limitations
With AWS Fargate, you pay only for the amount of vCPU and memory resources that your pod needs to run. This includes the resources the pod requests in addition to a small amount of memory needed to run Kubernetes components alongside the pod. Pods running on Fargate follow the existing pricing model. vCPU and memory resources are calculated from the time your pod’s container images are pulled until the pod terminates, rounded up to the nearest second. A minimum charge for 1 minute applies. Additionally, you pay the standard cost for each EKS cluster you run, $0.20 per hour.
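As a rough illustration of that billing model, the sketch below parameterizes the per-second rates rather than hard-coding them, because prices vary by Region; the rates are assumptions you would look up, not published numbers:

import math

def fargate_pod_cost(vcpu, memory_gb, runtime_seconds, vcpu_per_second, gb_per_second):
    # Rough sketch of the billing model described above. vcpu_per_second and
    # gb_per_second are Region-specific rates supplied by you (assumptions here).
    # Billing runs from image pull to pod termination, rounded up to the
    # nearest second, with a one-minute minimum charge.
    billable_seconds = max(math.ceil(runtime_seconds), 60)
    return billable_seconds * (vcpu * vcpu_per_second + memory_gb * gb_per_second)

# The EKS control plane itself is billed separately at $0.20 per cluster hour.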

There are currently a few limitations that you should be aware of:

  • There is a maximum of 4 vCPUs and 30 GB of memory per pod.
  • Currently there is no support for stateful workloads that require persistent volumes or file systems.
  • You cannot run Daemonsets, Privileged pods, or pods that use HostNetwork or HostPort.
  • The only load balancer you can use is an Application Load Balancer.

Get Started Today
If you want to explore Amazon EKS on AWS Fargate yourself, you can try it now by heading on over to the EKS console in the following regions: US East (N. Virginia), US East (Ohio), Europe (Ireland), and Asia Pacific (Tokyo).

— Martin

Sharing automated blueprints for Amazon ECS continuous delivery using AWS Service Catalog

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/sharing-automated-blueprints-for-amazon-ecs-continuous-delivery-using-aws-service-catalog/

This post is contributed by Mahmoud ElZayet | Specialist SA – Dev Tech, AWS

 

Modern application development processes enable organizations to improve speed and quality continually. In this innovative culture, small, autonomous teams own the entire application life cycle. While such nimble, autonomous teams speed product delivery, they can also impose costs on compliance, quality assurance, and code deployment infrastructures.

Standardized tooling and application release code helps share best practices across teams, reduce duplicated code, speed on-boarding, create consistent governance, and prevent resource over-provisioning.

 

Overview

In this post, I show you how to use AWS Service Catalog to provide standardized and automated deployment blueprints. This helps accelerate and improve your product teams’ application release workflows on Amazon ECS. Follow my instructions to create a sample blueprint that your product teams can use to release containerized applications on ECS. You can also apply the blueprint concept to other technologies, such as serverless or Amazon EC2–based deployments.

The sample templates and scripts provided here are for demonstration purposes and should not be used “as-is” in your production environment. After you become familiar with these resources, create customized versions for your production environment, taking account of in-house tools and team skills, as well as all applicable standards and restrictions.

 

Prerequisites

To use this solution, you need the following resources:

 

Sample scenario

Example Corp. has various product teams that develop applications and services on AWS. These teams have expressed interest in deploying their containerized applications managed by AWS Fargate on ECS. As part of Example Corp.'s central tooling team, you want to enable teams to quickly release their applications on Fargate. However, you also want to make sure that they comply with all best practices and governance requirements.

For convenience, I also assume that you have supplied product teams working on the same domain, application, or project with a shared AWS account for service deployment. Using this account, they all deploy to the same ECS cluster.

In this scenario, you can author and provide these teams with a shared deployment blueprint on ECS Fargate. Using AWS Service Catalog, you can share the blueprint with teams as follows:

  1. Every time that a product team wants to release a new containerized application on ECS, they retrieve a new AWS Service Catalog ECS blueprint product. This enables them to obtain the required infrastructure, permissions, and tools. As a prerequisite, the ECS blueprint requires building blocks such as a git repository or an AWS CodeBuild project. Again, you can acquire those blocks through another AWS Service Catalog product.
  2. The product team completes the ECS blueprint’s required parameters, such as the desired number of ECS tasks and application name. As an administrator, you can constrain the value of some parameters such as the VPC and the cluster name. For more information, see AWS Service Catalog Template Constraints.
  3. The ECS blueprint product deploys all the required ECS resources, configured according to best practices. You can also use the AWS Cloud Development Kit (CDK) to maintain and provision pre-defined constructs for your infrastructure.
  4. A standardized CI/CD pipeline is also generated, enabling your product teams to publish their application to ECS automatically. Ideally, this pipeline should have all stages, practices, security checks, and standards required for application release. Product teams must still author application code, create a Dockerfile, build specifications, run automated tests and deployment scripts, and complete other tasks required for application release.
  5. The ECS blueprint can be continually updated based on organization-wide feedback and to support new use cases. Your product team can always access the latest version through AWS Service Catalog. I recommend retaining multiple, customizable blueprints for various technologies.

 

For simplicity’s sake, my explanation envisions your environment as consisting of one AWS account. In practice, you can use IAM controls to segregate teams’ access to each other’s resources, even when they share an account. However, I recommend having at least two AWS accounts, one for testing and one for production purposes.

To see an example framework that helps deploy your AWS Service Catalog products to multiple accounts, see AWS Deployment Framework (ADF). This framework can also help you create cross-account pipelines that cater to different product teams’ needs, even when these teams deploy to the same technology stack.

To set up shared deployment blueprints for your production teams, follow the steps outlined in the following sections.

 

Set up the environment

In this section, I explain how to create a central ECS cluster in the appropriate VPC where teams can deploy their containers. I provide an AWS CloudFormation template to help you set up these resources. This template also creates an IAM role to be used by AWS Service Catalog later.

To run the CloudFormation template:

1. Use a git client to clone the following GitHub repository to a local directory. This will be the directory where you will run all the subsequent AWS CLI commands.

2. Using the AWS CLI, run the following commands. Replace <Application_Name> with a lowercase string with no spaces representing the application or microservice that your product team plans to release—for example, myapp.

aws cloudformation create-stack --stack-name "fargate-blueprint-prereqs" --template-body file://environment-setup.yaml --capabilities CAPABILITY_NAMED_IAM --parameters ParameterKey=ApplicationName,ParameterValue=<Application_Name>

3. Keep running the following command until the output reads CREATE_COMPLETE (alternatively, use a CloudFormation waiter, as sketched after these steps):

aws cloudformation describe-stacks --stack-name "fargate-blueprint-prereqs" --query Stacks[0].StackStatus

4. In case of error, use the describe-events CLI command or review error details on the console.

5. When the stack creation reads CREATE_COMPLETE, run the following command, and make a note of the output values in an editor of your choice. You need this information for a later step:

aws cloudformation describe-stacks  --stack-name fargate-blueprint-prereqs --query Stacks[0].Outputs

6. Run the following commands to copy those CloudFormation templates to Amazon S3. Replace <Template_Bucket_Name> with the template bucket output value you just copied into your editor of choice:

aws s3 cp core-build-tools.yml s3://<Template_Bucket_Name>/core-build-tools.yml

aws s3 cp ecs-fargate-deployment-blueprint.yml s3://<Template_Bucket_Name>/ecs-fargate-deployment-blueprint.yml
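As a side note on step 3 above: rather than re-running describe-stacks by hand, you could let a CloudFormation waiter block until the stack finishes. A minimal boto3 sketch, assuming locally configured credentials:

import boto3

cfn = boto3.client('cloudformation')

# Blocks until the stack reaches CREATE_COMPLETE, or raises if creation fails.
waiter = cfn.get_waiter('stack_create_complete')
waiter.wait(StackName='fargate-blueprint-prereqs')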

Create AWS Service Catalog products

In this section, I show you how to create two AWS Service Catalog products for teams to use in publishing their containerized app:

  1. Core Build Tools
  2. ECS Fargate Deployment Blueprint

To create an AWS Service Catalog portfolio that includes these products:

1. Using the AWS CLI, run the following command, replacing <Application_Name> with the application name you defined earlier and <Template_Bucket_Name> with the template bucket output value you copied into your editor of choice:

aws cloudformation create-stack --stack-name "fargate-blueprint-catalog-products" --template-body file://catalog-products.yaml --parameters ParameterKey=ApplicationName,ParameterValue=<Application_Name> ParameterKey=TemplateBucketName,ParameterValue=<Template_Bucket_Name>

2. After a few minutes, check the stack creation completion. Run the following command until the output reads CREATE_COMPLETE:

aws cloudformation describe-stacks --stack-name "fargate-blueprint-catalog-products" --query Stacks[0].StackStatus

3. In case of error, use the describe-events CLI command or check error details in the console.

Your AWS Service Catalog configuration should now be ready.

 

Test the product team experience

In this section, I show you how to use IAM roles to impersonate a product team member and simulate their first experience of containerized application deployment.

 

Assume team role

To assume the role that you created during the environment setup step:

1. In the AWS Management Console, follow the instructions in Switching a Role.

  • For Account, enter the account ID used in the sample solution. To learn more about how to find an AWS account ID, see Your AWS Account ID and Its Alias.
  • For Role, enter <Application_Name>-product-team-role, where <Application_Name> is the same application name you defined in Environment Setup section.
  • (Optional) For Display name, enter a custom session value.

You are now logged in as a member of the product team.

 

Provision core build product

Next, provision the core build tools for your blueprint:

  1. In the Service Catalog console, you should now see the two products created earlier listed under Products.
  2. Select the first product, Core Build Tools.
  3. Choose LAUNCH PRODUCT.
  4. Name the product something such as <Application_Name>-build-tools, replacing <Application_Name> with the name previously defined for your application.
  5. Provide the same application name you defined previously.
  6. Leave the ContainerBuild parameter default setting as yes, as you are building a container requiring a container repository and its associated permissions.
  7. Choose NEXT three times, then choose LAUNCH.
  8. Under Events, watch the Status property. Keep refreshing until the status reads Succeeded. In case of failure, choose the URL value next to the key CloudformationStackARN. This choice takes you to the CloudFormation console, where you can find more information on the errors.
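If you prefer to provision the product from a script instead of the console, the Service Catalog API supports the same flow. The following boto3 sketch is only an outline; the product and provisioning artifact IDs are placeholders you would look up first (for example, with search_products and describe_product), and the parameter keys assume the template's parameter names:

import boto3

sc = boto3.client('servicecatalog')

# Placeholder IDs: look these up before running (they are not real values).
response = sc.provision_product(
    ProductId='prod-xxxxxxxxxxxxx',
    ProvisioningArtifactId='pa-xxxxxxxxxxxxx',
    ProvisionedProductName='myapp-build-tools',
    ProvisioningParameters=[
        {'Key': 'ApplicationName', 'Value': 'myapp'},
        {'Key': 'ContainerBuild', 'Value': 'yes'},
    ],
)
print(response['RecordDetail']['Status'])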

Now you have the following build tools created along with the required permissions:

  • AWS CodeCommit repository to store your code
  • CodeBuild project to build your container image and test your application code
  • Amazon ECR repository to store your container images
  • Amazon S3 bucket to store your build and release artifacts

 

Provision ECS Fargate deployment blueprint

In the Service Catalog console, follow the same steps to deploy the blueprint for ECS deployment. Here are the product provisioning details:

  • Product Name: <Application_Name>-fargate-blueprint.
  • Provisioned Product Name: <Application_Name>-ecs-fargate-blueprint.
  • For the parameters Subnet1, Subnet2, VpcId, enter the output values you copied earlier into your editor of choice in the Setup Environment section.
  • For other parameters, enter the following:
    • ApplicationName: The same application name you defined previously.
    • ClusterName: Enter the value example-corp-ecs-cluster, which is the name chosen in the template for the central cluster.
  • Leave the DesiredCount and LaunchType parameters to their default values.

After the blueprint product creation completes, you should have an ECS service with a sample task definition for your product team. The build tools created earlier include the permissions required for deploying to the ECS service. Also, a CI/CD pipeline has been created to guide your product teams as they publish their application to the ECS service. Ideally, this pipeline should have all stages, practices, security checks, and standards required for application release.

Product teams still have to author application code, create a Dockerfile, build specifications, run automated tests and deployment scripts, and perform other tasks required for application release. The blueprint product can provide wiki links to reference examples for these steps, or access to pre-provisioned sample pipelines.

 

Test your pipeline

Now, upload a sample app to test your pipeline:

  1. Log in with the product team role.
  2. In the CodeCommit console, select the repository with the application name that you defined in the environment setup section.
  3. Scroll down, choose Add file, Create file.
  4. Paste the following in the page editor, which is a script to build the container image and push it to the ECR repository:
version: 0.2
phases:
  pre_build:
    commands:
      - $(aws ecr get-login --no-include-email)
      - TAG="$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | head -c 8)"
      - IMAGE_URI="${REPOSITORY_URI}:${TAG}"
  build:
    commands:
      - docker build --tag "$IMAGE_URI" .
  post_build:
    commands:
      - docker push "$IMAGE_URI"      
      - printf '[{"name":"%s","imageUri":"%s"}]' "$APPLICATION_NAME" "$IMAGE_URI" > images.json
artifacts:
  files: 
    - images.json
    - '**/*'

5. For File name, enter buildspec.yml.

6. For Author name and Email address, enter your name and your preferred email address for the commit. Although optional, the addition of a commit message is a good practice.

7. Choose Commit changes.

8. Repeat the same steps for the Dockerfile. The sample Dockerfile creates a straightforward PHP application. Typically, you add your application content to that image.

File name: Dockerfile

File content:

FROM ubuntu:12.04

# Install dependencies
RUN apt-get update -y
RUN apt-get install -y git curl apache2 php5 libapache2-mod-php5 php5-mcrypt php5-mysql

# Configure apache
RUN a2enmod rewrite
RUN chown -R www-data:www-data /var/www
ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_LOG_DIR /var/log/apache2

EXPOSE 80

CMD ["/usr/sbin/apache2", "-D",  "FOREGROUND"]

Your pipeline should now be ready to run successfully. Although you can list all current pipelines in the Region, you can only describe and modify pipelines that have a prefix matching your application name. To confirm:

  1. In the AWS CodePipeline console, select the pipeline <Application_Name>-ecs-fargate-pipeline.
  2. The pipeline should now be running.

Because you performed two commits to the repository from the console, you must wait for the second run to complete before successful deployment to ECS Fargate.

 

Clean up

To clean up the environment, run the following commands in the AWS CLI, replacing <Application_Name> with your application name, <Account_Id> with your AWS account ID (with no hyphens), and <Template_Bucket_Name> with the template bucket output value you copied into your editor of choice:

aws ecr delete-repository --repository-name <Application_Name> --force

aws s3 rm s3://<Application_Name>-artifactbucket-<Account_Id> --recursive

aws s3 rm s3://<Template_Bucket_Name> --recursive

 

To remove the AWS Service Catalog products:

  1. Log in with the Product team role
  2. In the console, follow the instructions at Deleting Provisioned Products.
  3. Delete the AWS Service Catalog products in reverse order, starting with the blueprint product.

Run the following commands to delete the administrative resources:

aws cloudformation delete-stack --stack-name fargate-blueprint-catalog-products

aws cloudformation delete-stack --stack-name fargate-blueprint-prereqs

Conclusion

In this post, I showed you how to design and build ECS Fargate deployment blueprints. I explained how these accelerate and standardize the release of containerized applications on AWS. Your product teams can keep getting the latest standards and coded best practices through those automated blueprints.

As always, AWS welcomes feedback. Please submit comments or questions below.

Wag!: Why Even Your Dog Loves a Canary Deployment

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/wag-why-even-your-dog-loves-a-canary-deployment/

Since August 26 was National Dog Day, we thought it might be a great time to talk about why Wag!, an on-demand dog-walking, boarding, and pet-sitting service that's available in 43 states and 100 cities, deployed a blue-green (or canary) architecture for increased availability and reduced risk using Amazon ECS.

Last June, Dave Bullock, Director of Engineering from Wag Labs Inc., talked with AWS Senior Solutions Architect Peter Tilsen about how the company needed to find a faster solution for updates and rollbacks than what the previous solution, AWS OpsWorks, could provide. What used to take up to 10 minutes for a rolling deployment across all of their cluster instances is now done in a few minutes.

Listen to Dave as he shows us how Wag! runs canary deployments (a technique for releasing applications by shifting traffic between two identical environments running different versions of the application) of containerized applications with ECS.

Check out more videos in the This Is My Architecture series.

About the author

Annik Stahl is a Senior Program Manager at AWS, specializing in blog and magazine content as well as customer ratings and satisfaction. Having been the face of Microsoft Office for 10 years as the Crabby Office Lady columnist, she loves getting to know her customers and wants to hear from you.

Optimizing Amazon ECS task density using awsvpc network mode

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/optimizing-amazon-ecs-task-density-using-awsvpc-network-mode/

This post is contributed by Tony Pujals | Senior Developer Advocate, AWS

 

AWS recently increased the number of elastic network interfaces available when you run tasks on Amazon ECS, controlled by an account setting called awsvpcTrunking. If you use the Amazon EC2 launch type and task networking (awsvpc network mode), you can now run more tasks on an instance—5 to 17 times as many—as you did before.

As more of you embrace microservices architectures, you deploy increasing numbers of smaller tasks. AWS now offers you the option of more efficient packing per instance, potentially resulting in smaller clusters and associated savings.

 

Overview

To manage your own cluster of EC2 instances, use the EC2 launch type. Use task networking to run ECS tasks using the same networking properties as if tasks were distinct EC2 instances.

Task networking offers several benefits. Every task launched with awsvpc network mode has its own attached network interface, a primary private IP address, and an internal DNS hostname. This simplifies container networking and gives you more control over how tasks communicate, both with each other and with other services within their virtual private clouds (VPCs).

Task networking also lets you take advantage of other EC2 networking features like VPC Flow Logs. This feature lets you monitor traffic to and from tasks. It also provides greater security control for containers, allowing you to use security groups and network monitoring tools at a more granular level within tasks. For more information, see Introducing Cloud Native Networking for Amazon ECS Containers.

However, if you run container tasks on EC2 instances with task networking, you can face a networking limit. This might surprise you, particularly when an instance has plenty of free CPU and memory. The limit reflects the number of network interfaces available to support awsvpc network mode per container instance.

 

Raise network interface density limits with trunking

The good news is that AWS raised network interface density limits by implementing a networking feature on ECS called “trunking.” This is a technique for multiplexing data over a shared communication link.

If you’re migrating to microservices using AWS App Mesh, you should optimize network interface density. App Mesh requires awsvpc networking to provide routing control and visibility over an ever-expanding array of running tasks. In this context, increased network interface density might save money.

By opting for network interface trunking, you should see a significant increase in capacity—from 5 to 17 times more than the previous limit. For more information on the new task limits per container instance, see Supported Amazon EC2 Instance Types.

Applications with tasks not hitting CPU or memory limits also benefit from this feature through the more cost-effective “bin packing” of container instances.

 

Trunking is an opt-in feature

AWS chose to make the trunking feature opt-in due to the following factors:

  • Instance registration: While normal instance registration is straightforward with trunking, this feature increases the number of asynchronous instance registration steps that can potentially fail. Any such failures might add extra seconds to launch time.
  • Available IP addresses: The “trunk” belongs to the same subnet in which the instance’s primary network interface originates. This effectively reduces the available IP addresses and potentially the ability to scale out on other EC2 instances sharing the same subnet. The trunk consumes an IP address. With a trunk attached, there are two assigned IP addresses per instance, one for the primary interface and one for the trunk.
  • Differing customer preferences and infrastructure: If you have high CPU or memory workloads, you might not benefit from trunking. Or, you may not want awsvpc networking.

Consequently, AWS leaves it to you to decide if you want to use this feature. AWS might revisit this decision in the future, based on customer feedback. For now, your account roles or users must opt in to the awsvpcTrunking account setting to gain the benefits of increased task density per container instance.

 

Enable trunking

Enable the ECS elastic network interface trunking feature to increase the number of network interfaces that can be attached to supported EC2 container instance types. You must meet the following prerequisites before you can launch a container instance with the increased network interface limits:

  • Your account must have the AWSServiceRoleForECS service-linked role for ECS.
  • You must opt in to the awsvpcTrunking account setting.

 

Make sure that a service-linked role exists for ECS

A service-linked role is a unique type of IAM role linked to an AWS service (such as ECS). This role lets you delegate the permissions necessary to call other AWS services on your behalf. Because ECS is a service that manages resources on your behalf, you need this role to proceed.

In most cases, you won’t have to create a service-linked role. If you created or updated an ECS cluster, ECS likely created the service-linked role for you.

You can confirm that your service-linked role exists using the AWS CLI, as shown in the following code example:

$ aws iam get-role --role-name AWSServiceRoleForECS
{
    "Role": {
        "Path": "/aws-service-role/ecs.amazonaws.com/",
        "RoleName": "AWSServiceRoleForECS",
        "RoleId": "AROAJRUPKI7I2FGUZMJJY",
        "Arn": "arn:aws:iam::226767807331:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS",
        "CreateDate": "2018-11-09T21:27:17Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "ecs.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        },
        "Description": "Role to enable Amazon ECS to manage your cluster.",
        "MaxSessionDuration": 3600
    }
}

If the service-linked role does not exist, create it manually with the following command:

aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com

For more information, see Using Service-Linked Roles for Amazon ECS.

 

Opt in to the awsvpcTrunking account setting

Your account, IAM user, or role must opt in to the awsvpcTrunking account setting. Select this setting using the AWS CLI or the ECS console. You can opt in for an account by making awsvpcTrunking its default setting. Or, you can enable this setting for the role associated with the instance profile with which the instance launches. For instructions, see Account Settings.
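If you prefer to script the opt-in, the same setting can be changed through the ECS API. A minimal boto3 sketch, assuming your credentials carry the necessary permissions:

import boto3

ecs = boto3.client('ecs')

# Opt the calling IAM identity in to ENI trunking.
ecs.put_account_setting(name='awsvpcTrunking', value='enabled')

# Or make it the default for all IAM identities in the account.
ecs.put_account_setting_default(name='awsvpcTrunking', value='enabled')

# Verify the effective setting.
for setting in ecs.list_account_settings(name='awsvpcTrunking')['settings']:
    print(setting['name'], setting['value'])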

 

Other considerations

After completing the prerequisites described in the preceding sections, launch a new container instance with increased network interface limits using one of the supported EC2 instance types.

Keep the following in mind:

  • It’s available with the latest variant of the ECS-optimized AMI.
  • It only affects creation of new container instances after opting into awsvpcTrunking.
  • It only affects tasks created with awsvpc network mode and EC2 launch type. Tasks created with the AWS Fargate launch type always have a dedicated network interface, no matter how many you launch.

For details, see ENI Trunking Considerations.

 

Summary

If you seek to optimize the usage of your EC2 container instances for clusters that you manage, enable the increased network interface density feature with awsvpcTrunking. By following the steps outlined in this post, you can launch tasks using significantly fewer EC2 instances. This is especially useful if you embrace a microservices architecture, with its increasing numbers of lighter tasks.

Hopefully, you found this post informative and the proposed solution intriguing. As always, AWS welcomes all feedback or comment.

Using AWS App Mesh with Fargate

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/using-aws-app-mesh-with-fargate/

This post is contributed by Tony Pujals | Senior Developer Advocate, AWS

 

AWS App Mesh is a service mesh, which provides a framework to control and monitor services spanning multiple AWS compute environments. My previous post provided a walkthrough to get you started. In it, I showed deploying a simple microservice application to Amazon ECS and configuring App Mesh to provide traffic control and observability.

In this post, I show more advanced techniques using AWS Fargate as an ECS launch type. I show you how to deploy a specific version of the colorteller service from the previous post. Finally, I move on and explore distributing traffic across other environments, such as Amazon EC2 and Amazon EKS.

I simplified this example for clarity, but in the real world, creating a service mesh that bridges different compute environments becomes useful. Fargate is a compute service for AWS that helps you run containerized tasks using the primitives (the tasks and services) of an ECS application. This lets you work without needing to directly configure and manage EC2 instances.

 

Solution overview

This post assumes that you already have a containerized application running on ECS, but want to shift your workloads to use Fargate.

You deploy a new version of the colorteller service with Fargate, and then begin shifting traffic to it. If all goes well, then you continue to shift more traffic to the new version until it serves 100% of all requests. Use the labels "blue" to represent the original version and "green" to represent the new version. The following diagram shows the programmer model of the Color App.

You want to begin shifting traffic from version 1 (represented by colorteller-blue in the following diagram) over to version 2 (represented by colorteller-green).

In App Mesh, every version of a service is ultimately backed by actual running code somewhere, in this case ECS/Fargate tasks. Each service has its own virtual node representation in the mesh that provides this conduit.

The following diagram shows the App Mesh configuration of the Color App.

 

 

After shifting the traffic, you must physically deploy the application to a compute environment. In this demo, colorteller-blue runs on ECS using the EC2 launch type and colorteller-green runs on ECS using the Fargate launch type. The goal is to test with a portion of traffic going to colorteller-green, ultimately increasing to 100% of traffic going to the new green version.

 

AWS compute model of the Color App.

Prerequisites

Before following along, set up the resources and deploy the Color App as described in the previous walkthrough.

 

Deploy the Fargate app

To get started after you complete your Color App, configure it so that your traffic goes to colorteller-blue for now. The blue color represents version 1 of your colorteller service.

Log into the App Mesh console and navigate to Virtual routers for the mesh. Configure the HTTP route to send 100% of traffic to the colorteller-blue virtual node.

The following screenshot shows routes in the App Mesh console.

Test the service and confirm in AWS X-Ray that the traffic flows through the colorteller-blue as expected with no errors.

The following screenshot shows tracing the colorgateway virtual node.

 

Deploy the new colorteller to Fargate

With your original app in place, deploy the second version on Fargate and begin slowly increasing the traffic that it handles rather than the original. The app colorteller-green represents version 2 of the colorteller service. Initially, only send 30% of your traffic to it.

If your monitoring indicates a healthy service, then increase it to 60%, then finally to 100%. In the real world, you might choose more granular increases with automated rollout (and rollback if issues arise), but this demonstration keeps things simple.

You pushed the gateway and colorteller images to ECR (see Deploy Images) in the previous post, and then launched ECS tasks with these images. For this post, launch an ECS task using the Fargate launch type with the same colorteller and envoy images. This sets up the running envoy container as a sidecar for the colorteller container.

You don’t have to manually configure the EC2 instances in a Fargate launch type. Fargate automatically colocates the sidecar on the same physical instance and lifecycle as the primary application container.

To begin deploying the Fargate instance and diverting traffic to it, follow these steps.

 

Step 1: Update the mesh configuration

You can download updated AWS CloudFormation templates located in the repo under walkthroughs/fargate.

This updated mesh configuration adds a new virtual node (colorteller-green-vn). It updates the virtual router (colorteller-vr) for the colorteller virtual service so that it distributes traffic between the blue and green virtual nodes at a 2:1 ratio. That is, the green node receives one-third of the traffic.
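The CloudFormation templates applied by appmesh-colorapp.sh are the source of truth for this walkthrough, but for illustration, the same 2:1 split could be expressed directly against the App Mesh API. The following boto3 sketch assumes the mesh, router, and route names used in this demo, which you should verify against your own stack:

import boto3

appmesh = boto3.client('appmesh')

# Weight the route so colorteller-blue-vn receives 2/3 of the traffic and
# colorteller-green-vn receives 1/3.
appmesh.update_route(
    meshName='appmesh-mesh',            # assumption: the mesh name used in this demo
    virtualRouterName='colorteller-vr',
    routeName='colorteller-route',
    spec={
        'httpRoute': {
            'match': {'prefix': '/'},
            'action': {
                'weightedTargets': [
                    {'virtualNode': 'colorteller-blue-vn', 'weight': 2},
                    {'virtualNode': 'colorteller-green-vn', 'weight': 1},
                ]
            },
        }
    },
)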

$ ./appmesh-colorapp.sh
...
Waiting for changeset to be created..
Waiting for stack create/update to complete
...
Successfully created/updated stack - DEMO-appmesh-colorapp
$

Step 2: Deploy the green task to Fargate

The fargate-colorteller.sh script creates parameterized template definitions before deploying the fargate-colorteller.yaml CloudFormation template. The change to launch a colorteller task as a Fargate task is in fargate-colorteller-task-def.json.

$ ./fargate-colorteller.sh
...

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - DEMO-fargate-colorteller
$

 

Verify the Fargate deployment

The ColorApp endpoint is one of the CloudFormation template’s outputs. You can view it in the stack output in the AWS CloudFormation console, or fetch it with the AWS CLI:

$ colorapp=$(aws cloudformation describe-stacks --stack-name=$ENVIRONMENT_NAME-ecs-colorapp --query="Stacks[0].Outputs[?OutputKey=='ColorAppEndpoint'].OutputValue" --output=text); echo $colorapp
http://DEMO-Publi-YGZIJQXL5U7S-471987363.us-west-2.elb.amazonaws.com

Assign the endpoint to the colorapp environment variable so you can use it for a few curl requests:

$ curl $colorapp/color
{"color":"blue", "stats": {"blue":1}}
$

The 2:1 weight of blue to green provides predictable results. Clear the histogram and run it a few times until you get a green result:

$ curl $colorapp/color/clear
cleared

$ for ((n=0;n<200;n++)); do echo "$n: $(curl -s $colorapp/color)"; done

0: {"color":"blue", "stats": {"blue":1}}
1: {"color":"green", "stats": {"blue":0.5,"green":0.5}}
2: {"color":"blue", "stats": {"blue":0.67,"green":0.33}}
3: {"color":"green", "stats": {"blue":0.5,"green":0.5}}
4: {"color":"blue", "stats": {"blue":0.6,"green":0.4}}
5: {"color":"gre
en", "stats": {"blue":0.5,"green":0.5}}
6: {"color":"blue", "stats": {"blue":0.57,"green":0.43}}
7: {"color":"blue", "stats": {"blue":0.63,"green":0.38}}
8: {"color":"green", "stats": {"blue":0.56,"green":0.44}}
...
199: {"color":"blue", "stats": {"blue":0.66,"green":0.34}}

This reflects the expected result for a 2:1 ratio. Check everything on your AWS X-Ray console.

The following screenshot shows the X-Ray console map after the initial testing.

The results look good: 100% success, no errors.

You can now increase the rollout of the new (green) version of your service running on Fargate.

Using AWS CloudFormation to manage your stacks lets you keep your configuration under version control and simplifies the process of deploying resources. AWS CloudFormation also gives you the option to update the virtual route in appmesh-colorapp.yaml and deploy the updated mesh configuration by running appmesh-colorapp.sh.

For this post, use the App Mesh console to make the change. Choose Virtual routers for appmesh-mesh, and edit the colorteller-route. Update the HTTP route so colorteller-blue-vn handles 33.3% of the traffic and colorteller-green-vn now handles 66.7%.

Run your simple verification test again:

$ curl $colorapp/color/clear
cleared
fargate $ for ((n=0;n<200;n++)); do echo "$n: $(curl -s $colorapp/color)"; done
0: {"color":"green", "stats": {"green":1}}
1: {"color":"blue", "stats": {"blue":0.5,"green":0.5}}
2: {"color":"green", "stats": {"blue":0.33,"green":0.67}}
3: {"color":"green", "stats": {"blue":0.25,"green":0.75}}
4: {"color":"green", "stats": {"blue":0.2,"green":0.8}}
5: {"color":"green", "stats": {"blue":0.17,"green":0.83}}
6: {"color":"blue", "stats": {"blue":0.29,"green":0.71}}
7: {"color":"green", "stats": {"blue":0.25,"green":0.75}}
...
199: {"color":"green", "stats": {"blue":0.32,"green":0.68}}
$

If your results look good, double-check your result in the X-Ray console.

Finally, shift 100% of your traffic over to the new colorteller version. You could use the App Mesh console again, but this time, modify the mesh configuration template and redeploy it:

appmesh-colorteller.yaml
  ColorTellerRoute:
    Type: AWS::AppMesh::Route
    DependsOn:
      - ColorTellerVirtualRouter
      - ColorTellerGreenVirtualNode
    Properties:
      MeshName: !Ref AppMeshMeshName
      VirtualRouterName: colorteller-vr
      RouteName: colorteller-route
      Spec:
        HttpRoute:
          Action:
            WeightedTargets:
              - VirtualNode: colorteller-green-vn
                Weight: 1
          Match:
            Prefix: "/"
$ ./appmesh-colorapp.sh
...
Waiting for changeset to be created..
Waiting for stack create/update to complete
...
Successfully created/updated stack - DEMO-appmesh-colorapp
$

Again, repeat your verification process in both the CLI and X-Ray to confirm that the new version of your service is running successfully.

 

Conclusion

In this walkthrough, I showed you how to roll out an update from version 1 (blue) of the colorteller service to version 2 (green). I demonstrated that App Mesh supports a mesh spanning ECS services that you ran as EC2 tasks and as Fargate tasks.

In my next walkthrough, I will demonstrate that App Mesh handles even uncontainerized services launched directly on EC2 instances. It provides a uniform and powerful way to control and monitor your distributed microservice applications on AWS.

If you have any questions or feedback, feel free to comment below.

Learning AWS App Mesh

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/learning-aws-app-mesh/

This post is contributed by Geremy Cohen | Solutions Architect, Strategic Accounts, AWS

At re:Invent 2018, AWS announced AWS App Mesh, a service mesh that provides application-level networking. App Mesh makes it easy for your services to communicate with each other across multiple types of compute infrastructure, including:

App Mesh standardizes how your services communicate, giving you end-to-end visibility and ensuring high availability for your applications. Service meshes like App Mesh help you run and monitor HTTP and TCP services at scale.

Using the open source Envoy proxy, App Mesh gives you access to a wide range of tools from AWS partners and the open source community. Because all traffic in and out of each service goes through the Envoy proxy, all traffic can be routed, shaped, measured, and logged. This extra level of indirection lets you build your services in any language desired without having to use a common set of communication libraries.

In this six-part series, I walk you through the setup and configuration of App Mesh for popular platforms and use cases, beginning with EKS. Here's the list of parts:

  1. Part 1: Introducing service meshes.
  2. Part 2: Prerequisites for running on EKS.
  3. Part 3: Creating example microservices on Amazon EKS.
  4. Part 4: Installing the sidecar injector and CRDs.
  5. Part 5: Configuring existing microservices.
  6. Part 6: Deploying with the canary technique.

Overview

Throughout the post series, I use diagrams to help describe what’s being built. In the following diagram:

  • The circle represents the container in which your app (microservice) code runs.
  • The dome alongside the circle represents the App Mesh (Envoy) proxy running as a sidecar container. When there is no dome present, no service mesh functionality is implemented for the pod.
  • The arrows show communications traffic between the application container and the proxy, as well as between the proxy and other pods.

PART 1: Introducing service meshes

Life without a service mesh

Best practices call for implementing observability, analytics, and routing capabilities across your microservice infrastructure in a consistent manner.

Between any two interacting services, it’s critical to implement logging, tracing, and metrics gathering—not to mention dynamic routing and load balancing—with minimal impact to your actual application code.

Traditionally, to provide these capabilities, you would compile each service with one or more SDKs that provided this logic. This is known as the “in-process design pattern,” because this logic runs in the same process as the service code.

When you only run a small number of services, running multiple SDKs alongside your application code may not be a huge undertaking. If you can find SDKs that provide the required functionality on the platforms and languages on which you are developing, compiling it into your service code is relatively straightforward.

As your application matures, the in-process design pattern becomes increasingly complex:

  • The number of engineers writing code grows, so each engineer must learn the in-process SDKs in use. They must also spend time integrating the SDKs with their own service logic and the service logic of others.
  • In shops where polyglot development is prevalent, as the number of engineers grow, so may the number of coding languages in use. In these scenarios, you’ll need to make sure that your SDKs are supported on these new languages.
  • The platforms that your engineering teams deploy services to may also increase and become disparate. You may have begun with Node.js containers on Kubernetes, but now, new microservices are being deployed with AWS Lambda, EC2, and other managed services. You’ll need to make sure that the SDK solution that you’ve chosen is compatible with these common platforms.
  • Even if you’re fortunate enough to have platform and language support for the SDKs you’re using, inconsistencies across the various SDK languages may creep in. This is especially true when you find a gap in language or platform support and implement custom operational logic for a language or platform that is unsupported.
  • Assuming you’ve accommodated for all the previous caveats, by using SDKs compiled into your service logic, you’re tightly coupling your business logic with your operations logic.

 

Enter the service mesh

Considering the increasing complexity as your application matures, the true value of service meshes becomes clear. With a service mesh, you can decouple your microservices’ observability, analytics, and routing logic from the underlying infrastructure and application layers.

The following diagram combines the previous two. Instead of incorporating these features at the code level (in-process), an out-of-process “sidecar proxy” container (represented by the pink dome) runs alongside your application code’s container in each pod.

 

In this model, consistent and decoupled analytics, logging, tracing, and routing logic capabilities are running alongside each microservice in your infrastructure as a sidecar proxy. Each sidecar proxy is configured by a unique configuration ruleset, based on the services it’s responsible for proxying. With 100% of the communications between pods and services proxied, 100% of the traffic is now observable and actionable.

 

App Mesh as the service mesh

App Mesh implements this sidecar proxy via the production-proven Envoy proxy. Envoy is arguably the most popular open-source service proxy. Created at Lyft in 2016, Envoy is a stable OSS project with wide community support. It’s defined as a “Graduated Project” by the Cloud Native Computing Foundation (CNCF). Envoy is a popular proxy solution due to its lightweight C++-based design, scalable architecture, and successful deployment record.

In the following diagram, a sidecar runs alongside each container in your application to provide its proxying logic, syncing each of their unique configurations from the App Mesh control plane.

Each one of these proxies must have its own unique configuration ruleset pushed to it to operate correctly. To achieve this, DevOps teams can push their intended ruleset configuration to the App Mesh API. From there, the App Mesh control plane reliably keeps all proxy instances up-to-date with their desired configurations. App Mesh dynamically scales to hundreds of thousands of pods, tasks, EC2 instances, and Lambda functions, adjusting configuration changes accordingly as instances scale up, down, and restart.
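To make that concrete, here is a minimal sketch of what pushing and reading back a ruleset through the App Mesh API looks like with the AWS CLI. The mesh name dj-app is simply the one used later in this series; these calls are illustrative and not a required step at this point.

# Push a new (empty) mesh definition to the App Mesh control plane,
# then read back the configuration the control plane is distributing.
aws appmesh create-mesh --mesh-name dj-app
aws appmesh describe-mesh --mesh-name dj-app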

 

App Mesh components

App Mesh is made up of the following components:

  • Service mesh: A logical boundary for network traffic between the services that reside within it.
  • Virtual nodes: A logical pointer to a Kubernetes service, or an App Mesh virtual service.
  • Virtual routers: Handles traffic for one or more virtual services within your mesh.
  • Routes: Associated with a virtual router, a route directs traffic that matches a service name prefix to one or more virtual nodes.
  • Virtual services: An abstraction of a real service that is either provided by a virtual node directly, or indirectly by means of a virtual router.
  • App Mesh sidecar: The App Mesh sidecar container configures your pods to use the App Mesh service mesh traffic rules set up for your virtual routers and virtual nodes.
  • App Mesh injector: Makes it easy to auto-inject the App Mesh sidecars into your pods.
  • App Mesh custom resource definitions (CRDs): Provided to implement App Mesh CRUD and configuration operations directly from the kubectl CLI (see the example after this list). Alternatively, you may use the latest version of the AWS CLI.
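Once the CRDs and controller are installed (covered in part 4), App Mesh objects behave like any other Kubernetes resources, so day-to-day operations are plain kubectl. A short sketch, assuming the objects live in a namespace called prod as in this series:

# List the App Mesh custom resources in the prod namespace
kubectl get meshes,virtualnodes,virtualservices -nprod

# Inspect a single virtual service in detail
kubectl describe virtualservice jazz -nprod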

 

In the following parts, I walk you through the setup and configuration of each of these components.

 

Conclusion of Part 1

In this first part, I discussed in detail the advantages that service meshes provide, and the specific components that make up the App Mesh service mesh. I hope the information provided helps you to understand the benefit of all service meshes, regardless of vendor.

If you’re intrigued by what you’ve learned so far, don’t stop now!

For even more background on the components of AWS App Mesh, check out the official AWS App Mesh documentation, and when you’re ready, check out part 2 in this post where I guide you through completing the prerequisite steps to run App Mesh in your own environment.

 

 

PART 2: Setting up AWS App Mesh on Amazon EKS

 

In part 1 of this series, I discussed the functionality of service meshes like AWS App Mesh provided on Kubernetes and other services. In this post, I walk you through completing the prerequisites required to install and run App Mesh in your own Amazon EKS-based Kubernetes environment.

When you have the environment set up, be sure to leave it intact if you plan on experimenting in the future with App Mesh on your own (or throughout this series of posts).

 

Prerequisites

To run App Mesh, your environment must meet the following requirements.

  • An AWS account
  • The AWS CLI installed and configured
    • The minimum supported version is 1.16.133. You should have a Region set via the aws configure command. The tutorial works in any Region where App Mesh and Amazon EKS are supported; use us-west-2 if you don't have a preference or are in doubt:
      aws configure set region us-west-2
  • The jq utility
    • The utility is required by scripts executed in this series. Make sure that it's installed on the machine from which you run the tutorial steps.
  • Kubernetes and kubectl
    • The minimum supported Kubernetes and kubectl version is 1.11. You need a Kubernetes cluster running on Amazon EC2 or an Amazon EKS cluster. Although the steps in this tutorial demonstrate using App Mesh on Amazon EKS, the instructions also work on upstream Kubernetes running on Amazon EC2. A quick version check for these tools is sketched after this list.
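A quick way to confirm these tools are present and recent enough on the machine you'll work from is a round of version checks:

aws --version              # 1.16.133 or later
jq --version
kubectl version --client   # client version 1.11 or later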

Amazon EKS makes it easy to run Kubernetes on AWS. Start by creating an EKS cluster using eksctl.  For more information about how to use eksctl to spin up an EKS cluster for this exercise, see eksworkshop.com. That site has a great tutorial for getting up and running quickly with an account, as well as an EKS cluster.
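If you don't already have a cluster, a single eksctl command is usually enough to stand one up. The cluster name and node count below are arbitrary examples, not values this series depends on:

# Create a small EKS cluster with three worker nodes
eksctl create cluster --name appmesh-demo --region us-west-2 --nodes 3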

 

Clone the tutorial repository

Clone the tutorial’s repository by issuing the following command in a directory of your choice:

git clone https://github.com/aws/aws-app-mesh-examples

Next, navigate to the repository's djapp example directory:

cd aws-app-mesh-examples/examples/apps/djapp/

All the steps in this tutorial are executed out of this directory.

 

IAM permissions for the user and k8s worker nodes

Both k8s worker nodes and any principals (including yourself) running App Mesh AWS CLI commands must have the proper permissions to access the App Mesh service, as shown in the following code example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "appmesh:DescribeMesh",
                "appmesh:DescribeVirtualNode",
                "appmesh:DescribeVirtualService",
                "appmesh:DescribeVirtualRouter",
                "appmesh:DescribeRoute",
                "appmesh:CreateMesh",
                "appmesh:CreateVirtualNode",
                "appmesh:CreateVirtualService",
                "appmesh:CreateVirtualRouter",
                "appmesh:CreateRoute",
                "appmesh:UpdateMesh",
                "appmesh:UpdateVirtualNode",
                "appmesh:UpdateVirtualService",
                "appmesh:UpdateVirtualRouter",
                "appmesh:UpdateRoute",
                "appmesh:ListMeshes",
                "appmesh:ListVirtualNodes",
                "appmesh:ListVirtualServices",
                "appmesh:ListVirtualRouters",
                "appmesh:ListRoutes",
                "appmesh:DeleteMesh",
                "appmesh:DeleteVirtualNode",
                "appmesh:DeleteVirtualService",
                "appmesh:DeleteVirtualRouter",
                "appmesh:DeleteRoute"
            ],
            "Resource": "*"
        }
    ]
}

To provide users with the correct permissions, add the previous policy to the user's role or group, or attach it to the user as an inline policy.
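For example, if you save the policy above to a file named appmesh-user-policy.json (an arbitrary name), one way to attach it to your own IAM user as an inline policy is:

aws iam put-user-policy \
  --user-name <your-iam-user-name> \
  --policy-name AppMesh-Policy-For-User \
  --policy-document file://appmesh-user-policy.json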

To verify as a user that you have the correct permissions set for App Mesh, issue the following command:

aws appmesh list-meshes

If you have the proper permissions and haven’t yet created a mesh, you should get back an empty response like the following. If you did have a mesh created, you get a slightly more verbose response.

{
    "meshes": []
}

If you do not have the proper permissions, you’ll see a response similar to the following:

An error occurred (AccessDeniedException) when calling the ListMeshes operation: User: arn:aws:iam::123abc:user/foo is not authorized to perform: appmesh:ListMeshes on resource: *

As a user, these permissions (or even the Administrator Access role) enable you to complete this tutorial, but it’s critical to implement least-privileged access for production or internet-facing deployments.

 

Adding the permissions for EKS worker nodes

If you’re using an Amazon EKS-based cluster to follow this tutorial (suggested), you can easily add the previous permissions to your k8s worker nodes with the following steps.

First, get the role under which your k8s workers are running:

INSTANCE_PROFILE_NAME=$(aws iam list-instance-profiles | jq -r '.InstanceProfiles[].InstanceProfileName' | grep nodegroup)
ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name $INSTANCE_PROFILE_NAME | jq -r '.InstanceProfile.Roles[] | .RoleName')
echo $ROLE_NAME

Upon running those commands, the echoed value of $ROLE_NAME should look similar to the following:

eksctl-blog-nodegroup-ng-1234-NodeInstanceRole-abc123

Copy and paste the following code to add the permissions as an inline policy to your worker node instances:

cat << EoF > k8s-appmesh-worker-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "appmesh:DescribeMesh",
        "appmesh:DescribeVirtualNode",
        "appmesh:DescribeVirtualService",
        "appmesh:DescribeVirtualRouter",
        "appmesh:DescribeRoute",
        "appmesh:CreateMesh",
        "appmesh:CreateVirtualNode",
        "appmesh:CreateVirtualService",
        "appmesh:CreateVirtualRouter",
        "appmesh:CreateRoute",
        "appmesh:UpdateMesh",
        "appmesh:UpdateVirtualNode",
        "appmesh:UpdateVirtualService",
        "appmesh:UpdateVirtualRouter",
        "appmesh:UpdateRoute",
        "appmesh:ListMeshes",
        "appmesh:ListVirtualNodes",
        "appmesh:ListVirtualServices",
        "appmesh:ListVirtualRouters",
        "appmesh:ListRoutes",
        "appmesh:DeleteMesh",
        "appmesh:DeleteVirtualNode",
        "appmesh:DeleteVirtualService",
        "appmesh:DeleteVirtualRouter",
        "appmesh:DeleteRoute"
  ],
      "Resource": "*"
    }
  ]
}
EoF

aws iam put-role-policy --role-name $ROLE_NAME --policy-name AppMesh-Policy-For-Worker --policy-document file://k8s-appmesh-worker-policy.json

To verify that the policy was attached to the role, run the following command:

aws iam get-role-policy --role-name $ROLE_NAME --policy-name AppMesh-Policy-For-Worker

To test that your worker nodes are able to use these permissions correctly, run the following job from the project’s directory.

NOTE: The following YAML is configured for the us-west-2 Region. If you are running your cluster and App Mesh in a different Region, modify the --region value found in the command attribute (not the image attribute) of the YAML before proceeding, as shown below:

command: ["aws","appmesh","list-meshes","--region","us-west-2"]
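If you're curious what the job manifest looks like before applying it, its general shape is sketched below. Treat this as an approximation only: the image and other details in the repo's awscli.yaml may differ.

apiVersion: batch/v1
kind: Job
metadata:
  name: awscli
spec:
  template:
    spec:
      containers:
        - name: awscli
          image: amazon/aws-cli        # hypothetical image choice; the repo's manifest defines its own
          command: ["aws", "appmesh", "list-meshes", "--region", "us-west-2"]
      restartPolicy: Never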

Execute the job by running the following command:

kubectl apply -f awscli.yaml

Make sure that the job is completed by issuing the command:

kubectl get jobs

You should see that the desired and successful values are both one:

NAME     DESIRED   SUCCESSFUL   AGE
awscli   1         1            1m

Inspect the output of the job:

kubectl logs jobs/awscli

Similar to the list-meshes call, the output of this command shows whether your nodes can make App Mesh API calls successfully.

This output shows that the workers have proper access:

{
    "meshes": []
}

While this output shows that they don’t:

An error occurred (AccessDeniedException) when calling the ListMeshes operation: User: arn:aws:iam::123abc:user/foo is not authorized to perform: appmesh:ListMeshes on resource: *

If you have to troubleshoot further, you must first delete the job before you run it again to test it:

kubectl delete jobs/awscli

After you’ve verified that you have the proper permissions set, you are ready to move forward and understand more about the demo application you’re going to build on top of App Mesh.

 

Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:

./cleanup.sh

This script does not delete any nodes in your k8s cluster. It only deletes the DJ App and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.

 

Conclusion of Part 2

In this second part of the series, I walked you through the prerequisites required to install and run App Mesh in an Amazon EKS-based Kubernetes environment. In part 3 , I show you how to create a simple microservice that can be implemented on an App Mesh service mesh.

 

 

PART 3: Creating example microservices on Amazon EKS

 

In part 2 of this series, I walked you through completing the setup steps needed to configure your environment to run AWS App Mesh. In this post, I walk you through creating three Amazon EKS-based microservices. These microservices work together to form an app called DJ App, which you use later to demonstrate App Mesh functionality.

 

Prerequisites

Make sure that you’ve completed parts 1 and 2 of this series before running through the steps in this post.

 

Overview of DJ App

I’ll now walk you through creating an example app on App Mesh called DJ App, which is used for a cloud-based music service. This application is composed of the following three microservices:

  • dj
  • metal-v1
  • jazz-v1

The dj service makes requests to either the jazz or metal backends for artist lists. If the dj service requests from the jazz backend, then musical artists such as Miles Davis or Astrud Gilberto are returned. Requests made to the metal backend return artists such as Judas Priest or Megadeth.

Today, the dj service is hardwired to make requests to the metal-v1 service for metal requests and to the jazz-v1 service for jazz requests. Each time there is a new metal or jazz release, a new version of dj must also be rolled out to point to its new upstream endpoints. Although it works for now, it’s not an optimal configuration to maintain for the long term.

App Mesh can be used to simplify this architecture. By virtualizing the metal and jazz services via kubectl or the AWS CLI, routing changes can be made dynamically to the endpoints and versions of your choosing. That minimizes the need for a complete re-deployment of DJ App each time there is a new metal or jazz service release.

 

Create the initial architecture

To begin, I’ll walk you through creating the initial application architecture. As the following diagram depicts, in the initial architecture, there are three k8s services:

  • The dj service, which serves as the DJ App entrypoint
  • The metal-v1 service backend
  • The jazz-v1 service backend

As depicted by the arrows, the dj service makes requests to either the metal-v1 or jazz-v1 backend.

First, deploy the k8s components that make up this initial architecture. To keep things organized, create a namespace for the app called prod, and deploy all of the DJ App components into that namespace. To create the prod namespace, issue the following command:

kubectl apply -f 1_create_the_initial_architecture/1_prod_ns.yaml

The output should be similar to the following:

namespace/prod created
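For reference, 1_prod_ns.yaml amounts to little more than a namespace definition; a minimal equivalent looks like the following sketch (the file in the repo is the authoritative version):

apiVersion: v1
kind: Namespace
metadata:
  name: prod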

Now that you’ve created the prod namespace, deploy the DJ App (the dj, metal, and jazz microservices) into it. Create the DJ App deployment in the prod namespace by issuing the following command:

kubectl apply -nprod -f 1_create_the_initial_architecture/1_initial_architecture_deployment.yaml

The output should be similar to:

deployment.apps "dj" created
deployment.apps "metal-v1" created
deployment.apps "jazz-v1" created

Create the services that front these deployments by issuing the following command:

kubectl apply -nprod -f 1_create_the_initial_architecture/1_initial_architecture_services.yaml

The output should be similar to:

service "dj" created
service "metal-v1" created
service "jazz-v1" created

Now, verify that everything has been set up correctly by getting all resources from the prod namespace. Issue this command:

kubectl get all -nprod

The output should display the dj, jazz, and metal pods, and the services, deployments, and replica sets, similar to the following:

NAME                            READY   STATUS    RESTARTS   AGE
pod/dj-5b445fbdf4-qf8sv         1/1     Running   0          1m
pod/jazz-v1-644856f4b4-mshnr    1/1     Running   0          1m
pod/metal-v1-84bffcc887-97qzw   1/1     Running   0          1m

NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/dj         ClusterIP   10.100.247.180   <none>        9080/TCP   15s
service/jazz-v1    ClusterIP   10.100.157.174   <none>        9080/TCP   15s
service/metal-v1   ClusterIP   10.100.187.186   <none>        9080/TCP   15s

NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/dj         1         1         1            1           1m
deployment.apps/jazz-v1    1         1         1            1           1m
deployment.apps/metal-v1   1         1         1            1           1m

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/dj-5b445fbdf4         1         1         1       1m
replicaset.apps/jazz-v1-644856f4b4    1         1         1       1m
replicaset.apps/metal-v1-84bffcc887   1         1         1       1m

When you’ve verified that all resources have been created correctly in the prod namespace, test out this initial version of DJ App. To do that, exec into the DJ pod, and issue a curl request out to the jazz-v1 and metal-v1 backends. Get the name of the DJ pod by listing all the pods with the dj app selector:

kubectl get pods -nprod -l app=dj

The output should be similar to:

NAME                  READY     STATUS    RESTARTS   AGE
dj-5b445fbdf4-8xkwp   1/1       Running   0          32s

Next, exec into the DJ pod:

kubectl exec -nprod -it <your-dj-pod-name> bash

The output should be similar to:

root@<your-dj-pod-name>:/usr/src/app#

Now that you have a root prompt into the DJ pod, issue a curl request to the jazz-v1 backend service:

curl jazz-v1.prod.svc.cluster.local:9080;echo

The output should be similar to:

["Astrud Gilberto","Miles Davis"]

Try it again, but this time issue the command to the metal-v1.prod.svc.cluster.local backend on port 9080:

curl metal-v1.prod.svc.cluster.local:9080;echo

You should get a list of heavy metal bands:

["Megadeth","Judas Priest"]

When you’re done exploring this vast world of music, press CTRL-D, or type exit to exit the container’s shell:

root@<your-dj-pod-name>:/usr/src/app# exit
command terminated with exit code 1
$

Congratulations on deploying the initial DJ App architecture!

 

Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:

./cleanup.sh

This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.

 

Conclusion of Part 3

In this third part of the series, I demonstrated how to create three simple Kubernetes-based microservices, which working together, form an app called DJ App. This app is later used to demonstrate App Mesh functionality.

In part 4, I show you how to install the App Mesh sidecar injector and CRDs, which make defining and configuring App Mesh components easy.

 

 

PART 4: Installing the sidecar injector and CRDs

 

In part 3 of this series, I walked you through setting up a basic microservices-based application called DJ App on Kubernetes with Amazon EKS. In this post, I demonstrate how to set up and configure the AWS App Mesh sidecar injector and custom resource definitions (CRDs).  As you will see later, the sidecar injector and CRD components make defining and configuring DJ App’s service mesh more convenient.

 

Prerequisites

Make sure that you’ve completed parts 1–3 of this series before running through the steps in this post.

 

Installing the App Mesh sidecar

Because its logic is decoupled from the application, an App Mesh sidecar container must run alongside the application container in each DJ App pod. This can be set up in a few different ways:

  1. Before installing the deployment, you could modify the DJ App deployment’s container specs to include App Mesh sidecar containers. When the app is deployed, it would run the sidecar.
  2. After installing the deployment, you could patch the deployment to include the sidecar container specs. Upon applying this patch, the old pods are torn down, and the new pods come up with the sidecar.
  3. You can implement the App Mesh injector controller, which watches for new pods to be created and automatically adds the sidecar data to the pods as they are deployed.

For this tutorial, I walk you through the App Mesh injector controller option, as it enables subsequent pod deployments to automatically come up with the App Mesh sidecar. This is not only quicker in the long run, but it also reduces the chances of typos that manual editing may introduce.

 

Creating the injector controller

To create the injector controller, run a script that creates a namespace, generates certificates, and then installs the injector deployment.

From the base repository directory, change to the injector directory:

cd 2_create_injector

Next, run the create.sh script:

./create.sh

The output should look similar to the following:

namespace/appmesh-inject created
creating certs in tmpdir /var/folders/02/qfw6pbm501xbw4scnk20w80h0_xvht/T/tmp.LFO95khQ
Generating RSA private key, 2048 bit long modulus
.........+++
..............................+++
e is 65537 (0x10001)
certificatesigningrequest.certificates.k8s.io/aws-app-mesh-inject.appmesh-inject created
NAME                                 AGE   REQUESTOR          CONDITION
aws-app-mesh-inject.appmesh-inject   0s    kubernetes-admin   Pending
certificatesigningrequest.certificates.k8s.io/aws-app-mesh-inject.appmesh-inject approved
secret/aws-app-mesh-inject created

processing templates
Created injector manifest at:/2_create_injector/inject.yaml

serviceaccount/aws-app-mesh-inject-sa created
clusterrole.rbac.authorization.k8s.io/aws-app-mesh-inject-cr unchanged
clusterrolebinding.rbac.authorization.k8s.io/aws-app-mesh-inject-binding configured
service/aws-app-mesh-inject created
deployment.apps/aws-app-mesh-inject created
mutatingwebhookconfiguration.admissionregistration.k8s.io/aws-app-mesh-inject unchanged

Waiting for pods to come up...

App Inject Pods and Services After Install:

NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
aws-app-mesh-inject   ClusterIP   10.100.165.254   <none>        443/TCP   16s
NAME                                   READY   STATUS    RESTARTS   AGE
aws-app-mesh-inject-5d84d8c96f-gc6bl   1/1     Running   0          16s

If you’re seeing this output, the injector controller has been installed correctly. By default, the injector doesn’t act on any pods—you must give it the criteria on what to act on. For the purpose of this tutorial, you’ll next configure it to inject the App Mesh sidecar into any new pods created in the prod namespace.

Return to the repo’s base directory:

cd ..

Run the following command to label the prod namespace:

kubectl label namespace prod appmesh.k8s.aws/sidecarInjectorWebhook=enabled

The output should be similar to the following:

namespace/prod labeled
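If you want to double-check that the label landed, list the namespace with its labels; the appmesh.k8s.aws/sidecarInjectorWebhook=enabled label should appear in the LABELS column:

kubectl get namespace prod --show-labels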

Next, verify that the injector controller is running:

kubectl get pods -nappmesh-inject

You should see output similar to the following:

NAME                                   READY   STATUS    RESTARTS   AGE
aws-app-mesh-inject-78c59cc699-9jrb4   1/1     Running   0          1h

With the injector portion of the setup complete, I’ll now show you how to create the App Mesh components.

 

Choosing a way to create the App Mesh components

There are two ways to create the components of the App Mesh service mesh:

  • Use the AWS CLI (or other AWS API tooling) to call the App Mesh APIs directly.
  • Use kubectl, together with the App Mesh custom resource definitions (CRDs) and a controller, so that mesh objects are managed as Kubernetes resources.

For this tutorial, I show you how to use kubectl to define the App Mesh components. To do this, add the CRDs and the App Mesh controller logic that syncs your Kubernetes cluster's CRD state with the AWS Cloud App Mesh control plane.

 

Adding the CRDs and App Mesh controller

To add the CRDs, run the following commands from the repository base directory:

kubectl apply -f 3_add_crds/mesh-definition.yaml
kubectl apply -f 3_add_crds/virtual-node-definition.yaml
kubectl apply -f 3_add_crds/virtual-service-definition.yaml

The output should be similar to the following:

customresourcedefinition.apiextensions.k8s.io/meshes.appmesh.k8s.aws created
customresourcedefinition.apiextensions.k8s.io/virtualnodes.appmesh.k8s.aws created
customresourcedefinition.apiextensions.k8s.io/virtualservices.appmesh.k8s.aws created

Next, add the controller by executing the following command:

kubectl apply -f 3_add_crds/controller-deployment.yaml

The output should be similar to the following:

namespace/appmesh-system created
deployment.apps/app-mesh-controller created
serviceaccount/app-mesh-sa created
clusterrole.rbac.authorization.k8s.io/app-mesh-controller created
clusterrolebinding.rbac.authorization.k8s.io/app-mesh-controller-binding created

Run the following command to verify that the App Mesh controller is running:

kubectl get pods -nappmesh-system

You should see output similar to the following:

NAME                                   READY   STATUS    RESTARTS   AGE
app-mesh-controller-85f9d4b48f-j9vz4   1/1     Running   0          7m

NOTE: The CRD and injector are AWS-supported open source projects. If you plan to deploy the CRD or injector for production projects, always build them from the latest AWS GitHub repos and deploy them from your own container registry. That way, you stay up-to-date on the latest features and bug fixes.

 

Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:

./cleanup.sh

This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.

 

Conclusion of Part 4

In this fourth part of the series, I walked you through setting up the App Mesh sidecar injector and CRD components. In part 5, I show you how to define the App Mesh components required to run DJ App on a service mesh.

 

 

PART 5: Configuring existing microservices

 

In part 4 of this series, I demonstrated how to set up the AWS App Mesh Sidecar Injector and CRDs. In this post, I’ll show how to configure the DJ App microservices to run on top of App Mesh by creating the required App Mesh components.

 

Prerequisites

Make sure that you’ve completed parts 1–4 of this series before running through the steps in this post.

 

DJ App revisited

As shown in the following diagram, the dj service is hardwired to make requests to either the metal-v1 or jazz-v1 backends.

The service mesh-enabled version functionally does exactly what the current version does. The only difference is that you use App Mesh to create two new virtual services called metal and jazz. The dj service now makes a request to these metal or jazz virtual services, which route to their metal-v1 and jazz-v1 counterparts accordingly, based on the virtual services’ routing rules. The following diagram depicts this process.

By virtualizing the metal and jazz services, you can dynamically configure routing rules to the versioned backends of your choosing. That eliminates the need to re-deploy the entire DJ App each time there’s a new metal or jazz service version release.

 

Now that you have a better idea of what you’re building, I’ll show you how to create the mesh.

 

Creating the mesh

The mesh component, which serves as the App Mesh foundation, must be created first. Call the mesh dj-app, and define it in the prod namespace by executing the following command from the repository’s base directory:

kubectl create -f 4_create_initial_mesh_components/mesh.yaml

You should see output similar to the following:

mesh.appmesh.k8s.aws/dj-app created

Because an App Mesh mesh is a custom resource, kubectl can be used to view it using the get command. Run the following command:

kubectl get meshes -nprod

This yields the following:

NAME     AGE
dj-app   1h

As is the case for any of the custom resources you interact with in this tutorial, you can also view App Mesh resources using the AWS CLI:

aws appmesh list-meshes

{
    "meshes": [
        {
            "meshName": "dj-app",
            "arn": "arn:aws:appmesh:us-west-2:123586676:mesh/dj-app"
        }
    ]
}

aws appmesh describe-mesh --mesh-name dj-app

{
    "mesh": {
        "status": {
            "status": "ACTIVE"
        },
        "meshName": "dj-app",
        "metadata": {
            "version": 1,
            "lastUpdatedAt": 1553233281.819,
            "createdAt": 1553233281.819,
            "arn": "arn:aws:appmesh:us-west-2:123586676:mesh/dj-app",
            "uid": "10d86ae0-ece7-4b1d-bc2d-08064d9b55e1"
        }
    }
}

NOTE: If you do not see dj-app returned from the previous list-meshes command, then your user account (as well as your worker nodes) may not have the correct IAM permissions to access App Mesh resources. Verify that you and your worker nodes have the correct permissions per part 2 of this series.

 

Creating the virtual nodes and virtual services

With the foundational mesh component created, continue onward to define the App Mesh virtual node and virtual service components. All physical Kubernetes services that interact with each other in App Mesh must first be defined as virtual node objects.

Abstracting out services as virtual nodes helps App Mesh build rulesets around inter-service communication. In addition, as you define virtual service objects, virtual nodes may be referenced as inputs and target endpoints for those virtual services. Because of this, it makes sense to define the virtual nodes first.

Based on the first App Mesh-enabled architecture, the physical service dj makes requests to two new virtual services—metal and jazz. These services route requests respectively to the physical services metal-v1 and jazz-v1, as shown in the following diagram.

Because there are three physical services involved in this configuration, you’ll need to define three virtual nodes. To do that, enter the following:

kubectl create -nprod -f 4_create_initial_mesh_components/nodes_representing_physical_services.yaml

The output should be similar to:

virtualnode.appmesh.k8s.aws/dj created
virtualnode.appmesh.k8s.aws/jazz-v1 created
virtualnode.appmesh.k8s.aws/metal-v1 created

If you open up the YAML in your favorite editor, you may notice a few things about these virtual nodes.

They're all similar, but for the purposes of this tutorial, examine just the metal-v1.prod.svc.cluster.local VirtualNode:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualNode
metadata:
  name: metal-v1
  namespace: prod
spec:
  meshName: dj-app
  listeners:
    - portMapping:
        port: 9080
        protocol: http
  serviceDiscovery:
    dns:
      hostName: metal-v1.prod.svc.cluster.local

...

According to this YAML, this virtual node points to a service (spec.serviceDiscovery.dns.hostName: metal-v1.prod.svc.cluster.local) that listens on a given port for requests (spec.listeners.portMapping.port: 9080).

You may notice that jazz-v1 and metal-v1 are similar to the dj virtual node, with one key difference: the dj virtual node contains a backends attribute:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualNode
metadata:
  name: dj
  namespace: prod
spec:
  meshName: dj-app
  listeners:
    - portMapping:
        port: 9080
        protocol: http
  serviceDiscovery:
    dns:
      hostName: dj.prod.svc.cluster.local
  backends:
    - virtualService:
        virtualServiceName: jazz.prod.svc.cluster.local
    - virtualService:
        virtualServiceName: metal.prod.svc.cluster.local

The backends attribute specifies that dj is allowed to make requests to the jazz and metal virtual services only.

At this point, you’ve created three virtual nodes:

kubectl get virtualnodes -nprod

NAME            AGE
dj              6m
jazz-v1         6m
metal-v1        6m

The last step is to create the two App Mesh virtual services that intercept and route requests made to jazz and metal. To do this, run the following command:

kubectl apply -nprod -f 4_create_initial_mesh_components/virtual-services.yaml

The output should be similar to:

virtualservice.appmesh.k8s.aws/jazz.prod.svc.cluster.local created
virtualservice.appmesh.k8s.aws/metal.prod.svc.cluster.local created

If you inspect the YAML, you may notice that it created two virtual service resources. Requests made to jazz.prod.svc.cluster.local are intercepted by App Mesh and routed to the virtual node jazz-v1.

Similarly, requests made to metal.prod.svc.cluster.local are routed to the virtual node metal-v1:

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualService
metadata:
  name: jazz.prod.svc.cluster.local
  namespace: prod
spec:
  meshName: dj-app
  virtualRouter:
    name: jazz-router
  routes:
    - name: jazz-route
      http:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeName: jazz-v1
              weight: 100

---
apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualService
metadata:
  name: metal.prod.svc.cluster.local
  namespace: prod
spec:
  meshName: dj-app
  virtualRouter:
    name: metal-router
  routes:
    - name: metal-route
      http:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeName: metal-v1
              weight: 100

NOTE: Remember to use fully qualified DNS names for the virtual service’s metadata.name field to prevent the chance of name collisions when using App Mesh cross-cluster.

With these virtual services defined, to access them by name, clients (in this case, the dj container) first perform a DNS lookup to jazz.prod.svc.cluster.local or metal.prod.svc.cluster.local before making the HTTP request.

If the dj container (or any other client) cannot resolve that name to an IP, the subsequent HTTP request fails with a name lookup error.

The existing physical services (jazz-v1, metal-v1, dj) are defined as physical Kubernetes services, and therefore have resolvable names:

kubectl get svc -nprod

NAME       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
dj         ClusterIP   10.100.247.180   <none>        9080/TCP   16h
jazz-v1    ClusterIP   10.100.157.174   <none>        9080/TCP   16h
metal-v1   ClusterIP   10.100.187.186   <none>        9080/TCP   16h

However, the new jazz and metal virtual services we just created don’t (yet) have resolvable names.

To provide the jazz and metal virtual services with resolvable IP addresses and hostnames, define them as Kubernetes services that do not map to any deployments or pods. Do this by creating them as k8s services without defining selectors for them. Because App Mesh is intercepting and routing requests made for them, they don’t have to map to any pods or deployments on the k8s-side.
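A selector-less placeholder service is only a few lines of YAML. The sketch below shows the general shape for the jazz placeholder; the repo's metal_and_jazz_placeholder_services.yaml is the authoritative version and may differ in detail:

apiVersion: v1
kind: Service
metadata:
  name: jazz
  namespace: prod
spec:
  # No selector: Kubernetes allocates a ClusterIP and DNS name,
  # but never routes traffic itself; the Envoy proxies handle that.
  ports:
    - port: 9080
      protocol: TCP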

To register the placeholder names and IP addresses for these virtual services, run the following command:

kubectl create -nprod -f 4_create_initial_mesh_components/metal_and_jazz_placeholder_services.yaml

The output should be similar to:

service/jazz created
service/metal created

You can now use kubectl to get the registered metal and jazz virtual services:

kubectl get -nprod virtualservices

NAME                           AGE
jazz.prod.svc.cluster.local    10m
metal.prod.svc.cluster.local   10m

You can also get the virtual service placeholder IP addresses and physical service IP addresses:

kubectl get svc -nprod

NAME       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
dj         ClusterIP   10.100.247.180   <none>        9080/TCP   17h
jazz       ClusterIP   10.100.220.118   <none>        9080/TCP   27s
jazz-v1    ClusterIP   10.100.157.174   <none>        9080/TCP   17h
metal      ClusterIP   10.100.122.192   <none>        9080/TCP   27s
metal-v1   ClusterIP   10.100.187.186   <none>        9080/TCP   17h

As a result, name lookups for the virtual services now resolve, just as they do for their physical service counterparts.

Currently, if you describe any of the pods running in the prod namespace, they are running with just one container (the same one with which you initially deployed it):

kubectl get pods -nprod

NAME                        READY   STATUS    RESTARTS   AGE
dj-5b445fbdf4-qf8sv         1/1     Running   0          3h
jazz-v1-644856f4b4-mshnr    1/1     Running   0          3h
metal-v1-84bffcc887-97qzw   1/1     Running   0          3h

kubectl describe pods/dj-5b445fbdf4-qf8sv -nprod

...
Containers:
  dj:
    Container ID:   docker://76e6d5f7101dfce60158a63cf7af9fcb3c821c087db360e87c5e2fb8850b7aa9
    Image:          970805265562.dkr.ecr.us-west-2.amazonaws.com/hello-world:latest
    Image ID:       docker-pullable://970805265562.dkr.ecr.us-west-2.amazonaws.com/hello-world@sha256:581fe44cf2413a48f0cdf005b86b025501eaff6cafc7b26367860e07be060753
    Port:           9080/TCP
    Host Port:      0/TCP
    State:          Running
...

The injector controller installed earlier watches for new pods to be created and ensures that any new pods created in the prod namespace are injected with the App Mesh sidecar. Because the DJ App pods were already running before the injector was created, you'll now force them to be re-created, this time with the sidecars auto-injected into them.

In production, there are more graceful ways to do this. For the purpose of this tutorial, an easy way to have the deployments re-create the pods in an innocuous fashion is to patch a simple date label into each deployment.

To do that with your current deployment, first get all the prod namespace pod names:

kubectl get pods -nprod

The output is the pod names:

NAME                        READY   STATUS    RESTARTS   AGE
dj-5b445fbdf4-qf8sv         1/1     Running   0          3h
jazz-v1-644856f4b4-mshnr    1/1     Running   0          3h
metal-v1-84bffcc887-97qzw   1/1     Running   0          3h

 

Under the READY column, you see 1/1, which indicates that one container is running for each pod.

Next, run the following commands to add a date label to each of the dj, jazz-v1, and metal-v1 deployments, forcing the pods to be re-created:

kubectl patch deployment dj -nprod -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}"
kubectl patch deployment metal-v1 -nprod -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}"
kubectl patch deployment jazz-v1 -nprod -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}"

Again, get the pods:

kubectl get pods -nprod

Under READY, you see 2/2, which indicates that two containers for each pod are running:

NAME                        READY   STATUS    RESTARTS   AGE
dj-6cfb85cdd9-z5hsp         2/2     Running   0          10m
jazz-v1-79d67b4fd6-hdrj9    2/2     Running   0          16s
metal-v1-769b58d9dc-7q92q   2/2     Running   0          18s

NOTE: If you don’t see this exact output, wait about 10 seconds (your redeployment is underway), and re-run the command.

Now describe the new dj pod to get more detail:

...
Containers:
  dj:
    Container ID:   docker://bef63f2e45fb911f78230ef86c2a047a56c9acf554c2272bc094300c6394c7fb
    Image:          970805265562.dkr.ecr.us-west-2.amazonaws.com/hello-world:latest
    ...
  envoy:
    Container ID:   docker://2bd0dc0707f80d436338fce399637dcbcf937eaf95fed90683eaaf5187fee43a
    Image:          111345817488.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.8.0.2-beta
    ...

Both the original container and the auto-injected sidecar are running for any new pods created in the prod namespace.

Testing the App Mesh architecture

To test if the new architecture is working as expected, exec into the dj container. Get the name of your dj pod by listing all pods with the dj selector:

kubectl get pods -nprod -lapp=dj

The output should be similar to the following:

NAME                  READY     STATUS    RESTARTS   AGE
dj-5b445fbdf4-8xkwp   1/1       Running   0          32s

Next, exec into the dj pod returned from the last step:

kubectl exec -nprod -it <your-dj-pod-name> bash

The output should be similar to:

root@<your-dj-pod-name>:/usr/src/app#

Now that you have a root prompt into the dj pod, make a curl request to the virtual service jazz on port 9080. Your request simulates what would happen if code running in the same pod made a request to the jazz backend:

curl jazz.prod.svc.cluster.local:9080;echo

The output should be similar to the following:

["Astrud Gilberto","Miles Davis"]

Try it again, but issue the command to the virtual metal service:

curl metal.prod.svc.cluster.local:9080;echo

You should get a list of heavy metal bands:

["Megadeth","Judas Priest"]

When you’re done exploring this vast, service-mesh-enabled world of music, press CTRL-D, or type exit to exit the container’s shell:

root@<your-dj-pod-name>:/usr/src/app# exit
command terminated with exit code 1
$

 

Cleaning up

When you’re done experimenting and want to delete all the resources created during this series, run the cleanup script via the following command line:

./cleanup.sh

This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own or throughout this series of posts.

Conclusion of Part 5

In this fifth part of the series, you learned how to enable existing microservices to run on App Mesh. In part 6, I demonstrate the true power of App Mesh by walking you through adding new versions of the metal and jazz services and demonstrating how to route between them.

 

 

PART 6: Deploying with the canary technique

In part 5 of this series, I demonstrated how to configure an existing microservices-based application (DJ App) to run on AWS App Mesh. In this post, I demonstrate how App Mesh can be used to deploy new versions of Amazon EKS-based microservices using the canary technique.

Prerequisites

Make sure that you’ve completed parts 1–5 of this series before running through the steps in this post.

Canary testing with v2

A canary release is a method of slowly exposing a new version of software. The theory is that by serving the new version of the software to a small percentage of requests, any problems only affect the small percentage of users before they’re discovered and rolled back.

So now, back to the DJ App scenario. Version 2 of the metal and jazz services is out, and each now includes the city that the artist is from in its response. You'll now release the v2 versions of the metal and jazz services in a canary fashion using App Mesh. When you complete this process, requests to the jazz service are split 90/10 between v1 and v2, and requests to the metal service are split 50/50 between v1 and v2.

The following diagram shows the final (v2) seven-microservices-based application, running on an App Mesh service mesh.

 

 

To begin, roll out the v2 deployments, services, and virtual nodes with a single YAML file:

kubectl apply -nprod -f 5_canary/jazz_v2.yaml

The output should be similar to the following:

deployment.apps/jazz-v2 created
service/jazz-v2 created
virtualnode.appmesh.k8s.aws/jazz-v2 created

Next, update the jazz virtual service by modifying its route to shift 10% of traffic to the new v2 version (a 90/10 split). Look at it now, and see that the current route points 100% to jazz-v1:

kubectl describe virtualservice jazz -nprod

Name:         jazz.prod.svc.cluster.local
Namespace:    prod
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:

{"apiVersion":"appmesh.k8s.aws/v1beta1","kind":"VirtualService","metadata":{"annotations":{},"name":"jazz.prod.svc.cluster.local","namesp...
API Version:  appmesh.k8s.aws/v1beta1
Kind:         VirtualService
Metadata:
  Creation Timestamp:  2019-03-23T00:15:08Z
  Generation:          3
  Resource Version:    2851527
  Self Link:           /apis/appmesh.k8s.aws/v1beta1/namespaces/prod/virtualservices/jazz.prod.svc.cluster.local
  UID:                 b76eed59-4d00-11e9-87e6-06dd752b96a6
Spec:
  Mesh Name:  dj-app
  Routes:
    Http:
      Action:
        Weighted Targets:
          Virtual Node Name:  jazz-v1
          Weight:             100
      Match:
        Prefix:  /
    Name:        jazz-route
  Virtual Router:
    Name:  jazz-router
Status:
  Conditions:
Events:  <none>
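The update you're about to apply only changes the route's weighted targets. Conceptually, the spec in 5_canary/jazz_service_update.yaml looks roughly like the following sketch (the file in the repo is the authoritative version):

apiVersion: appmesh.k8s.aws/v1beta1
kind: VirtualService
metadata:
  name: jazz.prod.svc.cluster.local
  namespace: prod
spec:
  meshName: dj-app
  virtualRouter:
    name: jazz-router
  routes:
    - name: jazz-route
      http:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeName: jazz-v1
              weight: 90
            - virtualNodeName: jazz-v2
              weight: 10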

Apply the updated service definition:

kubectl apply -nprod -f 5_canary/jazz_service_update.yaml

When you describe the virtual service again, you see the updated route:

kubectl describe virtualservice jazz -nprod

Name:         jazz.prod.svc.cluster.local
Namespace:    prod
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:

{"apiVersion":"appmesh.k8s.aws/v1beta1","kind":"VirtualService","metadata":{"annotations":{},"name":"jazz.prod.svc.cluster.local","namesp...
API Version:  appmesh.k8s.aws/v1beta1
Kind:         VirtualService
Metadata:
  Creation Timestamp:  2019-03-23T00:15:08Z
  Generation:          4
  Resource Version:    2851774
  Self Link:           /apis/appmesh.k8s.aws/v1beta1/namespaces/prod/virtualservices/jazz.prod.svc.cluster.local
  UID:                 b76eed59-4d00-11e9-87e6-06dd752b96a6
Spec:
  Mesh Name:  dj-app
  Routes:
    Http:
      Action:
        Weighted Targets:
          Virtual Node Name:  jazz-v1
          Weight:             90
          Virtual Node Name:  jazz-v2
          Weight:             10
      Match:
        Prefix:  /
    Name:        jazz-route
  Virtual Router:
    Name:  jazz-router
Status:
  Conditions:
Events:  <none>

To deploy metal-v2, perform the same steps. Roll out the v2 deployments, services, and virtual nodes with a single YAML file:

kubectl apply -nprod -f 5_canary/metal_v2.yaml

The output should be similar to the following:

deployment.apps/metal-v2 created
service/metal-v2 created
virtualnode.appmesh.k8s.aws/metal-v2 created

Update the metal virtual service by modifying the route to spread traffic 50/50 across the two versions:

kubectl apply -nprod -f 5_canary/metal_service_update.yaml

When you describe the virtual service again, you see the updated route:

kubectl describe virtualservice metal -nprod

Name:         metal.prod.svc.cluster.local
Namespace:    prod
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:

{"apiVersion":"appmesh.k8s.aws/v1beta1","kind":"VirtualService","metadata":{"annotations":{},"name":"metal.prod.svc.cluster.local","names...
API Version:  appmesh.k8s.aws/v1beta1
Kind:         VirtualService
Metadata:
  Creation Timestamp:  2019-03-23T00:15:08Z
  Generation:          2
  Resource Version:    2852282
  Self Link:           /apis/appmesh.k8s.aws/v1beta1/namespaces/prod/virtualservices/metal.prod.svc.cluster.local
  UID:                 b784e824-4d00-11e9-87e6-06dd752b96a6
Spec:
  Mesh Name:  dj-app
  Routes:
    Http:
      Action:
        Weighted Targets:
          Virtual Node Name:  metal-v1
          Weight:             50
          Virtual Node Name:  metal-v2
          Weight:             50
      Match:
        Prefix:  /
    Name:        metal-route
  Virtual Router:
    Name:  metal-router
Status:
  Conditions:
Events:  <none>

Testing the v2 jazz and metal services

Now that the v2 services are deployed, it’s time to test them out. To test if it’s working as expected, exec into the DJ pod. To do that, get the name of your dj pod by listing all pods with the dj selector:

kubectl get pods -nprod -l app=dj

The output should be similar to the following:

NAME                  READY     STATUS    RESTARTS   AGE
dj-5b445fbdf4-8xkwp   1/1       Running   0          32s

Next, exec into the DJ pod by running the following command:

kubectl exec -nprod -it <your dj pod name> bash

The output should be similar to the following:

root@<your-dj-pod-name>:/usr/src/app#

Now that you have a root prompt into the DJ pod, issue a curl request to the metal virtual service:

while [ 1 ]; do curl http://metal.prod.svc.cluster.local:9080/;echo; done

The output should loop about 50/50 between the v1 and v2 versions of the metal service, similar to:

...
["Megadeth","Judas Priest"]
["Megadeth (Los Angeles, California)","Judas Priest (West Bromwich, England)"]
["Megadeth","Judas Priest"]
["Megadeth (Los Angeles, California)","Judas Priest (West Bromwich, England)"]
...

Press CTRL-C to stop the looping.

Next, perform a similar test, but against the jazz service. Issue a curl request to the jazz virtual service from within the dj pod:

while [ 1 ]; do curl http://jazz.prod.svc.cluster.local:9080/;echo; done

The output should loop in roughly a 90/10 ratio between the v1 and v2 versions of the jazz service, similar to the following:

...
["Astrud Gilberto","Miles Davis"]
["Astrud Gilberto","Miles Davis"]
["Astrud Gilberto","Miles Davis"]
["Astrud Gilberto (Bahia, Brazil)","Miles Davis (Alton, Illinois)"]
["Astrud Gilberto","Miles Davis"]
...

Press CTRL-C to stop the looping, and then type exit to exit the pod’s shell.

Cleaning up

When you’re done experimenting and want to delete all the resources created during this tutorial series, run the cleanup script via the following command line:

./cleanup.sh

This script does not delete any nodes in your k8s cluster. It only deletes the DJ app and App Mesh components created throughout this series of posts.

Make sure to leave the cluster intact if you plan on experimenting in the future with App Mesh on your own.

Conclusion of Part 6

In this final part of the series, I demonstrated how App Mesh can be used to roll out new microservice versions using the canary technique. Feel free to experiment further with the cluster by adding or removing microservices, and tweaking routing rules by changing weights and targets.

 

Geremy is a solutions architect at AWS.  He enjoys spending time with his family, BBQing, and breaking and fixing things around the house.

 

Securing credentials using AWS Secrets Manager with AWS Fargate

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/securing-credentials-using-aws-secrets-manager-with-aws-fargate/

This post is contributed by Massimo Re Ferre – Principal Developer Advocate, AWS Container Services.

Cloud security at AWS is the highest priority, and the work that the Containers team is doing is a testament to that. A month ago, the team introduced an integration of AWS Secrets Manager and AWS Systems Manager Parameter Store with AWS Fargate tasks. Now, Fargate customers can easily and securely consume secrets and parameters from their own task definitions.

In this post, I show you an example of how to use Secrets Manager and Fargate integration to ensure that your secrets are never exposed in the wild.

Overview

AWS has engineered Fargate to be highly secure, with multiple, important security measures. One of these measures is ensuring that each Fargate task has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with other tasks.

Another area of security focus is the Amazon VPC networking integration, which ensures that tasks can be protected the way that an Amazon EC2 instance can be protected from a networking perspective.

This specific announcement, however, is important in the context of our shared responsibility model. For example, DevOps teams building and running solutions on the AWS platform require proper tooling and functionalities to securely manage secrets, passwords, and sensitive parameters at runtime in their application code. Our job is to empower them with platform capabilities to do exactly that and make it as easy as possible.

Sometimes, in a rush to get things out the door quickly, we have seen users trade off security for agility, from embedding AWS credentials in source code pushed to public repositories all the way to embedding passwords in clear text in privately stored configuration files. We have solved this problem for developers consuming various AWS services by letting them assign IAM roles to Fargate tasks so that their AWS credentials are handled transparently.

This was useful for consuming native AWS services, but what about accessing services and applications that are outside of the scope of IAM roles and IAM policies? Often, the burden of having to deal with these credentials is pushed onto the developers and AWS users in general. It doesn’t have to be this way. Enter the Secrets Manager and Fargate integration!

Starting with Fargate platform version 1.3.0 and later, it is now possible for you to instruct Fargate tasks to securely grab secrets from Secrets Manager so that these secrets are never exposed in the wild—not even in private configuration files.

In addition, this frees you from the burden of having to implement the undifferentiated heavy lifting of securing these secrets. As a bonus, because Secrets Manager supports secrets rotation, you also gain an additional level of security with no additional effort.
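Mechanically, this works at the task definition level: a container definition can declare a secrets section whose entries are injected into the container as environment variables at startup, with values fetched from Secrets Manager using the task execution role. The abbreviated sketch below shows the shape of such a definition; every value in angle brackets is a placeholder, and the snippet is illustrative rather than the exact task definition used later in this post.

{
  "family": "twitterstream-sketch",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "<arn-of-task-execution-role>",
  "containerDefinitions": [
    {
      "name": "twitterstream",
      "image": "<your-ecr-image-uri>",
      "secrets": [
        { "name": "CONSUMERKEY", "valueFrom": "<arn-of-the-secret-in-secrets-manager>" }
      ]
    }
  ]
}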

Twitter matcher example

In this example, you create a Fargate task that reads a stream of data from Twitter, matches a particular pattern in the messages, and records some information about the tweet in a DynamoDB table.

To do this, use a Python Twitter library called Tweepy to read the stream from Twitter and the AWS Boto 3 Python library to write to Amazon DynamoDB.

The following diagram shows the high-level flow:

The objective of this example is to show a simple use case where you could use IAM roles assigned to tasks to consume AWS services (such as DynamoDB). It also includes consuming external services (such as Twitter), for which explicit non-AWS credentials need to be stored securely.

This is what happens when you launch the Fargate task:

  • The task starts and inherits the task execution role (1) and the task role (2) from IAM.
  • It queries Secrets Manager (3) using the credentials inherited by the task execution role to retrieve the Twitter credentials and pass them onto the task as variables.
  • It reads the stream from Twitter (4) using the credentials that are stored in Secrets Manager.
  • It matches the stream against a configurable pattern, writes matching tweet data to the DynamoDB table (5), and logs to CloudWatch (6) using the credentials inherited by the task role.

As a side note, while this specific example uses Twitter as the external service that requires sensitive credentials, any external service that authenticates with passwords or keys works just as well. Modify the Python script as needed to capture relevant data from your own service and write it to the DynamoDB table.

Here are the solution steps:

  • Create the Python script
  • Create the Dockerfile
  • Build the container image
  • Create the image repository
  • Create the DynamoDB table
  • Store the credentials securely
  • Create the IAM roles and IAM policies for the Fargate task
  • Create the Fargate task
  • Clean up

Prerequisites

To be able to execute this exercise, you need an environment with the AWS CLI and Docker installed and configured.

You can also skip this configuration part and launch an AWS Cloud9 instance, which comes with these tools preinstalled.

For the purpose of this example, I am working with the AWS CLI, configured to work with the us-west-2 Region. You can opt to work in a different Region. Make sure that the code examples in this post are modified accordingly.

In addition to the AWS prerequisites, you need a Twitter developer account. From there, create an application and use the credentials it provides to connect to the Twitter APIs. You use these credentials later in this post when you add them to AWS Secrets Manager.

Note: many of the commands suggested in this blog post use $REGION and $AWSACCOUNT. You can either set environment variables that point to the Region you want to deploy to and to your own account, or you can replace those placeholders in the commands with the Region and account number. Some configuration files (JSON) use the same pattern; for those, the easiest option is to replace the $REGION and $AWSACCOUNT placeholders with the actual Region and account number.
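
For example, you could set them as follows (this uses the us-west-2 Region from this walkthrough; the sts get-caller-identity lookup is just one convenient way to retrieve your account ID):

export REGION=us-west-2
export AWSACCOUNT=$(aws sts get-caller-identity --query Account --output text)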

Create the Python script

This script is based on the Tweepy streaming example. I modified the script to include the Boto 3 library and instructions that write data to a DynamoDB table. In addition, the script prints the same data to standard output (to be captured in the container log).

This is the Python script:

from __future__ import absolute_import, print_function
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
import boto3
import os

# DynamoDB table name and Region
dynamoDBTable=os.environ['DYNAMODBTABLE']
region_name=os.environ['AWSREGION']
# Filter variable (the word for which to filter in your stream)
filter=os.environ['FILTER']
# Go to http://apps.twitter.com and create an app.
# The consumer key and secret are generated for you after
consumer_key=os.environ['CONSUMERKEY']
consumer_secret=os.environ['CONSUMERSECRETKEY']
# After the step above, you are redirected to your app page.
# Create an access token under the "Your access token" section
access_token=os.environ['ACCESSTOKEN']
access_token_secret=os.environ['ACCESSTOKENSECRET']

class StdOutListener(StreamListener):
    """ A listener handles tweets that are received from the stream.
    This is a basic listener that prints received tweets to stdout.
    """
    def on_data(self, data):
        j = json.loads(data)
        tweetuser = j['user']['screen_name']
        tweetdate = j['created_at']
        tweettext = j['text'].encode('ascii', 'ignore').decode('ascii')
        print(tweetuser)
        print(tweetdate)
        print(tweettext)
        dynamodb = boto3.client('dynamodb',region_name)
        dynamodb.put_item(TableName=dynamoDBTable, Item={'user':{'S':tweetuser},'date':{'S':tweetdate},'text':{'S':tweettext}})
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    stream.filter(track=[filter])

Save this file in a directory and call it twitterstream.py.

The container image requires seven parameters, which are read at the beginning of the script from environment variables:

  • The name of the DynamoDB table
  • The Region where you are operating
  • The word or pattern for which to filter
  • The four keys to use to connect to the Twitter API services. Later, I explore how to pass these variables to the container, keeping in mind that some are more sensitive than others.

Create the Dockerfile

Now onto building the actual Docker image. To do that, create a Dockerfile that contains these instructions:

FROM amazonlinux:2
RUN yum install shadow-utils.x86_64 -y
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python get-pip.py
RUN pip install tweepy
RUN pip install boto3
COPY twitterstream.py .
RUN groupadd -r twitterstream && useradd -r -g twitterstream twitterstream
USER twitterstream
CMD ["python", "-u", "twitterstream.py"]

Save it as Dockerfile in the same directory with the twitterstream.py file.

Build the container image

Next, create the container image that you later instantiate as a Fargate task. Build the container image running the following command in the same directory:

docker build -t twitterstream:latest .

Don’t overlook the period (.) at the end of the command: it sets the build context to the current directory, which is where Docker looks for the Dockerfile.

You now have a local Docker image that, after being properly parameterized, can eventually read from the Twitter APIs and save data in a DynamoDB table.
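
Once the DynamoDB table exists (you create it later in this walkthrough), you can sanity-check the image locally before pushing it by passing the seven variables on the command line. The following is only a sketch with placeholder values; a local run also needs AWS credentials that are allowed to write to the table, passed through here from your shell environment:

docker run --rm \
  -e DYNAMODBTABLE=twitterStream \
  -e AWSREGION=$REGION \
  -e FILTER="Cloud Computing" \
  -e CONSUMERKEY=<your consumer key> \
  -e CONSUMERSECRETKEY=<your consumer secret key> \
  -e ACCESSTOKEN=<your access token> \
  -e ACCESSTOKENSECRET=<your access token secret> \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  twitterstream:latest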

Create the image repository

Now, store this image in a proper container registry. Create an Amazon ECR repository with the following command:

aws ecr create-repository --repository-name twitterstream --region $REGION

You should see something like the following code example as a result:

{
"repository": {
"registryId": "012345678910",
"repositoryName": "twitterstream",
"repositoryArn": "arn:aws:ecr:us-west-2:012345678910:repository/twitterstream",
"createdAt": 1554473020.0,
"repositoryUri": "012345678910.dkr.ecr.us-west-2.amazonaws.com/twitterstream"
}
}

Tag the local image with the following command:

docker tag twitterstream:latest $AWSACCOUNT.dkr.ecr.$REGION.amazonaws.com/twitterstream:latest

Make sure that you refer to the proper repository by using your AWS account ID and the Region to which you are deploying.

Grab an authorization token for Amazon ECR:

$(aws ecr get-login --no-include-email --region $REGION)

Now, push the local image to the ECR repository that you just created:

docker push $AWSACCOUNT.dkr.ecr.$REGION.amazonaws.com/twitterstream:latest

You should see something similar to the following result:

The push refers to repository [012345678910.dkr.ecr.us-west-2.amazonaws.com/twitterstream]
435b608431c6: Pushed
86ced7241182: Pushed
e76351c39944: Pushed
e29c13e097a8: Pushed
e55573178275: Pushed
1c729a602f80: Pushed
latest: digest: sha256:010c2446dc40ef2deaedb3f344f12cd916ba0e96877f59029d047417d6cb1f95 size: 1582

Now the image is safely stored in its ECR repository.

Create the DynamoDB table

Now turn to the backend DynamoDB table. This is where you store the extract of the Twitter stream being generated. Specifically, you store the user that published the Tweet, the date when the Tweet was published, and the text of the Tweet.

For the purpose of this example, create a table called twitterStream. This can be customized as one of the parameters that you have to pass to the Fargate task.

Run this command to create the table:

aws dynamodb create-table --region $REGION --table-name twitterStream \
                          --attribute-definitions AttributeName=user,AttributeType=S AttributeName=date,AttributeType=S \
                          --key-schema AttributeName=user,KeyType=HASH AttributeName=date,KeyType=RANGE \
                          --billing-mode PAY_PER_REQUEST
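
As a quick check, you can confirm that the table is ready before moving on (it should report ACTIVE):

aws dynamodb describe-table --region $REGION --table-name twitterStream --query Table.TableStatus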

Store the credentials securely

As I hinted earlier, the Python script requires the Fargate task to pass some information as variables. You pass the table name, the Region, and the text to filter as standard task variables. Because this is not sensitive information, it can be shared without raising any concern.

However, other configurations are sensitive and should not be passed in plaintext, like the Twitter API keys. For this reason, use Secrets Manager to store that sensitive information and then read it securely from within the Fargate task. This is what the newly announced integration between Fargate and Secrets Manager allows you to accomplish.

You can use the Secrets Manager console or the CLI to store sensitive data.

If you opt to use the console, choose Other type of secrets. Under Plaintext, enter your consumer key. Under Select the encryption key, choose DefaultEncryptionKey, as shown in the following screenshot. For more information, see Creating a Basic Secret.

For this example, however, it is easier to use the AWS CLI to create the four secrets required. Run the following commands, but customize them with your own Twitter credentials:

aws secretsmanager create-secret --region $REGION --name CONSUMERKEY \
    --description "Twitter API Consumer Key" \
    --secret-string <your consumer key here> 
aws secretsmanager create-secret --region $REGION --name CONSUMERSECRETKEY \
    --description "Twitter API Consumer Secret Key" \
    --secret-string <your consumer secret key here> 
aws secretsmanager create-secret --region $REGION --name ACCESSTOKEN \
    --description "Twitter API Access Token" \
    --secret-string <your access token here> 
aws secretsmanager create-secret --region $REGION --name ACCESSTOKENSECRET \
    --description "Twitter API Access Token Secret" \
    --secret-string <your access token secret here> 

Each of those commands reports a message confirming that the secret has been created:

{
"VersionId": "7d950825-7aea-42c5-83bb-0c9b36555dbb",
"Name": "CONSUMERSECRETKEY",
"ARN": "arn:aws:secretsmanager:us-west-2:01234567890:secret:CONSUMERSECRETKEY-5D0YUM"
}

From now on, these four API keys no longer appear in any configuration.

The following screenshot shows the console after the commands have been executed:

Create the IAM roles and IAM policies for the Fargate task

To run the Python code properly, your Fargate task must have some specific capabilities. The Fargate task must be able to do the following:

  1. Pull the twitterstream container image (created earlier) from ECR.
  2. Retrieve the Twitter credentials (securely stored earlier) from Secrets Manager.
  3. Log in to a specific Amazon CloudWatch log group (logging is optional but a best practice).
  4. Write to the DynamoDB table (created earlier).

The first three capabilities should be attached to the ECS task execution role. The fourth should be attached to the ECS task role. For more information, see Amazon ECS Task Execution IAM Role.

In other words, the capabilities that are associated with the ECS agent and container instance need to be configured in the ECS task execution role. Capabilities that must be available from within the task itself are configured in the ECS task role.

First, create the two IAM roles that are eventually attached to the Fargate task.

Create a file called ecs-task-role-trust-policy.json with the following content (the trust policy contains no placeholders, so you can use it as-is):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Now, run the following commands to create the twitterstream-task-role role, as well as the twitterstream-task-execution-role:

aws iam create-role --region $REGION --role-name twitterstream-task-role --assume-role-policy-document file://ecs-task-role-trust-policy.json

aws iam create-role --region $REGION --role-name twitterstream-task-execution-role --assume-role-policy-document file://ecs-task-role-trust-policy.json

Next, create a JSON file that codifies the capabilities required for the ECS task role (twitterstream-task-role). Make sure that you replace the $REGION and $AWSACCOUNT placeholders:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:$REGION:$AWSACCOUNT:table/twitterStream"
            ]
        }
    ]
}

Save the file as twitterstream-iam-policy-task-role.json.

Now, create a JSON file that codifies the capabilities required for the ECS task execution role (twitterstream-task-execution-role). Replace the $REGION and $AWSACCOUNT placeholders, and update the secret ARNs (including the random -XXXXXX suffixes) to match the ARNs returned when you created the secrets:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:CONSUMERKEY-XXXXXX",
                "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:CONSUMERSECRETKEY-XXXXXX",
                "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:ACCESSTOKEN-XXXXXX",
                "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:ACCESSTOKENSECRET-XXXXXX"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

Save the file as twitterstream-iam-policy-task-execution-role.json.

The following two commands create IAM policy documents and associate them with the IAM roles that you created earlier:

aws iam put-role-policy --region $REGION --role-name twitterstream-task-role --policy-name twitterstream-iam-policy-task-role --policy-document file://twitterstream-iam-policy-task-role.json

aws iam put-role-policy --region $REGION --role-name twitterstream-task-execution-role --policy-name twitterstream-iam-policy-task-execution-role --policy-document file://twitterstream-iam-policy-task-execution-role.json

Create the Fargate task

Now it’s time to tie everything together. As a recap, so far you have:

  • Created the container image that contains your Python code.
  • Created the DynamoDB table where the code is going to save the extract from the Twitter stream.
  • Securely stored the Twitter API credentials in Secrets Manager.
  • Created IAM roles with specific IAM policies that can write to DynamoDB and read from Secrets Manager (among other things).

Now you can tie everything together by creating a Fargate task that executes the container image. To do so, create a file called twitterstream-task.json and populate it with the following configuration (again, replace the $REGION and $AWSACCOUNT placeholders and the secret ARN suffixes with your own values):

{
    "family": "twitterstream", 
    "networkMode": "awsvpc", 
    "executionRoleArn": "arn:aws:iam::$AWSACCOUNT:role/twitterstream-task-execution-role",
    "taskRoleArn": "arn:aws:iam::$AWSACCOUNT:role/twitterstream-task-role",
    "containerDefinitions": [
        {
            "name": "twitterstream", 
            "image": "$AWSACCOUNT.dkr.ecr.$REGION.amazonaws.com/twitterstream:latest", 
            "essential": true,
            "environment": [
                {
                    "name": "DYNAMODBTABLE",
                    "value": "twitterStream"
                },
                {
                    "name": "AWSREGION",
                    "value": "$REGION"
                },                
                {
                    "name": "FILTER",
                    "value": "Cloud Computing"
                }
            ],    
            "secrets": [
                {
                    "name": "CONSUMERKEY",
                    "valueFrom": "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:CONSUMERKEY-XXXXXX"
                },
                {
                    "name": "CONSUMERSECRETKEY",
                    "valueFrom": "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:CONSUMERSECRETKEY-XXXXXX"
                },
                {
                    "name": "ACCESSTOKEN",
                    "valueFrom": "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:ACCESSTOKEN-XXXXXX"
                },
                {
                    "name": "ACCESSTOKENSECRET",
                    "valueFrom": "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:ACCESSTOKENSECRET-XXXXXX"
                }
            ],
            "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                            "awslogs-group": "twitterstream",
                            "awslogs-region": "$REGION",
                            "awslogs-stream-prefix": "twitterstream"
                    }
            }
        }
    ], 
    "requiresCompatibilities": [
        "FARGATE"
    ], 
    "cpu": "256", 
    "memory": "512"
}

To tweak the search string, change the value of the FILTER variable (currently set to “Cloud Computing”).

The Twitter API credentials are never exposed in clear text in these configuration files. There is only a reference to the Amazon Resource Names (ARNs) of the secrets. For example, this is the CONSUMERKEY system variable in the Fargate task configuration:

"secrets": [
                {
                    "name": "CONSUMERKEY",
                    "valueFrom": "arn:aws:secretsmanager:$REGION:$AWSACCOUNT:secret:CONSUMERKEY-XXXXXX"
                }

This directive asks the ECS agent running on the Fargate instance (that has assumed the specified IAM execution role) to do the following:

  • Connect to Secrets Manager.
  • Get the secret securely.
  • Assign its value to the CONSUMERKEY system variable to be made available to the Fargate task.

Register this task by running the following command:

aws ecs register-task-definition --region $REGION --cli-input-json file://twitterstream-task.json
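
As a quick verification, you can describe the task definition you just registered and note the revision number (you need it when you run the task):

aws ecs describe-task-definition --region $REGION --task-definition twitterstream --query taskDefinition.revision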

In preparation to run the task, create the CloudWatch log group with the following command:

aws logs create-log-group --log-group-name twitterstream --region $REGION

If you don’t create the log group upfront, the task fails to start.

Create the ECS cluster

The last step before launching the Fargate task is creating an ECS cluster. An ECS cluster has two distinct dimensions:

  • The EC2 dimension, where the compute capacity is managed by the customer as ECS container instances.
  • The Fargate dimension, where the compute capacity is managed transparently by AWS.

For this example, you use the Fargate dimension, so you are essentially using the ECS cluster as a logical namespace.

Run the following command to create a cluster called twitterstream_cluster (change the name as needed). If you have a default cluster already created in your Region of choice, you can use that, too.

aws ecs create-cluster --cluster-name "twitterstream_cluster" --region $REGION

Now launch the task in the ECS cluster just created (in the us-west-2 Region) with a Fargate launch type. Run the following command:

aws ecs run-task --region $REGION \
  --cluster "twitterstream_cluster" \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=["subnet-6a88e013","subnet-6a88e013"],securityGroups=["sg-7b45660a"],assignPublicIp=ENABLED}" \
  --task-definition twitterstream:1

A few things to pay attention to with this command:

  • If you created more than one revision of the task (by re-running the aws ecs register-task-definition command), make sure to run the aws ecs run-task command with the proper revision number at the end.
  • Customize the network section of the command for your own environment:
    • Use the default security group in your VPC, as the Fargate task only needs outbound connectivity.
    • Use two public subnets in which to start the Fargate task.

The Fargate task comes up in a few seconds and you can see it from the ECS console, as shown in the following screenshot:

Similarly, the DynamoDB table starts being populated with the information collected by the script running in the task, as shown in the following screenshot:

Finally, the Fargate task logs all the activities in the CloudWatch Log group, as shown in the following screenshot:

The log may take a few minutes to populate and be consolidated in CloudWatch.
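
If you prefer the CLI over the console, one way to peek at the data being collected is a small scan of the table (keep the item limit low, because a scan reads the whole table):

aws dynamodb scan --region $REGION --table-name twitterStream --max-items 5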

Clean up

Now that you have completed the walkthrough, you can tear down all the resources that you created to avoid incurring future charges.

First, stop the ECS task that you started:

aws ecs stop-task --cluster twitterstream_cluster --region $REGION --task 4553111a-748e-4f6f-beb5-f95242235fb5

Your task number is different. You can grab it either from the ECS console or from the AWS CLI. This is how you read it from the AWS CLI:

aws ecs list-tasks --cluster twitterstream_cluster --family twitterstream --region $REGION
{
    "taskArns": [
        "arn:aws:ecs:us-west-2:693935722839:task/4553111a-748e-4f6f-beb5-f95242235fb5"
    ]
}
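
If you have jq installed, you can combine the two steps and stop the task in one go (a sketch, assuming a single running task in the family):

TASK_ARN=$(aws ecs list-tasks --cluster twitterstream_cluster --family twitterstream --region $REGION | jq -r '.taskArns[0]')
aws ecs stop-task --cluster twitterstream_cluster --region $REGION --task $TASK_ARN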

Then, delete the ECS cluster that you created:

aws ecs delete-cluster --cluster "twitterstream_cluster" --region $REGION

Next, delete the CloudWatch log group:

aws logs delete-log-group --log-group-name twitterstream --region $REGION

The console provides a fast workflow to delete the IAM roles. In the IAM console, choose Roles and filter your search for twitter. You should see the two roles that you created:

Select the two roles and choose Delete role.

Cleaning up the secrets created is straightforward. Run a delete-secret command for each one:

aws secretsmanager delete-secret --region $REGION --secret-id CONSUMERKEY
aws secretsmanager delete-secret --region $REGION --secret-id CONSUMERSECRETKEY
aws secretsmanager delete-secret --region $REGION --secret-id ACCESSTOKEN
aws secretsmanager delete-secret --region $REGION --secret-id ACCESSTOKENSECRET

The next step is to delete the DynamoDB table:

aws dynamodb delete-table --table-name twitterStream --region $REGION

The last step is to delete the ECR repository. By default, you cannot delete a repository that still has container images in it. To address that, add the --force directive:

aws ecr delete-repository --region $REGION --repository-name twitterstream --force

You can de-register the twitterstream task definition by following this procedure in the ECS console. The task definitions remain inactive but visible in the system.
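
If you prefer the CLI, a deregistration sketch (repeat the command for each revision that you registered; this example assumes revision 1):

aws ecs deregister-task-definition --region $REGION --task-definition twitterstream:1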

With this, you have deleted all the resources that you created.

Conclusion

In this post, I demonstrated how Fargate can interact with Secrets Manager to retrieve sensitive data (for example, Twitter API credentials). You can securely make the sensitive data available to the code running in the container inside the Fargate task.

I also demonstrated how a Fargate task with a specific IAM role can access other AWS services (for example, DynamoDB).

 

Enabling DNS resolution for Amazon EKS cluster endpoints

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/enabling-dns-resolution-for-amazon-eks-cluster-endpoints/

This post is contributed by Jeremy Cowan – Sr. Container Specialist Solution Architect, AWS

By default, when you create an Amazon EKS cluster, the Kubernetes cluster endpoint is public. While it is accessible from the internet, access to the Kubernetes cluster endpoint is restricted by AWS Identity and Access Management (IAM) and Kubernetes role-based access control (RBAC) policies.

At some point, you may need to configure the Kubernetes cluster endpoint to be private.  Changing your Kubernetes cluster endpoint access from public to private completely disables public access such that it can no longer be accessed from the internet.

In fact, a cluster that has been configured to only allow private access can only be accessed from the following:

  • The VPC where the worker nodes reside
  • Networks that have been peered with that VPC
  • A network that has been connected to AWS through AWS Direct Connect (DX) or a virtual private network (VPN)

However, the name of the Kubernetes cluster endpoint is only resolvable from the worker node VPC, for the following reasons:

  • The Amazon Route 53 private hosted zone that is created for the endpoint is only associated with the worker node VPC.
  • The private hosted zone is created in a separate AWS managed account and cannot be altered.

For more information, see Working with Private Hosted Zones.

This post explains how to use Route 53 inbound and outbound endpoints to resolve the name of the cluster endpoints when a request originates outside the worker node VPC.

Route 53 inbound and outbound endpoints

Route 53 inbound and outbound endpoints allow you to simplify the configuration of hybrid DNS.  DNS queries for AWS resources are resolved by Route 53 resolvers and DNS queries for on-premises resources are forwarded to an on-premises DNS resolver. However, you can also use these Route 53 endpoints to resolve the names of endpoints that are only resolvable from within a specific VPC, like the EKS cluster endpoint.

The following diagrams show how the solution works:

  • A Route 53 inbound endpoint is created in each worker node VPC and associated with a security group that allows inbound DNS requests from external subnets/CIDR ranges.
  • If the requests for the Kubernetes cluster endpoint originate from a peered VPC, those requests must be routed through a Route 53 outbound endpoint.
  • The outbound endpoint, like the inbound endpoint, is associated with a security group that allows inbound requests that originate from the peered VPC or from other VPCs in the Region.
  • A forwarding rule is created for each Kubernetes cluster endpoint.  This rule routes the request through the outbound endpoint to the IP addresses of the inbound endpoints in the worker node VPC, where it is resolved by Route 53.
  • The results of the DNS query for the Kubernetes cluster endpoint are then returned to the requestor.

If the request originates from an on-premises environment, you forego creating the outbound endpoints. Instead, you create a forwarding rule to forward requests for the Kubernetes cluster endpoint to the IP address of the Route 53 inbound endpoints in the worker node VPC.

Solution overview

For this solution, follow these steps:

  • Create an inbound endpoint in the worker node VPC.
  • Create an outbound endpoint in a peered VPC.
  • Create a forwarding rule for the outbound endpoint that sends requests to the Route 53 resolver for the worker node VPC.
  • Create a security group rule to allow inbound traffic from a peered network.
  • (Optional) Create a forwarding rule in your on-premises DNS for the Kubernetes cluster endpoint.

Prerequisites

EKS requires that you enable DNS hostnames and DNS resolution in each worker node VPC when you change the cluster endpoint access from public to private.  It is also a prerequisite for this solution and for all solutions that use Route 53 private hosted zones.

In addition, you need a route that connects your on-premises network or VPC with the worker node VPC.  In a multi-VPC environment, this can be accomplished by creating a peering connection between two or more VPCs and updating the route table in those VPCs. If you’re connecting from an on-premises environment across a DX or an IPsec VPN, you need a route to the worker node VPC.

Configuring the inbound endpoint

When you provision an EKS cluster, EKS automatically provisions two or more cross-account elastic network interfaces onto two different subnets in your worker node VPC.  These network interfaces are primarily used when the control plane must initiate a connection with your worker nodes, for example, when you use kubectl exec or kubectl proxy. However, they can also be used by the workers to communicate with the Kubernetes API server.

When you change the EKS endpoint access to private, EKS associates a Route 53 private hosted zone with your worker node VPC.  Within this private hosted zone, EKS creates resource records for the cluster endpoint. These records correspond to the IP addresses of the two cross-account elastic network interfaces that were created in your VPC when you provisioned your cluster.

When the IP addresses of these cross-account elastic network interfaces change, for example, when EKS replaces unhealthy control plane nodes, the resource records for the cluster endpoint are automatically updated. This allows your worker nodes to continue communicating with the cluster endpoint when you switch to private access.  If you update the cluster to enable public access and disable private access, your worker nodes revert to using the public Kubernetes cluster endpoint.

By creating a Route 53 inbound endpoint in the worker node VPC, DNS queries are sent to the VPC DNS resolver of worker node VPC.  This endpoint is now capable of resolving the cluster endpoint.

Create an inbound endpoint in the worker node VPC

  1. In the Route 53 console, choose Inbound endpoints, Create Inbound endpoint.
  2. For Endpoint Name, enter a value such as <cluster_name>InboundEndpoint.
  3. For VPC in the Region, choose the VPC ID of the worker node VPC.
  4. For Security group for this endpoint, choose a security group that allows clients or applications from other networks to access this endpoint. For an example, see the Route 53 resolver diagram shown earlier in the post.
  5. Under IP addresses section, choose an Availability Zone that corresponds to a subnet in your VPC.
  6. For IP address, choose Use an IP address that is selected automatically.
  7. Repeat steps 5 and 6 for the second IP address.
  8. Choose Submit.

Or, run the following AWS CLI command:

export DATE=$(date +%s)
export INBOUND_RESOLVER_ID=$(aws route53resolver create-resolver-endpoint --name <name> \
--direction INBOUND --creator-request-id $DATE --security-group-ids <sgs> \
--ip-addresses SubnetId=<subnetId>,Ip=<IP address> SubnetId=<subnetId>,Ip=<IP address> \
| jq -r .ResolverEndpoint.Id)
aws route53resolver list-resolver-endpoint-ip-addresses --resolver-endpoint-id \
$INBOUND_RESOLVER_ID | jq .IpAddresses[].Ip

This outputs the IP addresses assigned to the inbound endpoint.

When you are done creating the inbound endpoint, select the endpoint from the console and choose View details.  This shows you a summary of the configuration for the endpoint.  Record the two IP addresses that were assigned to the inbound endpoint, as you need them later when configuring the forwarding rule.

Connecting from a peered VPC

An outbound endpoint is used to send DNS requests that cannot be resolved “locally” to an external resolver based on a set of rules.

If you are connecting to the EKS cluster from a peered VPC, create an outbound endpoint and forwarding rule in that VPC or expose an outbound endpoint from another VPC. For more information, see Forwarding Outbound DNS Queries to Your Network.

Create an outbound endpoint

  1. In the Route 53 console, choose Outbound endpoints, Create outbound endpoint.
  2. For Endpoint name, enter a value such as <cluster_name>OutboundEndpoint.
  3. For VPC in the Region, select the VPC ID of the VPC where you want to create the outbound endpoint, for example the peered VPC.
  4. For Security group for this endpoint, choose a security group that allows clients and applications from this or other network VPCs to access this endpoint. For an example, see the Route 53 resolver diagram shown earlier in the post.
  5. Under the IP addresses section, choose an Availability Zone that corresponds to a subnet in the peered VPC.
  6. For IP address, choose Use an IP address that is selected automatically.
  7. Repeat steps 5 and 6 for the second IP address.
  8. Choose Submit.

Or, run the following AWS CLI command:

export DATE=$(date +%s)
export OUTBOUND_RESOLVER_ID=$(aws route53resolver create-resolver-endpoint --name <name> \
--direction OUTBOUND --creator-request-id $DATE --security-group-ids <sgs> \
--ip-addresses SubnetId=<subnetId>,Ip=<IP address> SubnetId=<subnetId>,Ip=<IP address> \
| jq -r .ResolverEndpoint.Id)
aws route53resolver list-resolver-endpoint-ip-addresses --resolver-endpoint-id \
$OUTBOUND_RESOLVER_ID | jq .IpAddresses[].Ip

This outputs the IP addresses that get assigned to the outbound endpoint.

Create a forwarding rule for the cluster endpoint

A forwarding rule is used to send DNS requests that cannot be resolved by the local resolver to another DNS resolver.  For this solution to work, create a forwarding rule for each cluster endpoint to resolve through the outbound endpoint. For more information, see Values That You Specify When You Create or Edit Rules.

  1. In the Route 53 console, choose Rules, Create rule.
  2. Give your rule a name, such as <cluster_name>Rule.
  3. For Rule type, choose Forward.
  4. For Domain name, type the name of the cluster endpoint for your EKS cluster.
  5. For VPCs that use this rule, select all of the VPCs to which this rule should apply.  If you have multiple VPCs that must access the cluster endpoint, include them in the list of VPCs.
  6. For Outbound endpoint, select the outbound endpoint to use to send DNS requests to the inbound endpoint of the worker node VPC.
  7. Under the Target IP addresses section, enter the IP addresses of the inbound endpoint that corresponds to the EKS endpoint that you entered in the Domain name field.
  8. Choose Submit.

Or, run the following AWS CLI command:

export DATE=$(date +%s)
aws route53resolver create-resolver-rule --name <name> --rule-type FORWARD \
--creator-request-id $DATE --domain-name <cluster_endpoint> --target-ips \
Ip=<IP of inbound endpoint>,Port=53 --resolver-endpoint-id <Id of outbound endpoint>
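
Note that the console's VPCs that use this rule step also associates the rule with your VPCs. When you create the rule with the CLI, you may need to make that association separately (a sketch, using the rule ID from the ResolverRule.Id field of the previous command's output):

aws route53resolver associate-resolver-rule --resolver-rule-id <rule id> --vpc-id <peered VPC id>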

Accessing the cluster endpoint

After creating the inbound and outbound endpoints and the DNS forwarding rule, you should be able to resolve the name of the cluster endpoints from the peered VPC.

$ dig 9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com 

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.58.amzn1 <<>> 9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7168
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com. IN A
;; ANSWER SECTION:
9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com. 60 IN A 192.168.109.77
9FF86DB0668DC670F27F426024E7CDBD.sk1.us-east-1.eks.amazonaws.com. 60 IN A 192.168.74.42
;; Query time: 12 msec
;; SERVER: 172.16.0.2#53(172.16.0.2)
;; WHEN: Mon Apr 8 22:39:05 2019
;; MSG SIZE rcvd: 114

Before you can access the cluster endpoint, you must add the IP address range of the peered VPCs to the EKS control plane security group. For more information, see Tutorial: Creating a VPC with Public and Private Subnets for Your Amazon EKS Cluster.

Add a rule to the EKS cluster control plane security group

  1. In the EC2 console, choose Security Groups.
  2. Find the security group associated with the EKS cluster control plane.  If you used eksctl to provision your cluster, the security group is named as follows: eksctl-<cluster_name>-cluster/ControlPlaneSecurityGroup.
  3. Add a rule that allows port 443 inbound from the CIDR range of the peered VPC.
  4. Choose Save.
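
Alternatively, a CLI sketch for the same rule (assuming you know the control plane security group ID and the CIDR range of the peered VPC):

aws ec2 authorize-security-group-ingress --group-id <control plane security group id> \
  --protocol tcp --port 443 --cidr <peered VPC CIDR>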

Run kubectl

With the proper security group rule in place, you should now be able to issue kubectl commands from a machine in the peered VPC against the cluster endpoint.

$ kubectl get nodes
NAME                             STATUS    ROLES     AGE       VERSION
ip-192-168-18-187.ec2.internal   Ready     <none>    22d       v1.11.5
ip-192-168-61-233.ec2.internal   Ready     <none>    22d       v1.11.5

Connecting from an on-premises environment

To manage your EKS cluster from your on-premises environment, configure a forwarding rule in your on-premises DNS to forward DNS queries to the inbound endpoint of the worker node VPCs. I’ve provided brief descriptions for how to do this for BIND, dnsmasq, and Windows DNS below.

Create a forwarding zone in BIND for the cluster endpoint

Add the following to the BIND configuration file:

zone "<cluster endpoint FQDN>" {
    type forward;
    forwarders { <inbound endpoint IP #1>; <inbound endpoint IP #2>; };
};

Create a forwarding zone in dnsmasq for the cluster endpoint

If you’re using dnsmasq, add the --server=/<cluster endpoint FQDN>/<inbound endpoint IP> flag to the startup options.

Create a forwarding zone in Windows DNS for the cluster endpoint

If you’re using Windows DNS, create a conditional forwarder.  Use the cluster endpoint FQDN for the DNS domain and the IPs of the inbound endpoints for the IP addresses of the servers to which to forward the requests.

Add a security group rule to the cluster control plane

Follow the steps in Adding A Rule To The EKS Cluster Control Plane Security Group. This time, use the CIDR of your on-premises network instead of the peered VPC.

Conclusion

When you configure the EKS cluster endpoint to be private only, its name can only be resolved from the worker node VPC. To manage the cluster from another VPC or your on-premises network, you can use the solution outlined in this post to create an inbound resolver for the worker node VPC.

This inbound endpoint is a feature that allows your DNS resolvers to easily resolve domain names for AWS resources. That includes the private hosted zone that gets associated with your VPC when you make the EKS cluster endpoint private. For more information, see Resolving DNS Queries Between VPCs and Your Network.  As always, I welcome your feedback about this solution.

Anatomy of CVE-2019-5736: A runc container escape!

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/anatomy-of-cve-2019-5736-a-runc-container-escape/

This post is courtesy of Samuel Karp, Senior Software Development Engineer — Amazon Container Services.

On Monday, February 11, CVE-2019-5736 was disclosed.  This vulnerability is a flaw in runc, which can be exploited to escape Linux containers launched with Docker, containerd, CRI-O, or any other user of runc.  But how does it work?  Dive in!

This concern has already been addressed for AWS, and no customer action is required. For more information, see the security bulletin.

A review of Linux process management

Before I explain the vulnerability, here’s a review of some Linux basics.

  • Processes and syscalls
  • What is /proc?
  • Dynamic linking

Processes and syscalls

Processes form the core unit of running programs on Linux. Every launched program is represented by one or more processes.  Processes contain a variety of data about the running program, including a process ID (pid), a table tracking in-use memory, a pointer to the currently executing instruction, a set of descriptors for open files, and so forth.

Processes interact with the operating system to perform a variety of operations (for example, reading and writing files, taking input, communicating on the network, etc.) via system calls, or syscalls.  Syscalls can perform a variety of actions. The ones I’m interested in today involve creating other processes (typically through fork(2) or clone(2)) and changing the currently running program into something else (execve(2)).

File descriptors are how a process interacts with files, as managed by the Linux kernel.  File descriptors are short identifiers (numbers) that are passed to the appropriate syscalls for interacting with files: read(2), write(2), close(2), and so forth.

Sometimes a process wants to spawn another process.  That might be a shell running a program you typed at the terminal, a daemon that needs a helper, or even concurrent processing without threads.  When this happens, the process typically uses the fork(2) or clone(2) syscalls.

These syscalls have some differences, but they both operate by creating another copy of the currently executing process and sharing some state.  That state can include things like the memory structures (either shared memory segments or copies of the memory) and file descriptors.

After the new process is started, it’s the responsibility of both processes to figure out which one they are (am I the parent? Am I the child?). Then, they take the appropriate action.  In many cases, the appropriate action is for the child to do some setup, and then execute the execve(2) syscall.

The following example shows the use of fork(2), in pseudocode:

func main() {
    child_pid= fork();
    if (child_pid > 0) {
        // This is the parent process, since child_pid is the pid of the child
        // process.
    } else if (child_pid == 0) {
        // This is the child process. It can retrieve its own pid via getpid(2),
        // if desired.  This child process still sees all the variables in memory
        // and all the open file descriptors.
    }
}

The execve(2) syscall instructs the Linux kernel to replace the currently executing program with another program, in-place.  When called, the Linux kernel loads the new executable as specified and passes the specified arguments.  Because this is done in place, the pid is preserved and a variety of other contextual information is carried over, including environment variables, the current working directory, and any open files.

func main() {
    // execl(3) is a wrapper around the execve(2) syscall that accepts the
    // arguments for the executed program as a list.
    // The first argument to execl(3) is the path of the executable to
    // execute, which in this case is the pwd(1) utility for printing out
    // the working directory.
    // The next argument to execl(3) is the first argument passed through
    // to the new program (in a C program, this would be the first element
    // of the argv array, or argv[0]).  By convention, this is the same as
    // the path of the executable.
    // The remaining arguments to execl(3) are the other arguments visible
    // in the new program's argv array, terminated by NULL.  As you're
    // not passing any additional arguments, just pass NULL here.
    execl("/bin/pwd", "/bin/pwd", NULL);
    // Nothing after this point executes, since the running process has been
    // replaced by the new pwd(1) program.
}

Wait…open files?  By default, open files are passed across the execve(2) boundary.  This is useful in cases where the new program can’t open the file, for example if there’s a new mount covering the existing path.  This is also the mechanism by which the standard I/O streams (stdin, stdout, and stderr) are made available to the new program.

While convenient in some use cases, it’s not always desired to preserve open file descriptors in the new program. This behavior can be changed by passing the O_CLOEXEC flag to open(2) when opening the file or by setting the FD_CLOEXEC flag with fcntl(2).  Using O_CLOEXEC or FD_CLOEXEC (which are both short for close-on-exec) prevents the new program from having access to the file descriptor.

func main() {
    // open(2) opens a file.  The first argument to open is the path of the file
    // and the second argument is a bitmask of flags that describe options
    // applied to the file that's opened.  open(2) then returns a file
    // descriptor, which can be used in subsequent syscalls to represent this
    // file.
    // For this example, open /dev/urandom, which is a file containing random
    // bytes.  Pass two flags: O_RDONLY and O_CLOEXEC; O_RDONLY indicates that
    // the file should be open for reading but not writing, and O_CLOEXEC
    // indicates that the file descriptor should not pass through the execve(2)
    // boundary.
    fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
    // All valid file descriptors are positive integers, so a returned value < 0
    // indicates that an error occurred.
    if (fd < 0) {
        // perror(3) is a function to print out the last error that occurred.
        error("could not open /dev/urandom");
        // exit(3) causes a process to exit with a given exit code. Return 1
        // here to indicate that an error occurred.
        exit(1);
    }
}

What is /proc?

/proc (or proc(5)) is a pseudo-filesystem that provides access to a number of Linux kernel data structures.  Every process in Linux has a directory available for it called /proc/[pid].  This directory stores a bunch of information about the process, including the arguments it was given when the program started, the environment variables visible to it, and the open file descriptors.

The special files inside /proc/[pid]/fd describe the file descriptors that the process has open.  They look like symbolic links (symlinks), and you can see the original path of the file, but they aren’t exactly symlinks. You can pass them to open(2) even if the original path is inaccessible and get another working file descriptor.

Another file inside /proc/[pid] is called exe. This file is like the ones in /proc/[pid]/fd except that it points to the binary program that is executing inside that process.

/proc/[pid] also has a companion directory, /proc/self.  This directory is always the same as /proc/[pid] of the process that is accessing it. That is, you can always read your own /proc data from /proc/self without knowing your pid.
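
A quick way to see this from a shell (a sketch; the exact output depends on your system):

# /proc/self always refers to the process that reads it, so this prints
# the path of the readlink binary itself:
readlink /proc/self/exe

# List the open file descriptors of the current shell ($$ expands to the shell's pid):
ls -l /proc/$$/fd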

Dynamic linking

When writing programs, software developers typically use libraries—collections of previously written code intended to be reused.  Libraries can cover all sorts of things, from high-level concerns like machine learning to lower-level concerns like basic data structures or interfaces with the operating system.

In the code example above, you can see the use of a library through a call to a function defined in a library (fork).

Libraries are made available to programs through linking: a mechanism for resolving symbols (types, functions, variables, etc.) to their definition.  On Linux, programs can be statically linked, in which case all the linking is done at compile time and all symbols are fully resolved. Or they can be dynamically linked, in which case at least some symbols are unresolved until a runtime linker makes them available.

Dynamic linking makes it possible to replace some parts of the resulting code without recompiling the whole application. This is typically used for upgrading libraries to fix bugs, enhance performance, or to address security concerns.  In contrast, static linking requires re-compiling and re-linking each program that uses a given library to affect the same change.

On Linux, runtime linking is typically performed by ld-linux.so(8), which is provided by the GNU project toolchain.  Dynamically linked libraries are specified by a name embedded into the compiled binary.  This dynamic linker reads those names and then performs a search across a standard set of paths to find the associated library file (a shared object file, or .so).

The dynamic linker’s search path can be influenced by the LD_LIBRARY_PATH environment variable.  The LD_PRELOAD environment variable can tell the linker to load additional, user-specified libraries before all others. This is useful in debugging scenarios to allow selective overriding of symbols without having to rebuild a library entirely.
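
For example, a common (and benign) use of these variables from a shell looks like the following sketch; libmyhooks.so and /opt/mylibs are hypothetical names used only for illustration:

# Ask the dynamic linker to search an extra directory for shared objects:
LD_LIBRARY_PATH=/opt/mylibs ls -l

# Ask the dynamic linker to load a user-specified library before all others,
# which lets it override selected symbols in libraries loaded afterward:
LD_PRELOAD=/opt/mylibs/libmyhooks.so ls -l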

The vulnerability

Now that the cast of characters is set (fork(2), execve(2), open(2), proc(5), file descriptors, and linking), I can start talking about the vulnerability in runc.

runc is a container runtime.  Like a shell, its primary purpose is to launch other programs. However, it does so after manipulating Linux resources like cgroups, namespaces, mounts, seccomp, and capabilities to make what is referred to as a “container.”

The primary mechanism for setting up some of these resources, like namespaces, is through flags to the clone(2) syscall that take effect in the new process.  The target of the final execve(2) call is the program that the user requested. With a container, that target can be specified in the container image or through explicit arguments.

The CVE announcement states:

The vulnerability allows a malicious container to […] overwrite the host runc binary […].  The level of user interaction is being able to run any command […] as root within a container [when creating] a new container using an attacker-controlled image.

The operative parts of this are: being able to overwrite the host runc binary (that seems bad) by running a command (that’s…what runc is supposed to do…).  Note too that the vulnerability is as simple as running a command and does not require running a container with elevated privileges or running in a non-default configuration.

Don’t containers protect against this?

Containers are, in many ways, intended to isolate the host from a given workload or to isolate a given workload from the host.  One of the main mechanisms for doing this is through a separate view of the filesystem.  With a separate view, the container shouldn’t be able to access the host’s files and should only be able to see its own.  runc accomplishes this using a mount namespace and mounting the container image’s root filesystem as /.  This effectively hides the host’s filesystem.

Even with techniques like this, things can pass through the mount namespace.  For example, the /proc/cmdline file contains the running Linux kernel’s command-line parameters.  One of those parameters typically indicates the host’s root filesystem, and a container with enough access (like CAP_SYS_ADMIN) can remount the host’s root filesystem within the container’s mount namespace.

That’s not what I’m talking about today, as that requires non-default privileges to run.  The interesting thing today is that the /proc filesystem exposes a path to the original program’s file, even if that file is not located in the current mount namespace.

What makes this troublesome is that interacting with Linux primitives like namespaces typically requires you to run as root, somewhere.  In most installations involving runc (including the default configuration in Docker, Kubernetes, containerd, and CRI-O), the whole setup runs as root.

runc must be able to perform a number of operations that require elevated privileges, even if your container is limited to a much smaller set of privileges. For example, namespace creation and mounting both require the elevated capability CAP_SYS_ADMIN, and configuring the network requires the elevated capability CAP_NET_ADMIN. You might see a pattern here.

An alternative to running as root is to leverage a user namespace. User namespaces map a set of UIDs and GIDs inside the namespace (including ones that appear to be root) to a different set of UIDs and GIDs outside the namespace.  Kernel operations that are user-namespace-aware can delineate privileged actions occurring inside the user namespace from those that occur outside.

However, user namespaces are not yet widely employed and are not enabled by default. The set of kernel operations that are user-namespace-aware is still growing, and not everyone runs the newest kernel or user-space software.

So, /proc exposes a path to the original program’s file, and the process that starts the container runs as root.  What if that original program is something important that you knew would run again… like runc?

Exploiting it!

runc’s job is to run commands that you specify.  What if you specified /proc/self/exe?  It would cause runc to spawn a copy of itself, but running inside the context of the container, with the container’s namespaces, root filesystem, and so on.  For example, you could run the following command:

docker run --rm amazonlinux:2 /proc/self/exe

This, by itself, doesn’t get you far—runc doesn’t hurt itself.

Generally, runc is dynamically linked against some libraries that provide implementations for seccomp(2), SELinux, or AppArmor.  If you remember from earlier, ld-linux.so(8) searches a standard set of file paths to provide these implementations at runtime.  If you start runc again inside the container’s context, with its separate filesystem, you have the opportunity to provide other files in place of the expected library files. These can run your own code instead of the standard library code.

There’s an easier way, though.  Instead of having to make something that looks like (for example) libseccomp, you can take advantage of a different feature of the dynamic linker: LD_PRELOAD.  And because runc lets you specify environment variables along with the path of the executable to run, you can specify this environment variable, too.

With LD_PRELOAD, you can specify your own libraries to load first, ahead of the other libraries that get loaded.  Because the original libraries still get loaded, you don’t actually have to have a full implementation of their interface.  Instead, you can selectively override some common functions that you might want and omit others that you don’t.

So now you can inject code through LD_PRELOAD and you have a target to inject it into: runc, by way of /proc/self/exe.  For your code to get run, something must call it.  You could search for a target function to override, but that means inspecting runc’s code to figure out what could get called and how.  Again, there’s an easier way.  Dynamic libraries can specify a “constructor” that is run immediately when the library is loaded.

Using the “constructor” along with LD_PRELOAD and specifying the command as /proc/self/exe, you now have a way to inject code and get it to run.  That’s it, right?  You can now write to /proc/self/exe and overwrite runc!

Not so fast.

The Linux kernel does have a bit of a protection mechanism to prevent you from overwriting the currently running executable.  If you open /proc/self/exe for writing, you get -ETXTBSY.  This error code indicates that the file is busy, where “TXT” refers to the text (code) section of the binary.

You know from earlier that execve(2) is a mechanism to replace the currently running executable with another, which means that the original executable isn’t in use anymore.  So instead of just having a single library that you load with LD_PRELOAD, you also must have another executable that can do the dirty work for you, which you can execve(2).

Normally, doing this would still be unsuccessful due to file permissions. Executables are typically not world-writable.  But because runc runs as root and does not change users, the new runc process that you started through /proc/self/exe and the helper program that you executed are also run as root.

After you gain write access to the runc file descriptor and you’ve replaced the currently executing program with execve(2), you can replace runc’s content with your own.  The other software on the system continues to start runc as part of its normal operation (for example, creating new containers, stopping containers, or performing exec operations inside containers). Your code has the chance to operate instead of runc.  When it gets run this way, your code runs as root, in the host’s context instead of in the container’s context.

Now you’re done!  You’ve successfully escaped the container and have full root access.

Putting that all together, you get something like the following pseudocode:

Preload pseudocode for preload.so

func constructor(){
    // /proc/self/exe is a virtual file pointing to the currently running
    // executable.  Open it here so that it can be passed to the next
    // process invoked.  It must be opened read-only, or the kernel will fail
    // the open syscall with ETXTBSY.  You cannot gain write access to the text
    // portion of a running executable.
    fd = open("/proc/self/exe", O_RDONLY);
    if (fd < 0) {
        error("could not open /proc/self/exe");
        exit(1);
    }
    // /proc/self/fd/%d is a virtual file representing the open file descriptor
    // to /proc/self/exe, which you opened earlier.
    filename = sprintf("/proc/self/fd/%d", fd);
    // execl is a call that executes a new executable, replacing the
    // currently running process and preserving aspects like the process ID.
    // Execute the "rewrite" binary, passing it arguments representing the
    // path of the open file descriptor. Because you did not pass O_CLOEXEC when
    // opening the file, the file descriptor remains open in the replacement
    // program and retains the same descriptor.
    execl("/rewrite", "/rewrite", filename, NULL);
    // execl never returns, except on an error
    error("couldn't execl");
}

Pseudocode for the rewrite program

// rewrite is your cooperating malicious program that takes an argument
// representing a file descriptor path, reopens it as read-write, and
// replaces the contents.  rewrite expects that it is unable to open
// the file on the first try, as the kernel has not closed it yet
func main(argc, *argv[]) {
    fd = 0;
    printf("Running\n");
    for(tries = 0; tries < 10000; tries++) {
        // argv[1] is the argument that contains the path to the virtual file
        // of the read-only file descriptor
        fd = open(argv[1], O_RDWR|O_TRUNC);
        if( fd >= 0 ) {
            printf("open succeeded\n");
            break;
        } else {
            if(errno != ETXTBSY) {
                // You expect a lot of ETXTBSY, so only print when you get something else
                error("open");
            }
        }
    }
    if (fd < 0) {
        error("exhausted all open attempts");
        exit(1);
    }
    dprintf(fd, "CVE-2019-5736\n");
    printf("wrote over runc!\n");
    fflush(stdout);
}
The above code was written by Noah Meyerhans, iliana weller, and Samuel Karp.

How does the patch work?

If you try the same approach with a patched runc, you instead see that opening the file with O_RDWR is denied.  This means that the patch is working!

The runc patch operates by taking advantage of some Linux kernel features introduced in kernel 3.17, specifically a syscall called memfd_create(2).  This syscall creates a temporary memory-backed file and a file descriptor that can be used to access the file.  This file descriptor has some special semantics:  It is automatically removed when the last reference to it is dropped. It’s in memory, so that just equates to freeing the memory.  It supports another useful feature: file sealing.  File sealing allows the file to be made immutable, even to processes that are running as root.

The runc patch changes the behavior of runc so that it creates a copy of itself in one of these temporary file descriptors, and then seals it.  The next time a process launches (via fork(2)) or a process is replaced (via execve(2)), /proc/self/exe will be this sealed, memory-backed file descriptor.  When your rewrite program attempts to modify it, the Linux kernel prevents it as it’s a sealed file.

Could I have avoided being vulnerable?

Yes, a few different mechanisms were available before the patch that provided mitigation for this vulnerability.  The one that I mentioned earlier is user namespaces. Mapping to a different user namespace inside the container would mean that normal Linux file permissions would effectively prevent runc from becoming writable because the compromised process inside the container is not running as the real root user.
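For example, with Docker you can turn on user namespace remapping at the daemon level. The following is a minimal sketch (assuming a systemd-based host; merge the setting into any existing daemon.json rather than overwriting it):

# /etc/docker/daemon.json: map root inside containers to an unprivileged host user
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "userns-remap": "default"
}
EOF
# Restart the Docker daemon to pick up the change
sudo systemctl restart docker

Note that enabling remapping changes file ownership semantics for existing images and volumes, so test it before rolling it out broadly.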

Another mechanism, which is used by Google Container-Optimized OS, is to have the host’s root filesystem mounted as read-only.  A read-only mount of the runc binary itself would also prevent the runc binary from becoming writable.

SELinux, when correctly configured, may also prevent this vulnerability.

A different approach to preventing this vulnerability is to treat the Linux kernel as belonging to a single tenant, and spend your effort securing the kernel through another layer of separation.  This is typically accomplished using a hypervisor or a virtual machine.

Amazon invests heavily in this type of boundary. Amazon EC2 instances, AWS Lambda functions, and AWS Fargate tasks are secured from each other using techniques like these.  Amazon EC2 bare-metal instances leverage the next-generation Nitro platform that allows AWS to offer secure, bare-metal compute with a hardware root-of-trust.  Along with traditional hypervisors, the Firecracker virtual machine manager can be used to implement this technique with function- and container-like workloads.

Further reading

The original researchers who discovered this vulnerability have published their own post, CVE-2019-5736: Escape from Docker and Kubernetes containers to root on host. They describe how they discovered the vulnerability, as well as several other attempts.

Acknowledgements

I’d like to thank the original researchers who discovered the vulnerability for reporting responsibly and the OCI maintainers (and Aleksa Sarai specifically) who coordinated the disclosure. Thanks to Linux distribution maintainers and cloud providers who made updated packages available quickly.  I’d also like to thank the Amazonians who made it possible for AWS customers to be protected:

  • AWS Security who ran the whole process
  • Clare Liguori, Onur Filiz, Noah Meyerhans, iliana weller, and Tom Kirchner who performed analysis and validation of the patch
  • The Amazon Linux team (and iliana weller specifically) who backported the patch and built new Docker RPMs
  • The Amazon ECS, Amazon EKS, and AWS Fargate teams for making patched infrastructure available quickly
  • And all of the other AWS teams who put in extra effort to protect AWS customers

A Guide to Locally Testing Containers with Amazon ECS Local Endpoints and Docker Compose

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/a-guide-to-locally-testing-containers-with-amazon-ecs-local-endpoints-and-docker-compose/

This post is contributed by Wesley Pettit, Software Engineer at AWS.

As more companies adopt containers, developers need easy, powerful ways to test their containerized applications locally, before they deploy to AWS. Today, the containers team is releasing the first tool dedicated to this: Amazon ECS Local Container Endpoints. This is part of an ongoing open source project designed to improve the local development process for Amazon Elastic Container Service (ECS) and AWS Fargate.  This first step allows you to locally simulate the ECS Task Metadata V2 and V3 endpoints and IAM Roles for Tasks.

In this post, I will walk you through the following testing scenarios enabled by Amazon ECS Local Endpoints and Docker Compose:

  • Testing a container that needs credentials to interact with AWS Services
  • Testing a container which uses Task Metadata
  • Testing a multi-container app which uses the awsvpc or host network mode on Docker For Mac and Docker For Windows
  • Testing multiple containerized applications using local service discovery

Setup

Your local testing toolkit consists of Docker, Docker Compose, and awslabs/amazon-ecs-local-container-endpoints.  To follow along with the scenarios in this post, you will need to have locally installed the Docker Daemon, the Docker Command Line, and Docker Compose.

Once you have the dependencies installed, create a Docker Compose file called docker-compose.yml. The Compose file defines the settings needed to run your application. If you have never used Docker Compose before, check out Docker’s Getting Started with Compose tutorial. This example file defines a web application:

version: "2"
services:
  app:
    build:
      # Build an image from the Dockerfile in the current directory
      context: .
    ports:
      - 8080:80
    environment:
      PORT: "80"

Make sure to save your docker-compose.yml file: it will be needed for the rest of the scenarios.

Our First Scenario: Testing a container which needs credentials to interact with AWS Services

Say I have a container which I want to test locally that needs AWS credentials. I could accomplish this by providing credentials as environment variables on the container, but that would be a bad practice. Instead, I can use Amazon ECS Local Endpoints to safely vend credentials to a local container.

The following Docker Compose override file template defines a single container that will use credentials. This should be used along with the docker-compose.yml file you created in the setup section. Name this file docker-compose.override.yml (Docker Compose automatically uses both files).

Your docker-compose.override.yml file should look like this:

version: "2"
networks:
    # This special network is configured so that the local metadata
    # service can bind to the specific IP address that ECS uses
    # in production
    credentials_network:
        driver: bridge
        ipam:
            config:
                - subnet: "169.254.170.0/24"
                  gateway: 169.254.170.1
services:
    # This container vends credentials to your containers
    ecs-local-endpoints:
        # The Amazon ECS Local Container Endpoints Docker Image
        image: amazon/amazon-ecs-local-container-endpoints
        volumes:
          # Mount /var/run so we can access docker.sock and talk to Docker
          - /var/run:/var/run
          # Mount the shared configuration directory, used by the AWS CLI and AWS SDKs
          # On Windows, this directory can be found at "%UserProfile%\.aws"
          - $HOME/.aws/:/home/.aws/
        environment:
          # define the home folder; credentials will be read from $HOME/.aws
          HOME: "/home"
          # You can change which AWS CLI Profile is used
          AWS_PROFILE: "default"
        networks:
            credentials_network:
                # This special IP address is recognized by the AWS SDKs and AWS CLI 
                ipv4_address: "169.254.170.2"
                
    # Here we reference the application container that we are testing
    # You can test multiple containers at a time, simply duplicate this section
    # and customize it for each container, and give it a unique IP in 'credentials_network'.
    app:
        depends_on:
            - ecs-local-endpoints
        networks:
            credentials_network:
                ipv4_address: "169.254.170.3"
        environment:
          AWS_DEFAULT_REGION: "us-east-1"
          AWS_CONTAINER_CREDENTIALS_RELATIVE_URI: "/creds"

To test your container locally, run:

docker-compose up

Your container will now be running and will be using temporary credentials obtained from your default AWS Command Line Interface Profile.

NOTE: You should not use your production credentials locally. If you provide the ecs-local-endpoints with an AWS Profile that has access to your production account, then your application will be able to access/modify production resources from your local testing environment. We recommend creating separate development and production accounts.

How does this work?

In this example, we have created a User Defined Docker Bridge Network which allows the Local Container Endpoints to listen at the IP Address 169.254.170.2. We have also defined the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI on our application container. The AWS SDKs and AWS CLI are all designed to retrieve credentials by making HTTP requests to  169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI. When containers run in production on ECS, the ECS Agent vends credentials to containers via this endpoint; this is how IAM Roles for Tasks is implemented.

Amazon ECS Local Container Endpoints vends credentials to containers the same way as the ECS Agent does in production. In this case, it vends temporary credentials obtained from your default AWS CLI Profile. It can do that because it mounts your .aws folder, which contains credentials and configuration for the AWS CLI.
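To confirm that the wiring works, you can query the credentials endpoint from inside the application container. This assumes the app image contains curl and uses the service and path names from the Compose files above:

# Request temporary credentials from the local endpoints container
docker-compose exec app curl -s http://169.254.170.2/creds
# The response should be a JSON document with AccessKeyId, SecretAccessKey,
# Token, and Expiration fields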

Gotchas: Things to Keep in Mind when using ECS Local Container Endpoints and Docker Compose

  • Make sure every container in the credentials_network has a unique IP Address. If you don’t do this, Docker Compose can incorrectly try to assign 169.254.170.2 (the ecs-local-endpoints container IP) to one of the application containers. This will cause your Compose project to fail to start.
  • On Windows, replace $HOME/.aws/ in the volumes declaration for the endpoints container with the correct location of the AWS CLI configuration directory, as explained in the documentation.
  • Notice that the application container is named ‘app’ in both of the example file templates. The container names must match between your docker-compose.yml and docker-compose.override.yml: when you run docker-compose up, the two files are merged, and the settings for each container are combined by name.

Scenario Two: Testing using Task IAM Role credentials

The endpoints container image can also vend credentials from an IAM Role; this allows you to test your application locally using a Task IAM Role.

NOTE: You should not use your production Task IAM Role locally. Instead, create a separate testing role, with equivalent permissions scoped to testing resources. Modifying the trust boundary of a production role will expand its scope.

In order to use a Task IAM Role locally, you must modify its trust policy. First, get the ARN of the IAM user defined by your default AWS CLI Profile (replace default with a different Profile name if needed):

aws --profile default sts get-caller-identity

Then modify your Task IAM Role so that its trust policy includes the following statement. You can find instructions for modifying IAM Roles in the IAM Documentation.

    {
      "Effect": "Allow",
      "Principal": {
        "AWS": <ARN of the user found with get-caller-identity>
      },
      "Action": "sts:AssumeRole"
    }

To use your Task IAM Role in your docker compose file for local testing, simply change the value of the AWS container credentials relative URI environment variable on your application container:

AWS_CONTAINER_CREDENTIALS_RELATIVE_URI: "/role/<name of your role>"

For example, if your role is named ecs_task_role, then the environment variable should be set to "/role/ecs_task_role". That is all that is required; the ecs-local-endpoints container will now vend credentials obtained from assuming the task role. You can use this to validate that the permissions set on your Task IAM Role are sufficient to run your application.
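As a quick sanity check, you can request the role-based credentials directly from inside the application container (again assuming curl is available and the role is named ecs_task_role):

# Ask the local endpoints container to assume the task role and vend credentials
docker-compose exec app curl -s http://169.254.170.2/role/ecs_task_role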

Scenario Three: Testing a Container that uses Task Metadata endpoints

The Task Metadata endpoints are useful; they allow a container running on ECS to obtain information about itself at runtime. This enables many use cases; my favorite is that it allows you to obtain container resource usage metrics, as shown by this project.

With Amazon ECS Local Container Endpoints, you can locally test applications that use the Task Metadata V2 or V3 endpoints. If you want to use the V2 endpoint, the Docker Compose template shown at the beginning of this post is sufficient. If you want to use V3, simply add another environment variable to each of your application containers:

ECS_CONTAINER_METADATA_URI: "http://169.254.170.2/v3"

This is the environment variable defined by the V3 metadata spec.
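With the variable set, a container can inspect its own simulated metadata. A quick check (assuming the app image provides a shell and curl) looks like this:

# Container-level metadata for the calling container
docker-compose exec app sh -c 'curl -s "$ECS_CONTAINER_METADATA_URI"'
# Task-level metadata, including every container in the simulated task
docker-compose exec app sh -c 'curl -s "$ECS_CONTAINER_METADATA_URI/task"'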

Scenario Four: Testing an Application that uses the AWSVPC network mode

Thus far, all of our examples have involved testing containers in a bridge network. But what if you have an application that uses the awsvpc network mode? Can you test these applications locally?

Your local development machine will not have Elastic Network Interfaces. If your ECS Task consists of a single container, then the bridge network used in previous examples will suffice. However, if your application consists of multiple containers that need to communicate, then awsvpc differs significantly from bridge. As noted in the AWS Documentation:

“containers that belong to the same task can communicate over the localhost interface.”

This is one of the benefits of awsvpc; it makes inter-container communication easy. To simulate this locally, a different approach is needed.

If your local development machine is running Linux, then you are in luck. You can test your containers using the host network mode, which will allow them to all communicate over localhost. Instructions for how to set up iptables rules to allow your containers to receive credentials and metadata are documented in the ECS Local Container Endpoints Project README.

If you are like me, and do most of your development on Windows or Mac machines, then this option will not work. Docker only supports host mode on Linux. Luckily, this section describes a workaround that will allow you to locally simulate awsvpc on Docker For Mac or Docker For Windows. This also partly serves as a simulation of the host network mode, in the sense that all of your containers will be able to communicate over localhost (from a local testing standpoint, host and awsvpc are functionally the same; the key requirement is that all containers share a single network interface).

In ECS, awsvpc is implemented by first launching a single container, which we call the ‘pause container’. This container is attached to the Elastic Network Interface, and then all of the containers in your task are launched into the pause container’s network namespace. For the local simulation of awsvpc, a similar approach will be used.

First, create a Dockerfile with the following contents for the ‘local’ pause container.

FROM amazonlinux:latest
RUN yum install -y iptables

CMD iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679 \
 && iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679 \
 && iptables-save \
 && /bin/bash -c 'while true; do sleep 30; done;'

This Dockerfile defines a container image which sets some iptables rules and then sleeps forever. The routing rules will allow requests to the credentials and metadata service to be forwarded from 169.254.170.2:80 to localhost:51679, which is the port ECS Local Container Endpoints will listen at in this setup.

Build the image:

docker build -t local-pause:latest .

Now, edit your docker-compose.override.yml file so that it looks like the following:

version: "2"
services:
    ecs-local-endpoints:
        image: amazon/amazon-ecs-local-container-endpoints
        volumes:
          - /var/run:/var/run
          - $HOME/.aws/:/home/.aws/
        environment:
          ECS_LOCAL_METADATA_PORT: "51679"
          HOME: "/home"
        network_mode: container:local-pause

    app:
        depends_on:
            - ecs-local-endpoints
        network_mode: container:local-pause
        environment:
          ECS_CONTAINER_METADATA_URI: "http://169.254.170.2/v3/containers/app"
          AWS_CONTAINER_CREDENTIALS_RELATIVE_URI: "/creds"

Several important things to note:

  • ECS_LOCAL_METADATA_PORT is set to 51679; this is the port that was used in the iptables rules.
  • network_mode is set to container:local-pause for all the containers, which means that they will use the networking stack of a container named local-pause.
  • ECS_CONTAINER_METADATA_URI is set to http://169.254.170.2/v3/containers/app. This is important. In bridge mode, the local endpoints container can determine which container a request came from using the IP Address in the request. In simulated awsvpc, this will not work, since all of the containers share the same IP Address. Thus, the endpoints container supports using the container name in the request URI so that it can identify which container the request came from. In this case, the container is named app, so app is appended to the value of the environment variable. If you copy the app container configuration to add more containers to this compose file, make sure you update the value of ECS_CONTAINER_METADATA_URI for each new container.
  • Remove any port declarations from your docker-compose.yml file. These are not valid with the network_mode settings that you will be using. The text below explains how to expose ports in this simulated awsvpc network mode.

Before you run the compose file, you must launch the local-pause container. This container cannot be defined in the Docker Compose file, because in Compose there is no way to define that one container must be running before all the others. You might think that the depends_on setting would work, but this setting only determines the order in which containers are started. It is not a robust solution for this case.

One key thing to note: any ports used by your application containers must be defined on the local-pause container. You cannot define ports directly on your application containers because their network mode is set to container:local-pause. This is a limitation imposed by Docker.

Assuming that your application containers need to expose ports 8080 and 3306 (replace these with the actual ports used by your applications), run the local pause container with this command:

docker run -d -p 8080:8080 -p 3306:3306 --name local-pause --cap-add=NET_ADMIN local-pause

Then, simply run the docker compose files, and you will have containers which share a single network interface and have access to credentials and metadata!
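To verify the simulation, you can check that every container is attached to the pause container’s network namespace and that the metadata endpoint is reachable. The container names below are placeholders (Docker Compose prefixes them with your project name), and the check assumes curl is available in the image:

# Should print container:local-pause (or the pause container's ID)
docker inspect --format '{{ .HostConfig.NetworkMode }}' <your-app-container-name>
# The metadata request is redirected by iptables to the endpoints container
docker exec <your-app-container-name> curl -s http://169.254.170.2/v3/containers/app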

Scenario Five: Testing multiple applications with local Service Discovery

Thus far, all of the examples have focused on running a single containerized application locally. But what if you want to test multiple applications which run as separate Tasks in production?

Docker Compose allows you to set up DNS aliases for your containers. This allows them to talk to each other using a hostname.

For this example, return to the compose override file with a bridge network shown in scenarios one through three. Here is a docker-compose.override.yml file which implements a simple scenario. There are two applications, frontend and backend. Frontend needs to make requests to backend.

version: "2"
networks:
    credentials_network:
        driver: bridge
        ipam:
            config:
                - subnet: "169.254.170.0/24"
                  gateway: 169.254.170.1
services:
    # This container vends credentials to your containers
    ecs-local-endpoints:
        # The Amazon ECS Local Container Endpoints Docker Image
        image: amazon/amazon-ecs-local-container-endpoints
        volumes:
          - /var/run:/var/run
          - $HOME/.aws/:/home/.aws/
        environment:
          HOME: "/home"
          AWS_PROFILE: "default"
        networks:
            credentials_network:
                ipv4_address: "169.254.170.2"
                aliases:
                    - endpoints
    # settings for the containers which you are testing
    frontend:
        image: amazonlinux:latest
        command: /bin/bash -c 'while true; do sleep 30; done;'
        depends_on:
            - ecs-local-endpoints
        networks:
            credentials_network:
                ipv4_address: "169.254.170.3"
        environment:
          AWS_DEFAULT_REGION: "us-east-1"
          AWS_CONTAINER_CREDENTIALS_RELATIVE_URI: "/creds"
    backend:
        image: nginx
        networks:
            credentials_network:
                # define an alias for service discovery
                aliases:
                    - backend
                ipv4_address: "169.254.170.4"

With these settings, the frontend container can find the backend container by making requests to http://backend.
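You can confirm that the alias resolves by calling the backend from the frontend container (the amazonlinux image may need curl installed first with yum install -y curl):

# Call the backend container through its Compose network alias
docker-compose exec frontend curl -s http://backend
# The response should be the default nginx welcome page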

Conclusion

In this tutorial, you have seen how to use Docker Compose and awslabs/amazon-ecs-local-container-endpoints to test your Amazon ECS and AWS Fargate applications locally before you deploy.

You have learned how to:

  • Construct docker-compose.yml and docker-compose.override.yml files.
  • Test a container locally with temporary credentials from a local AWS CLI Profile.
  • Test a container locally using credentials from an ECS Task IAM Role.
  • Test a container locally that uses the Task Metadata Endpoints.
  • Locally simulate the awsvpc network mode.
  • Use Docker Compose service discovery to locally test multiple dependent applications.

To follow along with new developments to the local development project, you can head to the public AWS Containers Roadmap on GitHub. If you have questions, comments, or feedback, you can let the team know there!

 

Automatically update instances in an Amazon ECS cluster using the AMI ID parameter

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/automatically-update-instances-in-an-amazon-ecs-cluster-using-the-ami-id-parameter/

This post is contributed by Adam McLean – Solutions Developer at AWS and Chirill Cucereavii – Application Architect at AWS 

In this post, we show you how to automatically refresh the container instances in an active Amazon Elastic Container Service (ECS) cluster with instances built from a newly released AMI.

The Amazon ECS-optimized AMI  comes prepackaged with the ECS container agent, Docker agent, and the ecs-init upstart service. We recommend that you use the Amazon ECS-optimized AMI for your container instances unless your application requires any of the following:

  • A specific operating system
  • Custom security and monitoring agents installed
  • Root volume encryption enabled
  • A Docker version that is not yet available in the Amazon ECS-optimized AMI

Regardless of the type of AMI that you choose, AWS recommends updating your ECS container instance fleet with the latest AMI whenever possible. It’s easier than trying to patch existing instances in place.

Solution overview

In this solution, you deploy the ECS cluster and specify the cluster size, instance type, AMI ID, and other parameters. After the ECS cluster has been created and instances registered, you can update the ECS cluster with another AMI ID to trigger the following events:

  1. A new launch configuration is created using the new AMI ID.
  2. The Auto Scaling group adds one new instance using the new launch configuration. This executes the ‘Adding Instances’ process described below.
  3. The adding instances process finishes for the single new node with the new AMI. Then, the removing instances process is started against the oldest instance with the old AMI ID.
  4. After the removing nodes process is finished, steps 2 and 3 are repeated until all nodes in the cluster have been replaced.
  5. If an error is encountered during the rollout, the new launch configuration is deleted, and the old one is put back in place.

Scaling a cluster out (adding instances)

Take a closer look at each step in scaling out a cluster:

  1. A stack update changes the AMI ID parameter.
  2. CloudFormation updates the launch configuration and tells the Auto Scaling group to add an instance.
  3. Auto Scaling launches an instance using new AMI ID to join the ECS cluster.
  4. Auto Scaling invokes the Launch Lambda function.
  5. Lambda asks the ECS cluster if the newly launched instance has joined and is showing healthy.
  6. Lambda tells Auto Scaling whether the launch succeeded or failed.
  7. Auto Scaling tells CloudFormation whether the scale-up has succeeded.
  8. The stack update succeeds, or rolls back.

Scaling a cluster in (removing instances)

Take a closer look at each step in scaling in a cluster:

  1. CloudFormation tells the Auto Scaling group to remove an instance.
  2. Auto Scaling chooses an instance to be terminated.
  3. Auto Scaling invokes the Terminate Lambda function.
  4. The Lambda function performs the following tasks:
    1. Sets the instance to be terminated to DRAINING mode.
    2. Confirms that all ECS tasks are drained from the instance marked for termination.
    3. Confirms that the ECS cluster services and tasks are stable.
  5. Lambda tells Auto Scaling to proceed with termination.
  6. Auto Scaling tells CloudFormation whether the scale-in has succeeded.
  7. The stack update succeeds, or rolls back.

Solution technologies

Here are the technologies used for this solution, with more details.

  • AWS CloudFormation
  • AWS Auto Scaling
  • Amazon CloudWatch Events
  • AWS Systems Manager Parameter Store
  • AWS Lambda

AWS CloudFormation

AWS CloudFormation is used to deploy the stack, and should be used for lifecycle management. Do not directly edit Auto Scaling groups, Lambda functions, and so on. Instead, update the CloudFormation template.

This forces the resolution of the latest AMI, as well as providing an opportunity to change the size or instance type involved in the ECS cluster.

CloudFormation has rollback capabilities to return to the last known good state if errors are encountered. It is the recommended mechanism for management through the cluster’s lifecycle.

AWS Auto Scaling

For ECS, the primary scaling and rollout mechanism is AWS Auto Scaling. Auto Scaling allows you to define a desired state environment, and keep that desired state as necessary by launching and terminating instances.

When a new AMI has been selected, CloudFormation informs Auto Scaling that it should replace the existing fleet of instances. This is controlled by an Auto Scaling update policy.

This solution rolls a single instance out to the ECS cluster, then drains and terminates a single instance in response. This cycle continues until all instances in the ECS cluster have been replaced.

Auto Scaling lifecycle hooks

Auto Scaling permits the use of lifecycle hooks: code that executes when a scaling operation occurs. This solution uses a Lambda function that is informed when an instance is launched or terminated.

A lifecycle hook informs Auto Scaling whether it can proceed with the activity or if it should abandon it. In this case, the ECS cluster remains healthy and all tasks have been redistributed before allowing Auto Scaling to proceed.

Lifecycle hooks also have a timeout, in this case 3600 seconds (1 hour) before Auto Scaling gives up. If the timeout expires, the default action is to abandon the operation.

Amazon CloudWatch Events

CloudWatch Events is a mechanism for watching calls made to the AWS APIs, and then activating functions in response. This is the mechanism used to launch the Lambda functions when a lifecycle event occurs. It’s also the mechanism used to re-launch the Lambda function when it times out (Lambda maximum execution time is 15 minutes).

In this solution, four CloudWatch Events rules are created: two pick up the initial scale-up event, and two more pick up a continuation from the Lambda function.

AWS Systems Manager Parameter Store

AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management.

This solution relies on the AMI IDs stored in Parameter Store. Given a naming standard of /ami/ecs/latest, this always resolves to the latest available AMI for ECS.

CloudFormation now supports using the values stored in Parameter Store as inputs to CloudFormation templates. The template can simply be passed the value /ami/ecs/latest, and CloudFormation resolves that to the latest AMI.
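If you want to confirm what the stack will pick up before deploying, you can read the parameter back with the CLI (adjust the Region to match your deployment):

# Show the AMI ID currently stored under /ami/ecs/latest
aws ssm get-parameter \
  --name /ami/ecs/latest \
  --query "Parameter.Value" \
  --output text \
  --region us-east-1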

AWS Lambda

The Lambda functions are used to handle the Auto Scaling lifecycle hooks. They operate against the ECS cluster to confirm that it is healthy, and they inform Auto Scaling whether it can proceed or should abandon its current operation.

The functions are invoked by CloudWatch Events in response to scaling operations so they are idle unless there are changes happening in the cluster.

They’re written in Python, and use the boto3 SDK to communicate with the ECS cluster, and Auto Scaling.

The launch Lambda function waits until the instance has fully joined the ECS cluster. This is shown by the instance being marked ‘ACTIVE’ by the ECS control plane and its ECS agent status showing as connected. This means that the new instance is ready to run tasks for the cluster.

The terminate Lambda function waits until the instance has fully drained all running tasks. It also checks that all tasks and services are in a stable state before allowing Auto Scaling to terminate an instance. This assures that the instance is truly idle and the cluster is stable before the instance is removed.
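If a rollout appears stuck, the checks that the functions perform can be approximated from the command line. The following is a sketch (substitute your cluster name and container instance ARN):

# Is the new instance ACTIVE with a connected agent? (launch check)
aws ecs describe-container-instances \
  --cluster DevCluster \
  --container-instances <container-instance-arn> \
  --query "containerInstances[].{status:status,agentConnected:agentConnected,runningTasks:runningTasksCount}"

# Put an instance into DRAINING and watch its task count fall (terminate check)
aws ecs update-container-instances-state \
  --cluster DevCluster \
  --container-instances <container-instance-arn> \
  --status DRAINING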

Deployment

Before you begin deployment, you need the following:

  • An AWS account 

You need an AWS account with enough room to accommodate the additional EC2 instances required by the ECS cluster.

  • (Optional) Linux system 

Use AWS CLI and optionally JQ to deploy the solution. Although a Linux system is recommended, it’s not required.

  • IAM user

You need an IAM admin user with permissions to create IAM policies and roles and create and update CloudFormation stacks. The user must also be able to deploy the ECS cluster, Lambda functions, Systems Manager parameters, and other resources.

  • Download the code

Clone or download the project from https://github.com/awslabs/ecs-cluster-manager on GitHub:

git clone [email protected]:awslabs/ecs-cluster-manager.git

AMI ID parameter

Create an Systems Manager parameter where the desired AMI ID is stored.

The first run does not use the latest ECS optimized AMI. Later, you update the ECS cluster to the latest AMI.

Use the AMI released on 2017.09. Run the following commands to create /ami/ecs/latest parameter in Parameter Store with a corresponding AMI value.

AMI_ID=$(aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux/amzn-ami-2017.09.l-amazon-ecs-optimized --region us-east-1 --query "Parameters[].Value" --output text | jq -r .image_id)
aws ssm put-parameter \
  --overwrite \
  --name "/ami/ecs/latest" \
  --type "String" \
  --value $AMI_ID \
  --region us-east-1 \
  --profile devAdmin

Substitute us-east-1 with your desired Region.

In the AWS Management Console, choose AWS Systems Manager, Parameter Store.

You should see the /ami/ecs/latest parameter that you just created.

Select the /ami/ecs/latest parameter and make sure that the AMI ID is present in parameter value. If you are using the us-east-1 Region, you should see the following value:

ami-aff65ad2

Upload the Lambda function code to Amazon S3

The Lambda functions are too large to embed in the CloudFormation template. Therefore, they must be loaded into an S3 bucket before CloudFormation stack is created.

Assuming you’re using an S3 bucket called ecs-deployment,  copy each Lambda function zip file as follows:

cd ./ecs-cluster-manager
aws s3 cp lambda/ecs-lifecycle-hook-launch.zip s3://ecs-deployment
aws s3 cp lambda/ecs-lifecycle-hook-terminate.zip s3://ecs-deployment

Refer to these when running your CloudFormation template later so that CloudFormation knows where to find the Lambda files.

Lambda function role

The Lambda functions that execute require read permissions to EC2, write permissions to ECS, and permissions to submit a result or heartbeat to Auto Scaling.

Create a new LambdaECSScaling IAM policy in your AWS account. Use the following JSON as the policy body:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:CompleteLifecycleAction",
                "autoscaling:DescribeScalingActivities",
                "autoscaling:RecordLifecycleActionHeartbeat",
                "ecs:UpdateContainerInstancesState",
                "ecs:Describe*",
                "ecs:List*"
            ],
            "Resource": "*"
        }
    ]
}

 

Now, create a new LambdaECSScalingRole IAM role. For Trusted Entity, choose AWS Service, Lambda. Attach the following permissions policies:

  • LambdaECSScaling (created in the previous step)
  • ReadOnlyAccess (AWS managed policy)
  • AWSLambdaBasicExecutionRole (AWS managed policy)
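If you prefer the CLI to the console, a sketch along the following lines creates the policy and role. It assumes the policy JSON above has been saved locally as lambda-ecs-scaling-policy.json, and you must substitute your own account ID:

# Create the customer managed policy from the JSON shown above
aws iam create-policy \
  --policy-name LambdaECSScaling \
  --policy-document file://lambda-ecs-scaling-policy.json

# Create the role with a trust policy that lets Lambda assume it
aws iam create-role \
  --role-name LambdaECSScalingRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the custom policy plus the two AWS managed policies
aws iam attach-role-policy --role-name LambdaECSScalingRole \
  --policy-arn arn:aws:iam::<your-account-id>:policy/LambdaECSScaling
aws iam attach-role-policy --role-name LambdaECSScalingRole \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
aws iam attach-role-policy --role-name LambdaECSScalingRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole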

ECS cluster instance profile

The ECS cluster nodes must have an instance profile attached that allows them to speak to the ECS service. This profile can also contain any other permissions that they would require (Systems Manager for management and executing commands for example).

These are all AWS managed policies so you only add the role.

Create a new IAM role called EcsInstanceRole, select AWS Service → EC2 as Trusted Entity. Attach the following AWS managed permissions policies:

  • AmazonEC2RoleforSSM
  • AmazonEC2ContainerServiceforEC2Role
  • AWSLambdaBasicExecutionRole

The AWSLambdaBasicExecutionRole policy may look out of place, but this allows the instance to create new CloudWatch Logs groups. These permissions facilitate using CloudWatch Logs as the primary logging mechanism with ECS. This managed policy grants the required permissions without you needing to manage a custom role.

CloudFormation parameter file

We recommend using a parameter file for the CloudFormation template. This documents the desired parameters for the template. It is usually less error prone to do this versus using the console for inputting parameters.

There is a file called blank_parameter_file.json in the source code project. Copy this file to something new and with a more meaningful name (such as dev-cluster.json), then fill out the parameters.

The file looks like this:

[
  {
    "ParameterKey": "EcsClusterName",
    "ParameterValue": ""
  }, 
  {
    "ParameterKey": "EcsAmiParameterKey",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "IamRoleInstanceProfile",
    "ParameterValue": ""
  }, 
  {
    "ParameterKey": "EcsInstanceType",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "EbsVolumeSize",
    "ParameterValue": ""
  }, 
  {
    "ParameterKey": "ClusterSize",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "ClusterMaxSize",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "KeyName",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "SubnetIds",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "SecurityGroupIds",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "DeploymentS3Bucket",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "LifecycleLaunchFunctionZip",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "LifecycleTerminateFunctionZip",
    "ParameterValue": ""
  },
  {
    "ParameterKey": "LambdaFunctionRole",
    "ParameterValue": ""
  }
]

Here are the details for each parameter:

  • EcsClusterName:  The name of the ECS cluster to create.
  • EcsAmiParameterKey:  The Systems Manager parameter that contains the AMI ID to be used. This defaults to /ami/ecs/latest.
  • IamRoleInstanceProfile:  The name of the EC2 instance profile used by the ECS cluster members. Discussed in the prerequisite section.
  • EcsInstanceType:  The instance type to use for the cluster. Use whatever is appropriate for your workloads.
  • EbsVolumeSize:  The size of the Docker storage setup that is created using LVM. ECS typically defaults to 100 GB.
  • ClusterSize:  The desired number of EC2 instances for the cluster.
  • ClusterMaxSize:  This value should always be double the amount contained in ClusterSize. CloudFormation has no ‘math’ operators or we wouldn’t prompt for this. This allows rolling updates to be performed safely by doubling the cluster size, then contracting back.
  • KeyName:  The name of the EC2 key pair to place on the ECS instance to support SSH.
  • SubnetIds: A comma-separated list of subnet IDs that the cluster should be allowed to launch instances into. These should map to at least two zones for a resilient cluster, for example subnet-a70508df,subnet-e009eb89.
  • SecurityGroupIds:  A comma-separated list of security group IDs that are attached to each node, for example sg-bd9d1bd4,sg-ac9127dca (a single value is fine).
  • DeploymentS3Bucket: This is the bucket where the two Lambda functions for scale in/scale out lifecycle hooks can be found.
  • LifecycleLaunchFunctionZip: This is the full path within the DeploymentS3Bucket where the ecs-lifecycle-hook-launch.zip contents can be found.
  • LifecycleTerminateFunctionZip:  The full path within the DeploymentS3Bucket where the ecs-lifecycle-hook-terminate.zip contents can be found.
  • LambdaFunctionRole:  The name of the role that the Lambda functions use. Discussed in the prerequisite section.

A completed parameter file looks like the following:

[
  {
    "ParameterKey": "EcsClusterName",
    "ParameterValue": "DevCluster"
  }, 
  {
    "ParameterKey": "EcsAmiParameterKey",
    "ParameterValue": "/ami/ecs/latest"
  },
  {
    "ParameterKey": "IamRoleInstanceProfile",
    "ParameterValue": "EcsInstanceRole"
  }, 
  {
    "ParameterKey": "EcsInstanceType",
    "ParameterValue": "m4.large"
  },
  {
    "ParameterKey": "EbsVolumeSize",
    "ParameterValue": "100"
  }, 
  {
    "ParameterKey": "ClusterSize",
    "ParameterValue": "2"
  },
  {
    "ParameterKey": "ClusterMaxSize",
    "ParameterValue": "4"
  },
  {
    "ParameterKey": "KeyName",
    "ParameterValue": "dev-cluster"
  },
  {
    "ParameterKey": "SubnetIds",
    "ParameterValue": "subnet-a70508df,subnet-e009eb89"
  },
  {
    "ParameterKey": "SecurityGroupIds",
    "ParameterValue": "sg-bd9d1bd4"
  },
  {
    "ParameterKey": "DeploymentS3Bucket",
    "ParameterValue": "ecs-deployment"
  },
  {
    "ParameterKey": "LifecycleLaunchFunctionZip",
    "ParameterValue": "ecs-lifecycle-hook-launch.zip"
  },
  {
    "ParameterKey": "LifecycleTerminateFunctionZip",
    "ParameterValue": "ecs-lifecycle-hook-terminate.zip"
  },
  {
    "ParameterKey": "LambdaFunctionRole",
    "ParameterValue": "LambdaECSScalingRole"
  }
]

Deployment

Given the CloudFormation template and the parameter file, you can deploy the stack using the AWS CLI or the console.

Here’s an example deploying through the AWS CLI. This example uses a stack named ecs-dev and a parameter file named dev-cluster.json. It also uses the --profile argument to assure that the CLI assumes a role in the right account for deployment. Use the corresponding Region and profile from your local ~/.aws/config file.

This command outputs the stack ID as soon as it is executed, even though the other required resources are still being created.

aws cloudformation create-stack \
  --stack-name ecs-dev \
  --template-body file://./ecs-cluster.yaml \
  --parameters file://./dev-cluster.json \
  --region us-east-1 \
  --profile devAdmin

Use the AWS Management Console to check whether the stack is done creating. Or, run the following command:

aws cloudformation wait stack-create-complete \
  --stack-name ecs-dev \
  --region us-east-1 \
  --profile devAdmin

After the CloudFormation stack has been created, go to the ECS console and open the DevCluster cluster that you just created. There are no tasks running, although you should see two container instances registered with the cluster.

You also see a warning message indicating that the container instances are not running the latest version of Amazon ECS container agent. The reason is that you did not use the latest available version of the ECS-Optimized AMI.

Fix this issue by updating the container instances AMI.

Update the cluster instances AMI

Run the following commands to set the /ami/ecs/latest parameter to the latest AMI ID.

AMI_ID=$(aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux/recommended --region us-east-1 --query "Parameters[].Value" --output text | jq -r .image_id)

aws ssm put-parameter \
  --overwrite \
  --name "/ami/ecs/latest" \
  --type "String" \
  --value $AMI_ID \
  --region us-east-1 \
  --profile devAdmin

Make sure that the parameter value has been updated in the console.

To update your ECS cluster, run the update-stack command without changing any parameters. CloudFormation evaluates the value stored by /ami/ecs/latest. If it has changed, CloudFormation makes updates as appropriate.

aws cloudformation update-stack \
  --stack-name ecs-dev \
  --template-body file://./ecs-cluster.yaml \
  --parameters file://./dev-cluster.json \
  --region us-east-1 \
  --profile devAdmin

Supervising updates

We recommend supervising your updates to the ECS cluster while they are being deployed. This assures that the cluster remains stable. For the majority of situations, there is no manual intervention required.

  • Keep an eye on Auto Scaling activities. In the Auto Scaling groups section of the EC2 console, select the Auto Scaling group for a cluster and choose Activity History.
  • Keep an eye on the ECS instances to ensure that new instances are joining and draining instances are leaving. In the ECS console, choose Cluster, ECS Instances.
  • Lambda function logs help troubleshoot things that aren’t behaving as expected. In the Lambda console, select the LifeCycleLaunch or LifeCycleTerminate functions, and choose Monitoring, View logs in CloudWatch. Expand the logs for the latest executions and see what’s going on:

When you go back to the ECS cluster page, notice that the “Outdated Amazon ECS container agent” warning message has disappeared.

Select one of the cluster’s EC2 instance IDs and observe that the latest ECS optimized AMI is used.

Summary

In this post, you saw how to use CloudFormation, Lambda, CloudWatch Events, and Auto Scaling lifecycle hooks to update your ECS cluster instances with a new AMI.

The sample code is available on GitHub for you to use and extend. Contributions are always welcome!

 

Scheduling GPUs for deep learning tasks on Amazon ECS

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/scheduling-gpus-for-deep-learning-tasks-on-amazon-ecs/

This post is contributed by Brent Langston – Sr. Developer Advocate, Amazon Container Services

Last week, AWS announced enhanced Amazon Elastic Container Service (Amazon ECS) support for GPU-enabled EC2 instances. This means that now GPUs are first class resources that can be requested in your task definition, and scheduled on your cluster by ECS.

Previously, to schedule a GPU workload, you had to maintain your own custom configured AMI, with a custom configured Docker runtime. You also had to use custom vCPU logic as a stand-in for assigning your GPU workloads to GPU instances. Even when all that was in place, there was still no pinning of a GPU to a task. One task might consume more GPU resources than it should. This could cause other tasks to not have a GPU available.

Now, AWS maintains an ECS-optimized AMI that includes the correct NVIDIA drivers and Docker customizations. You can use this AMI to provision your GPU workloads. With this enhancement, GPUs can also be requested directly in the task definition. Like allocating CPU or RAM to a task, now you can explicitly request a number of GPUs to be allocated to your task. The scheduler looks for matching resources on the cluster to place those tasks. The GPUs are pinned to the task for as long as the task is running, and can’t be allocated to any other tasks.

I thought I’d see how easy it is to deploy GPU workloads to my ECS cluster. I’m working in the US-EAST-2 (Ohio) region, from my AWS Cloud9 IDE, so these commands work for Amazon Linux. Feel free to adapt to your environment as necessary.

If you’d like to run this example yourself, you can find all the code in this GitHub repo. If you run this example in your own account, be aware of the instance pricing, and clean up your resources when your experiment is complete.

Clone the repo using the following command:

git clone https://github.com/brentley/tensorflow-container.git

Setup

You need the latest version of the AWS CLI (for this post, I used 1.16.98):

echo "export PATH=$HOME/.local/bin:$HOME/bin:$PATH" >> ~/.bash_profile
source ~/.bash_profile
pip install --user -U awscli

Provision an ECS cluster, with two C5 instances, and two P3 instances:

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --capabilities CAPABILITY_IAM                            

While AWS CloudFormation is provisioning resources, examine the template used to build your infrastructure. Open `cluster-cpu-gpu.yml`, and you see that you are provisioning a test VPC with two c5.2xlarge instances, and two p3.2xlarge instances. This gives you one NVIDIA Tesla V100 GPU per instance, for a total of two GPUs to run training tasks.

I adapted the TensorFlow benchmark Docker container to create a training workload. I use this container to compare the GPU scheduling and runtime.

When the CloudFormation stack is deployed, register a task definition with the ECS service:

aws ecs register-task-definition --cli-input-json file://gpu-1-taskdef.json

To request GPU resources in the task definition, the only change needed is to include a GPU resource requirement in the container definition:

            "resourceRequirements": [
                {
                    "type": "GPU",
                    "value": "1"
                }
            ],

Including this resource requirement ensures that the ECS scheduler allocates the task to an instance with a free GPU resource.
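As a quick way to see how many GPUs remain schedulable at any point, you can inspect the remaining resources on your container instances. The following is a sketch (substitute your cluster name):

# List the GPU resources still available on each container instance
aws ecs describe-container-instances \
  --cluster <your-cluster-name> \
  --container-instances $(aws ecs list-container-instances \
      --cluster <your-cluster-name> --query "containerInstanceArns[]" --output text) \
  --query "containerInstances[].remainingResources[?name=='GPU']"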

Launch a single-GPU training workload

Now you’re ready to launch the first GPU workload.

export cluster=$(aws cloudformation describe-stacks --stack-name tensorflow-test \
  --query 'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' --output text)
echo $cluster
aws ecs run-task --cluster $cluster --task-definition tensorflow-1-gpu

When you launch the task, the output shows the `gpuIds` values that are assigned to the task. This GPU is pinned to this task, and can’t be shared with any other tasks. If all GPUs are allocated, you can’t schedule additional GPU tasks until a running task with a GPU completes. That frees the GPU to be scheduled again.

When you look at the log output in Amazon CloudWatch Logs, you see that the container discovered one GPU: `/gpu0` and the training benchmark trained at a rate of 321.16 images/sec.

With your two p3.2xlarge nodes in the cluster, you are limited to two concurrent single GPU based workloads. To scale horizontally, you could add additional p3.2xlarge nodes. This would limit your workloads to a single GPU each.  To scale vertically, you could bump up the instance type,  which would allow you to assign multiple GPUs to a single task.  Now, let’s see how fast your TensorFlow container can train when assigned multiple GPUs.

Launch a multiple-GPU training workload

To begin, replace the p3.2xlarge instances with p3.16xlarge instances. This gives your cluster two instances that each have eight GPUs, for a total of 16 GPUs that can be allocated.

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --parameter-overrides GPUInstanceType=p3.16xlarge --capabilities CAPABILITY_IAM

When the CloudFormation deploy is complete, register two more task definitions to launch your benchmark container requesting more GPUs:

aws ecs register-task-definition --cli-input-json file://gpu-4-taskdef.json  
aws ecs register-task-definition --cli-input-json file://gpu-8-taskdef.json 

Next, launch two TensorFlow benchmark containers, one requesting four GPUs, and one requesting eight GPUs:

aws ecs run-task --cluster $cluster --task-definition tensorflow-4-gpu
aws ecs run-task --cluster $cluster --task-definition tensorflow-8-gpu

With each task request, GPUs are allocated: four in the first request, and eight in the second request. Again, these GPUs are pinned to the task, and not usable by any other task until these tasks are complete.

Check the log output in CloudWatch Logs:

On the “devices” lines, you can see that the container discovered and used four (or eight) GPUs. Also, the total images/sec improved to 1297.41 with four GPUs, and 1707.23 with eight GPUs.

Because you can pin single or multiple GPUs to a task, running advanced GPU based training tasks on Amazon ECS is easier than ever!

Cleanup

To clean up your running resources, delete the CloudFormation stack:

aws cloudformation delete-stack --stack-name tensorflow-test

Conclusion

For more information, see Working with GPUs on Amazon ECS.

If you want to keep up on the latest container info from AWS, please follow me on Twitter and tweet any questions! @brentContained

Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR

Post Syndicated from Nathan Peck original https://aws.amazon.com/blogs/compute/setting-up-aws-privatelink-for-amazon-ecs-and-amazon-ecr/

Amazon ECS and Amazon ECR now have support for AWS PrivateLink. AWS PrivateLink is a networking technology designed to enable access to AWS services in a highly available and scalable manner. It keeps all the network traffic within the AWS network. When you create AWS PrivateLink endpoints for ECR and ECS, these service endpoints appear as elastic network interfaces with a private IP address in your VPC.

Before AWS PrivateLink, your Amazon EC2 instances had to use an internet gateway to download Docker images stored in ECR or communicate to the ECS control plane. Instances in a public subnet with a public IP address used the internet gateway directly. Instances in a private subnet used a network address translation (NAT) gateway hosted in a public subnet. The NAT gateway would then use the internet gateway to talk to ECR and ECS.

Now that AWS PrivateLink support has been added, instances in both public and private subnets can use it to get private connectivity to download images from Amazon ECR. Instances can also communicate with the ECS control plane via AWS PrivateLink endpoints without needing an internet gateway or NAT gateway.

 

This networking architecture is considerably simpler. It enables enhanced security by allowing you to deny your private EC2 instances access to anything other than these AWS services. That’s assuming that you want to block all other outbound internet access for those instances. For this to work, you must create some AWS PrivateLink resources:

  • AWS PrivateLink endpoints for ECR. This allows instances in your VPC to communicate with ECR to download image manifests
  • Gateway VPC endpoint for Amazon S3. This allows instances to download the image layers from the underlying private Amazon S3 buckets that host them.
  • AWS PrivateLink endpoints for ECS. These endpoints allow instances to communicate with the telemetry and agent services in the ECS control plane.

This post explains how to create these resources.

Create an AWS PrivateLink interface endpoint for ECR

ECR requires two interface endpoints:

  • com.amazonaws.region.ecr.api
  • com.amazonaws.region.ecr.dkr

In the VPC console, create the interface VPC endpoints for ECR using the endpoint creation wizard. Choose AWS services and select an endpoint. Substitute your AWS Region of choice.

Next, specify the VPC and subnets to which the AWS PrivateLink interface should be added. Make sure that you select the same VPC in which your ECS cluster is running. To be on the safe side, select every Availability Zone and subnet from the list. Each zone has a list of the subnets available. You can select all the subnets in each Availability Zone.

However, depending on your networking needs, you might also choose to only enable the AWS PrivateLink endpoint in your private subnets from each Availability Zone. Let instances running in a public subnet continue to communicate with ECR via the public subnet’s internet gateway.

Next, enable Private DNS Name, which is required for the com.amazonaws.region.ecr.dkr endpoint.

A private hosted zone enables you to access the resources in your VPC using the Amazon ECR default DNS domain names. You don’t need to use the private IPv4 address or the private DNS hostnames provided by Amazon VPC endpoints. The Amazon ECR DNS hostname that the AWS CLI and Amazon ECR SDKs use by default (https://api.ecr.region.amazonaws.com) resolves to your VPC endpoint.

If you enabled a private hosted zone for com.amazonaws.region.ecr.api and you are using an SDK released before January 24, 2019, you must specify the following endpoint when using an SDK or the AWS CLI. Use the following command:

aws --endpoint-url https://api.ecr.region.amazonaws.com ecr describe-repositories

If you don’t enable a private hosted zone, use the following command:

aws --endpoint-url https://VPC_Endpoint_ID.api.ecr.region.vpce.amazonaws.com ecr describe-repositories

If you enabled a private hosted zone and you are using the SDK released on January 24, 2019 or later, use the following command:

aws ecr describe-repositories

Lastly, specify a security group for the interface itself. This is going to control whether each host is able to talk to the interface. The security group should allow inbound connections on port 80 from the instances in your cluster.

You may have a security group that is applied to all the EC2 instances in the cluster, perhaps using an Auto Scaling group. You can create a rule that allows the VPC endpoint to be accessed by any instance in that security group.

Finally, choose Create endpoint. The new endpoint appears in the list.
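If you prefer to script this instead of using the console, both ECR interface endpoints can be created with a sketch like the following (substitute your Region, VPC ID, subnet IDs, and security group ID):

# Create both ECR interface endpoints with private DNS enabled
for service in ecr.api ecr.dkr; do
  aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type Interface \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.us-east-1.$service \
    --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
done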

Add a gateway VPC endpoint for S3

The next step is to create a gateway VPC endpoint for S3. This is necessary because ECR uses S3 to store Docker image layers. When your instances download Docker images from ECR, they must access ECR to get the image manifest and S3 to download the actual image layers.

S3 uses a slightly different endpoint type called a gateway. Be careful about adding an S3 gateway to your VPC if your application is actively using S3. With gateway endpoints, your application’s existing connections to S3 may be briefly interrupted while the gateway is being added. You may have a busy cluster with many active ECS deployments, causing image layer downloads from S3. Or, your application itself may make heavy usage of S3. In that case, it’s best to create a fresh new VPC with an S3 gateway, then migrate your ECS cluster and its containers into that VPC.

To add the S3 gateway endpoint, select com.amazonaws.region.s3 on the list of AWS services and select the VPC hosting your ECS cluster. Gateway endpoints are added to the VPC route table for the subnets. Select each route table associated with the subnet in which the S3 gateway should be.

Instead of using a security group, the gateway endpoint uses an IAM policy document to limit access to the service. This policy is similar to an IAM policy but does not replace the default level of access that your applications have through their IAM role. It just further limits what portions of the service are available via the gateway.

It’s okay to just use the default Full Access policy. Any restrictions you have put on your task IAM roles or other IAM user policies still apply on top of this policy. For information about a minimal access policy, see the Minimum Amazon S3 Bucket Permissions for Amazon ECR.

Choose Create to add this gateway endpoint to your VPC. When you view the route tables in your VPC subnets, you see an S3 gateway that is used whenever ECR Docker image layers are being downloaded from S3.
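The CLI equivalent is similar, except that a gateway endpoint is associated with route tables instead of subnets and security groups (again, substitute your own IDs):

# Gateway endpoints are associated with route tables, not subnets
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Gateway \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0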

Create an AWS PrivateLink interface endpoint for ECS

In addition to downloading Docker images from ECR, your EC2 instances must also communicate with the ECS control plane to receive orchestration instructions.

ECS requires three endpoints:

  • com.amazonaws.region.ecs-agent
  • com.amazonaws.region.ecs-telemetry
  • com.amazonaws.region.ecs

Create these three interface endpoints in the same way that you created the endpoint for ECR, by adding each endpoint and setting the subnets and security group for the endpoint.

After the endpoints are created and added to your VPC, there is one additional step. Make sure that your ECS agent is upgraded to version 1.25.1 or higher. For more information, see the instructions for upgrading the ECS agent.
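
One way to confirm the agent version is the agent’s local introspection API, queried from the container instance itself:

# The JSON response includes a "Version" field; look for v1.25.1 or higher.
curl -s http://localhost:51678/v1/metadata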

If you are already running the right version of the ECS agent, restart any ECS agents that are currently running in the VPC. The ECS agent maintains a persistent WebSocket connection to the ECS backend, and adding VPC endpoints does not interrupt existing connections. Until you restart it, the agent continues to use its existing connection instead of establishing a new one through the new endpoint.

To restart the agent with no disruption to your application containers, you can connect using SSH to each EC2 instance in the cluster and issue the following command:

sudo docker restart ecs-agent

This restarts the ECS agent without stopping any of the other application containers on the host. If your application is stateless and safe to stop at any time, or if you don’t have (or don’t want) SSH access to the underlying hosts, you can instead reboot each EC2 instance in the cluster one at a time. This restarts the agent on that host, and any service-launched tasks that were running there are restarted on other hosts.

Conclusion

In this post, I showed you how to add AWS PrivateLink endpoints to your VPC for ECS and ECR, including an S3 gateway for ECR layer downloads.

With these endpoints in place, the instances in your ECS cluster communicate directly with the ECS control plane and download Docker images without making any connections outside of your VPC through an internet gateway or NAT gateway. All container orchestration traffic stays inside the VPC.

If you have questions or suggestions, please comment below.

Migrate Wildfly Cluster to Amazon ECS using Service Discovery

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/migrate-wildfly-cluster-to-ecs-using-service-discovery/

This post is courtesy of Vidya Narasimhan, AWS Solutions Architect

1. Overview

Java Enterprise Edition has been an important server-side platform for developing mission-critical & large-scale enterprise applications for over a decade. High availability & fault tolerance for such applications are typically achieved through the built-in JEE clustering provided by the platform.

JEE clustering represents a group of machines working together to transparently provide enterprise services such as JNDI, EJB, JMS, HTTPSession etc. that enable distribution, discovery, messaging, transactions, caching, replication & component failover. The implementation of clustering technology varies across the JEE platforms provided by different vendors, and many implementations rely on proprietary communication protocols that use multicast for intra-cluster communication, which is not supported in the public cloud.

This article is relevant for JEE platforms & other products that use JGroups-based clustering, such as Wildfly. The solution described allows easy migration of applications built on these platforms with native clustering to Amazon Elastic Container Service (Amazon ECS), a highly scalable, fast container management service that makes it easy to orchestrate, run & scale Docker containers on a cluster. This solution is useful when the business objective is to migrate to the cloud quickly with minimal changes to the application. The approach recommends a lift & shift to AWS, where the initial focus is to migrate as-is and bring in optimizations incrementally later.

Whether the JEE application to be migrated is designed as a monolith or as microservices, a legacy or green-field deployment, there are multiple reasons why organizations should opt for containerization of their application. This link explains the benefits of containerization well (see the section Why Use Containers): https://aws.amazon.com/getting-started/projects/break-monolith-app-microservices-ecs-docker-ec2/module-one/

2. Wildfly Clustering on ECS

Here onwards, this article highlights how to migrate a standard clustered JEE app deployed on the Wildfly Application Server to Amazon ECS. Wildfly supports clustering out of the box in two modes, standalone & domain. This article explores how to set up a WildFly cluster on ECS with multiple standalone Wildfly nodes enabled for HA to form a cluster. The clustering is demonstrated through a web application that replicates session information across the cluster nodes and can withstand a failover without session data loss.

The important clustering components that deserve a mention right away are ECS Service Discovery, JGroups & Infinispan.

  • JGroups – Wildfly clustering is enabled by the popular open-source JGroups toolkit. The JGroups subsystem provides group communication support for HA services, using multicast transmission by default. It handles node discovery and reliable messaging between the nodes as follows:
    • Node-to-node messaging — based on UDP/multicast by default; can be switched to TCP/unicast.
    • Node discovery — uses multicast ping (MPING) by default. Alternatives include TCPPING, S3_PING, JDBC_PING, DNS_PING and others.

This article focuses on DNS_PING for node discovery over TCP.

  • ECS Service discovery – An Amazon ECS service can optionally be configured to use Amazon ECS Service Discovery. Service discovery uses Amazon Route 53 auto naming API actions to manage DNS entries (A or SRV records) for service tasks, making them discoverable within your VPC. You can specify health check conditions in a service task definition and Amazon ECS ensures that only healthy service endpoints are returned by a service lookup.

As your services scale up or down in response to load or container health, the Route 53 hosted zone is kept up to date.

Wildfly uses the JGroups DNS_PING discovery protocol to find cluster nodes: it queries the DNS service endpoint that ECS service discovery maintains in Route 53 (a lookup sketch follows this list).

  • Infinispan – Wildfly uses the Infinispan subsystem to provide high-performance, clustered, transactional caching. In a clustered web application, Infinispan handles the replication of application data across the cluster by means of a replicated/distributed cache. Under the hood, it uses a JGroups channel for data transmission within the cluster.
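
To see what DNS_PING sees, you can resolve the service discovery endpoint from an instance inside the VPC. A hypothetical lookup (myapp.sampleaws.com is the illustrative endpoint used later in this post):

# Route 53 returns one A record per healthy task registered by ECS service discovery.
dig +short myapp.sampleaws.com
# 10.0.1.23
# 10.0.2.47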

3. Implementation Instructions

Configure Wildfly

  • Modify the Wildfly standalone configuration file, standalone-ha.xml (the HA suffix indicates the high-availability configuration).
  1. Modify the JGroups subsystem – add a TCP stack with DNS_PING as the discovery protocol & configure the DNS query endpoint. Note that the DNS_QUERY value must match the ECS service endpoint you configure later when creating the ECS service.
  2. Change the default JGroups stack to point to the TCP stack.
  3. Configure a custom Infinispan replicated cache to be used by the web app, or use the default cache.

Build the Docker image & store it in Amazon Elastic Container Registry (ECR)

  1. Package the JEE application & the Wildfly platform into a JBoss/Wildfly Docker image. Create a Dockerfile & include the following:
    1. Install the WildFly distribution & set permissions – this approach uses Wildfly 15.0.0.Final, the latest distribution at the time of writing.
    2. Copy the modified Wildfly standalone-ha.xml to the container.
    3. Deploy the JEE web application. This simple web app is configured as distributable and uses Infinispan to replicate session information across cluster nodes. It displays a page containing the container IP/hostname, session ID & session data, and helps demonstrate session replication.
    4. Define a custom entrypoint, entrypoint.sh, to boot Wildfly with its interfaces bound to the container IP address. The script fetches the container metadata, extracts the container IP & binds the Wildfly interfaces to it. This interface binding is an important step, as it enables the application-related network communication (web, messaging) between the containers (a sketch follows this list).
    5. Add the entrypoint.sh script to the image in the Dockerfile.
    6. Build the container image & push it to an ECR repository; a build-and-push sketch follows the Github link below. Amazon Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images.
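
The following is a hypothetical sketch of entrypoint.sh, not the exact script from the repository. It assumes the awsvpc network mode, a recent ECS agent that exposes the v3 container metadata endpoint (ECS_CONTAINER_METADATA_URI), and that curl and jq are installed in the image:

#!/bin/sh
# Look up the task ENI address from the ECS container metadata endpoint and
# bind WildFly's interfaces to it so JGroups/Infinispan traffic can flow
# between containers.
CONTAINER_IP=$(curl -s "${ECS_CONTAINER_METADATA_URI}" | jq -r '.Networks[0].IPv4Addresses[0]')
exec /opt/jboss/wildfly/bin/standalone.sh \
  -c standalone-ha.xml \
  -b "${CONTAINER_IP}" \
  -bprivate="${CONTAINER_IP}" \
  -bmanagement=0.0.0.0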

The Wildfly configuration files, the Dockerfile & the web app WAR file can be found at the Github link https://github.com/vidyann/Wildfly_ECS
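
A minimal build-and-push sketch follows; the account ID, region & repository name are placeholders for your own ECR repository (older AWS CLI versions use aws ecr get-login instead of get-login-password):

# Authenticate Docker to ECR, then build, tag & push the image.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com
docker build -t wildfly-ha-app .
docker tag wildfly-ha-app:latest 111122223333.dkr.ecr.us-east-1.amazonaws.com/wildfly-ha-app:latest
docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/wildfly-ha-app:latest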

Create ECS Service with service discovery

  • Create a Service using service discovery.
    • This link describes the steps to set up an ECS task & service with service discovery: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-service-discovery.html#create-service-discovery-taskdef. Though the example in the link creates a Fargate cluster, you can create an EC2-based cluster for this example as well. A CLI outline of these steps appears after this list.
    • While configuring the task, choose awsvpc as the network mode. The task networking features provided by the awsvpc network mode give Amazon ECS tasks the same networking properties as Amazon EC2 instances. The benefits of task networking can be found here – https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.html
    • Tasks can be flagged as compatible with EC2, Fargate, or both. Here is what the cluster & service look like:
    • When setting up the container details in the task, use 8080 as the port, which is the default Wildfly port (this can be changed through the Wildfly configuration). Enable CloudWatch Logs, which captures the Wildfly logging.
    • While configuring the ECS service, ensure that the service name & namespace combine to form a service endpoint that exactly matches the DNS_QUERY endpoint configured in the Wildfly configuration file. The container security group should allow inbound traffic to port 8080. Here is what the service endpoint looks like:
    • The Route 53 registry created by ECS is shown below. We see two DNS entries corresponding to the DNS endpoint myapp.sampleaws.com.
    • Finally, view the Wildfly logs in the console by clicking a task instance. You can verify that clustering is enabled by looking for a log entry like the one below:

Here we see that a Wildfly cluster formed with two nodes (matching the two records shown in Route 53).
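
For reference, here is a hypothetical CLI outline of the console steps above. The names, IDs & ARNs are placeholders, and the namespace & service names mirror the myapp.sampleaws.com endpoint configured as the DNS_QUERY value in Wildfly:

# Create the private DNS namespace and the service discovery service.
aws servicediscovery create-private-dns-namespace \
  --name sampleaws.com --vpc vpc-0123456789abcdef0
aws servicediscovery create-service \
  --name myapp \
  --dns-config "NamespaceId=ns-0123456789abcdef,DnsRecords=[{Type=A,TTL=60}]" \
  --health-check-custom-config FailureThreshold=1

# Create the ECS service (awsvpc mode) and attach it to the service registry.
aws ecs create-service \
  --cluster wildfly-cluster \
  --service-name myapp \
  --task-definition wildfly-task \
  --desired-count 2 \
  --service-registries "registryArn=arn:aws:servicediscovery:us-east-1:111122223333:service/srv-0123456789abcdef" \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0aaa1111bbbb2222c],securityGroups=[sg-0cluster00000000000]}"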

Run the Web App in a browser

  • Spin up a Windows instance in the VPC & open the web app in a browser. Below is a screenshot of the webapp:
  • Open the app in different browsers & tabs & note the container IP & session ID. Now force a node shutdown by resizing the ECS service to a single task, as sketched below. Although the container IP shown in the webapp changes, the session ID does not, and the webapp remains available with the HTTP session alive, demonstrating session replication & failover across the cluster nodes.
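
One way to force that failover from the CLI (cluster & service names are placeholders matching the earlier sketch):

# Scale the service down to one task; the surviving node keeps the replicated session.
aws ecs update-service \
  --cluster wildfly-cluster \
  --service myapp \
  --desired-count 1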

4. Summary

Our goal here is to migrate enterprise JEE apps to Amazon ECS by tweaking a few configurations, while immediately gaining the benefits of containerization & orchestration managed by ECS. By delegating the undifferentiated heavy lifting of container management, orchestration & scaling to ECS, you can focus on improving or re-architecting your application toward a microservices-oriented architecture. Note that all the deployment procedures in this article can be fully automated with the AWS CI/CD services.

AWS Fargate Price Reduction – Up to 50%

Post Syndicated from Nathan Peck original https://aws.amazon.com/blogs/compute/aws-fargate-price-reduction-up-to-50/

AWS Fargate is a compute engine that uses containers as its fundamental compute primitive. AWS Fargate runs your application containers for you on demand. You no longer need to provision a pool of instances or manage a Docker daemon or orchestration agent. Because the infrastructure that runs your containers is invisible, you don’t have to worry about whether you have provisioned enough instances to run your containerized workload. You also don’t have to worry about whether you’re using those instances efficiently to avoid paying for resources that you don’t use. You no longer need to do undifferentiated heavy lifting to maintain the infrastructure that runs your containers. AWS Fargate automatically updates and patches underlying resources to keep you safe from vulnerabilities in the underlying operating system and software. AWS Fargate uses an on-demand pricing model that charges per vCPU and per GB of memory reserved per second, with a 1-minute minimum.

At re:Invent 2018 we announced Firecracker, an open source virtualization technology that is purpose-built for creating and managing secure, multi-tenant containers and functions-based services. Firecracker enables you to deploy workloads in lightweight virtual machines called microVMs. These microVMs can initiate code faster, with less overhead. Innovations such as these allow us to improve the efficiency of Fargate and help us pass on cost savings to customers.

Effective January 7th, 2019 Fargate pricing per vCPU per second is being reduced by 20%, and pricing per GB of memory per second is being reduced by 65%. Depending on the ratio of CPU to memory that you’re allocating for your containers, you could see an overall price reduction of anywhere from 35% to 50%.
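
As a rough sanity check of how the two per-dimension cuts blend, here is the arithmetic for the smallest task size. The rates are assumptions based on the published us-east-1 pricing at the time ($0.0506 per vCPU-hour and $0.0127 per GB-hour before the change; $0.04048 and $0.004445 after); confirm against the current pricing page:

# Blended discount for a 0.25 vCPU / 0.5 GB task at the assumed us-east-1 rates.
old=$(echo "0.25*0.0506 + 0.5*0.0127" | bc -l)      # ~ $0.0190 per hour before the cut
new=$(echo "0.25*0.04048 + 0.5*0.004445" | bc -l)   # ~ $0.0123 per hour after the cut
echo "(1 - $new/$old) * 100" | bc -l                # ~ 35.04, matching the -35.00% row below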

The following table shows the price reduction for each built-in launch configuration.

vCPU | GB Memory | Effective Price Cut
0.25 | 0.5       | -35.00%
0.25 | 1         | -42.50%
0.25 | 2         | -50.00%
0.5  | 1         | -35.00%
0.5  | 2         | -42.50%
0.5  | 3         | -47.00%
0.5  | 4         | -50.00%
1    | 2         | -35.00%
1    | 3         | -39.30%
1    | 4         | -42.50%
1    | 5         | -45.00%
1    | 6         | -47.00%
1    | 7         | -48.60%
1    | 8         | -50.00%
2    | 4         | -35.00%
2    | 5         | -37.30%
2    | 6         | -39.30%
2    | 7         | -41.00%
2    | 8         | -42.50%
2    | 9         | -43.80%
2    | 10        | -45.00%
2    | 11        | -46.10%
2    | 12        | -47.00%
2    | 13        | -47.90%
2    | 14        | -48.60%
2    | 15        | -49.30%
2    | 16        | -50.00%
4    | 8         | -35.00%
4    | 9         | -36.20%
4    | 10        | -37.30%
4    | 11        | -38.30%
4    | 12        | -39.30%
4    | 13        | -40.20%
4    | 14        | -41.00%
4    | 15        | -41.80%
4    | 16        | -42.50%
4    | 17        | -43.20%
4    | 18        | -43.80%
4    | 19        | -44.40%
4    | 20        | -45.00%
4    | 21        | -45.50%
4    | 22        | -46.10%
4    | 23        | -46.50%
4    | 24        | -47.00%
4    | 25        | -47.40%
4    | 26        | -47.90%
4    | 27        | -48.30%
4    | 28        | -48.60%
4    | 29        | -49.00%
4    | 30        | -49.30%

Many engineering organizations such as Turner Broadcasting System, Veritone, and Catalytic have already been using AWS Fargate to achieve significant infrastructure cost savings for batch jobs, cron jobs, and other on-and-off workloads. Running a cluster of instances at all times to run your containers constantly incurs cost, but AWS Fargate stops charging when your containers stop.

With these new price reductions, AWS Fargate also enables significant savings for containerized web servers, API services, and background queue consumers run by organizations like KPMG, CBS, and Product Hunt. If your application is currently running on large EC2 instances that peak at 10-20% CPU utilization, consider migrating to containers in AWS Fargate. Containers give you more granularity to provision the exact amount of CPU and memory that your application needs. You no longer pay for instance resources that your application doesn’t use. If a sudden spike of traffic causes your application to require more resources you still have the ability to rapidly scale your application out by adding more containers, or scale your application up by launching larger containers.

AWS Fargate lets you focus on building your containerized application without worrying about the infrastructure. This encompasses not just the infrastructure capacity provisioning, monitoring, and maintenance but also the infrastructure price. Implementing Firecracker in AWS Fargate is just part of our journey to keep making AWS Fargate faster, more powerful, and more efficient. Running your containers in AWS Fargate allows you to benefit from these improvements without any manual intervention required on your part.

AWS Fargate has achieved SOC, PCI, HIPAA BAA, ISO, MTCS, C5, and ENS High compliance certification, and has a 99.99% SLA. You can get started with AWS Fargate in 13 AWS Regions around the world.