Tag Archives: Containers

Anatomy of CVE-2019-5736: A runc container escape!

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/anatomy-of-cve-2019-5736-a-runc-container-escape/

This post is courtesy of Samuel Karp, Senior Software Development Engineer — Amazon Container Services.

On Monday, February 11, CVE-2019-5736 was disclosed.  This vulnerability is a flaw in runc, which can be exploited to escape Linux containers launched with Docker, containerd, CRI-O, or any other user of runc.  But how does it work?  Dive in!

This concern has already been addressed for AWS, and no customer action is required. For more information, see the security bulletin.

A review of Linux process management

Before I explain the vulnerability, here’s a review of some Linux basics.

  • Processes and syscalls
  • What is /proc?
  • Dynamic linking

Processes and syscalls

Processes form the core unit of running programs on Linux. Every launched program is represented by one or more processes.  Processes contain a variety of data about the running program, including a process ID (pid), a table tracking in-use memory, a pointer to the currently executing instruction, a set of descriptors for open files, and so forth.

Processes interact with the operating system to perform a variety of operations (for example, reading and writing files, taking input, communicating on the network, etc.) via system calls, or syscalls.  Syscalls can perform a variety of actions. The ones I’m interested in today involve creating other processes (typically through fork(2) or clone(2)) and changing the currently running program into something else (execve(2)).

File descriptors are how a process interacts with files, as managed by the Linux kernel.  File descriptors are short identifiers (numbers) that are passed to the appropriate syscalls for interacting with files: read(2), write(2), close(2), and so forth.
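
To make this concrete, here’s a small, self-contained C example (mine, not from the original post) that opens a file, reads from it through the returned descriptor, and then closes it.  The path /etc/hostname is just a convenient file to read:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // open(2) returns a small integer: the file descriptor.
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    // read(2) and close(2) operate on the descriptor, not the path.
    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n >= 0) {
        buf[n] = '\0';
        printf("read %zd bytes: %s", n, buf);
    }
    close(fd);
    return 0;
}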

Sometimes a process wants to spawn another process.  That might be a shell running a program you typed at the terminal, a daemon that needs a helper, or even concurrent processing without threads.  When this happens, the process typically uses the fork(2) or clone(2) syscalls.

These syscalls have some differences, but they both operate by creating another copy of the currently executing process and sharing some state.  That state can include things like the memory structures (either shared memory segments or copies of the memory) and file descriptors.

After the new process is started, it’s the responsibility of both processes to figure out which one they are (am I the parent? Am I the child?). Then, they take the appropriate action.  In many cases, the appropriate action is for the child to do some setup, and then execute the execve(2) syscall.

The following example shows the use of fork(2), in pseudocode:

func main() {
    child_pid = fork();
    if (child_pid > 0) {
        // This is the parent process, since child_pid is the pid of the child
        // process.
    } else if (child_pid == 0) {
        // This is the child process. It can retrieve its own pid via getpid(2),
        // if desired.  This child process still sees all the variables in memory
        // and all the open file descriptors.
    }
}

The execve(2) syscall instructs the Linux kernel to replace the currently executing program with another program, in-place.  When called, the Linux kernel loads the specified executable and passes it the specified arguments.  Because this is done in place, the pid is preserved and a variety of other contextual information is carried over, including environment variables, the current working directory, and any open files.

func main() {
    // execl(3) is a wrapper around the execve(2) syscall that accepts the
    // arguments for the executed program as a list.
    // The first argument to execl(3) is the path of the executable to
    // execute, which in this case is the pwd(1) utility for printing out
    // the working directory.
    // The next argument to execl(3) is the first argument passed through
    // to the new program (in a C program, this would be the first element
    // of the argv array, or argv[0]).  By convention, this is the same as
    // the path of the executable.
    // The remaining arguments to execl(3) are the other arguments visible
    // in the new program's argv array, terminated by NULL.  As you're
    // not passing any additional arguments, just pass NULL here.
    execl("/bin/pwd", "/bin/pwd", NULL);
    // Nothing after this point executes, since the running process has been
    // replaced by the new pwd(1) program.
}

Wait…open files?  By default, open files are passed across the execve(2) boundary.  This is useful in cases where the new program can’t open the file, for example if there’s a new mount covering the existing path.  This is also the mechanism by which the standard I/O streams (stdin, stdout, and stderr) are made available to the new program.

While convenient in some use cases, it’s not always desired to preserve open file descriptors in the new program. This behavior can be changed by passing the O_CLOEXEC flag to open(2) when opening the file or by setting the FD_CLOEXEC flag with fcntl(2).  Using O_CLOEXEC or FD_CLOEXEC (which are both short for close-on-exec) prevents the new program from having access to the file descriptor.

func main() {
    // open(2) opens a file.  The first argument to open is the path of the file
    // and the second argument is a bitmask of flags that describe options
    // applied to the file that's opened.  open(2) then returns a file
    // descriptor, which can be used in subsequent syscalls to represent this
    // file.
    // For this example, open /dev/urandom, which is a file containing random
    // bytes.  Pass two flags: O_RDONLY and O_CLOEXEC; O_RDONLY indicates that
    // the file should be open for reading but not writing, and O_CLOEXEC
    // indicates that the file descriptor should not pass through the execve(2)
    // boundary.
    fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
    // All valid file descriptors are positive integers, so a returned value < 0
    // indicates that an error occurred.
    if (fd < 0) {
        // error() stands in here for perror(3), which prints out the last error that occurred.
        error("could not open /dev/urandom");
        // exit(3) causes a process to exit with a given exit code. Return 1
        // here to indicate that an error occurred.
        exit(1);
    }
}

What is /proc?

/proc (or proc(5)) is a pseudo-filesystem that provides access to a number of Linux kernel data structures.  Every process in Linux has a directory available for it called /proc/[pid].  This directory stores a bunch of information about the process, including the arguments it was given when the program started, the environment variables visible to it, and the open file descriptors.

The special files inside /proc/[pid]/fd describe the file descriptors that the process has open.  They look like symbolic links (symlinks), and you can see the original path of the file, but they aren’t exactly symlinks. You can pass them to open(2) even if the original path is inaccessible and get another working file descriptor.

Another file inside /proc/[pid] is called exe. This file is like the ones in /proc/[pid]/fd except that it points to the binary program that is executing inside that process.

/proc/[pid] also has a companion directory, /proc/self.  This directory is always the same as /proc/[pid] of the process that is accessing it. That is, you can always read your own /proc data from /proc/self without knowing your pid.
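
Here’s a short C sketch of my own that demonstrates the property that matters for the rest of this story: an entry under /proc/self/fd can be reopened to get a second, independent file descriptor to the same file.  The path /tmp/example is only for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Open an ordinary file (creating it if necessary).
    int fd = open("/tmp/example", O_RDONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    // Build the /proc path that describes the descriptor we already hold.
    char path[64];
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    // Reopening that path yields a second, independent descriptor to the
    // same file, even if the original path were unlinked or hidden by a
    // new mount in the meantime.
    int fd2 = open(path, O_RDONLY);
    printf("original fd: %d, reopened fd: %d\n", fd, fd2);
    close(fd2);
    close(fd);
    return 0;
}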

Dynamic linking

When writing programs, software developers typically use libraries—collections of previously written code intended to be reused.  Libraries can cover all sorts of things, from high-level concerns like machine learning to lower-level concerns like basic data structures or interfaces with the operating system.

In the code example above, you can see the use of a library through a call to a function defined in a library (fork).

Libraries are made available to programs through linking: a mechanism for resolving symbols (types, functions, variables, etc.) to their definition.  On Linux, programs can be statically linked, in which case all the linking is done at compile time and all symbols are fully resolved. Or they can be dynamically linked, in which case at least some symbols are unresolved until a runtime linker makes them available.

Dynamic linking makes it possible to replace some parts of the resulting code without recompiling the whole application. This is typically used for upgrading libraries to fix bugs, enhance performance, or to address security concerns.  In contrast, static linking requires re-compiling and re-linking each program that uses a given library to affect the same change.

On Linux, runtime linking is typically performed by ld-linux.so(8), which is provided by the GNU project toolchain.  Dynamically linked libraries are specified by a name embedded into the compiled binary.  This dynamic linker reads those names and then performs a search across a standard set of paths to find the associated library file (a shared object file, or .so).

The dynamic linker’s search path can be influenced by the LD_LIBRARY_PATH environment variable.  The LD_PRELOAD environment variable can tell the linker to load additional, user-specified libraries before all others. This is useful in debugging scenarios to allow selective overriding of symbols without having to rebuild a library entirely.
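
As a quick illustration (my own example, unrelated to runc), here’s a tiny shared library with a constructor.  Preloading it into any dynamically linked program runs the constructor before that program’s main():

// preload_demo.c: a minimal LD_PRELOAD library.
// Build:  gcc -shared -fPIC -o preload_demo.so preload_demo.c
// Use:    LD_PRELOAD=./preload_demo.so ls
#include <stdio.h>

// The constructor attribute asks the dynamic linker to call this function
// as soon as the library is loaded, before main() runs.
__attribute__((constructor))
static void injected(void) {
    fprintf(stderr, "preload_demo.so: loaded before main()\n");
}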

The vulnerability

Now that the cast of characters is set (fork(2), execve(2), open(2), proc(5), file descriptors, and linking), I can start talking about the vulnerability in runc.

runc is a container runtime.  Like a shell, its primary purpose is to launch other programs. However, it does so after manipulating Linux resources like cgroups, namespaces, mounts, seccomp, and capabilities to make what is referred to as a “container.”

The primary mechanism for setting up some of these resources, like namespaces, is through flags to the clone(2) syscall that take effect in the new process.  The target of the final execve(2) call is the program the user requested.  With a container, that target can be specified in the container image or through explicit arguments.

The CVE announcement states:

The vulnerability allows a malicious container to […] overwrite the host runc binary […].  The level of user interaction is being able to run any command […] as root within a container [when creating] a new container using an attacker-controlled image.

The operative parts of this are: being able to overwrite the host runc binary (that seems bad) by running a command (that’s…what runc is supposed to do…).  Note too that the vulnerability is as simple as running a command and does not require running a container with elevated privileges or running in a non-default configuration.

Don’t containers protect against this?

Containers are, in many ways, intended to isolate the host from a given workload or to isolate a given workload from the host.  One of the main mechanisms for doing this is through a separate view of the filesystem.  With a separate view, the container shouldn’t be able to access the host’s files and should only be able to see its own.  runc accomplishes this using a mount namespace and mounting the container image’s root filesystem as /.  This effectively hides the host’s filesystem.

Even with techniques like this, things can pass through the mount namespace.  For example, the /proc/cmdline file contains the running Linux kernel’s command-line parameters.  One of those parameters typically indicates the host’s root filesystem, and a container with enough access (like CAP_SYS_ADMIN) can remount the host’s root filesystem within the container’s mount namespace.

That’s not what I’m talking about today, as that requires non-default privileges to run.  The interesting thing today is that the /proc filesystem exposes a path to the original program’s file, even if that file is not located in the current mount namespace.

What makes this troublesome is that interacting with Linux primitives like namespaces typically requires you to run as root, somewhere.  In most installations involving runc (including the default configuration in Docker, Kubernetes, containerd, and CRI-O), the whole setup runs as root.

runc must be able to perform a number of operations that require elevated privileges, even if your container is limited to a much smaller set of privileges. For example, namespace creation and mounting both require the elevated capability CAP_SYS_ADMIN, and configuring the network requires the elevated capability CAP_NET_ADMIN. You might see a pattern here.

An alternative to running as root is to leverage a user namespace. User namespaces map a set of UIDs and GIDs inside the namespace (including ones that appear to be root) to a different set of UIDs and GIDs outside the namespace.  Kernel operations that are user-namespace-aware can delineate privileged actions occurring inside the user namespace from those that occur outside.

However, user namespaces are not yet widely employed and are not enabled by default. The set of kernel operations that are user-namespace-aware is still growing, and not everyone runs the newest kernel or user-space software.
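
To give a flavor of what that mapping looks like, here’s a rough sketch of the raw kernel interface (my own example, assuming a kernel that allows unprivileged user namespaces; this is not how runc configures them):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Remember our unprivileged UID as seen outside the namespace.
    uid_t outer_uid = getuid();

    // Create a new user namespace for this process.
    if (unshare(CLONE_NEWUSER) != 0) {
        perror("unshare");
        return 1;
    }

    // Until a mapping is written, the process appears as the overflow UID
    // (usually 65534) inside the namespace.
    printf("uid before mapping: %d\n", (int)getuid());

    // Map UID 0 inside the namespace to our real UID outside it.
    // The format is "<inside> <outside> <count>".
    int fd = open("/proc/self/uid_map", O_WRONLY);
    if (fd < 0) {
        perror("open uid_map");
        return 1;
    }
    dprintf(fd, "0 %d 1\n", (int)outer_uid);
    close(fd);

    // "root" in here is still just the original unprivileged user as far
    // as the host is concerned.
    printf("uid after mapping: %d\n", (int)getuid());
    return 0;
}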

So, /proc exposes a path to the original program’s file, and the process that starts the container runs as root.  What if that original program is something important that you knew would run again… like runc?

Exploiting it!

runc’s job is to run commands that you specify.  What if you specified /proc/self/exe?  It would cause runc to spawn a copy of itself, but running inside the context of the container, with the container’s namespaces, root filesystem, and so on.  For example, you could run the following command:

docker run --rm amazonlinux:2 /proc/self/exe

This, by itself, doesn’t get you far—runc doesn’t hurt itself.

Generally, runc is dynamically linked against some libraries that provide implementations for seccomp(2), SELinux, or AppArmor.  If you remember from earlier, ld-linux.so(8) searches a standard set of file paths to provide these implementations at runtime.  If you start runc again inside the container’s context, with its separate filesystem, you have the opportunity to provide other files in place of the expected library files. These can run your own code instead of the standard library code.

There’s an easier way, though.  Instead of having to make something that looks like (for example) libseccomp, you can take advantage of a different feature of the dynamic linker: LD_PRELOAD.  And because runc lets you specify environment variables along with the path of the executable to run, you can specify this environment variable, too.

With LD_PRELOAD, you can specify your own libraries to load first, ahead of the other libraries that get loaded.  Because the original libraries still get loaded, you don’t actually have to have a full implementation of their interface.  Instead, you can selectively override some common functions that you might want and omit others that you don’t.

So now you can inject code through LD_PRELOAD and you have a target to inject it into: runc, by way of /proc/self/exe.  For your code to get run, something must call it.  You could search for a target function to override, but that means inspecting runc’s code to figure out what could get called and how.  Again, there’s an easier way.  Dynamic libraries can specify a “constructor” that is run immediately when the library is loaded.

Using the “constructor” along with LD_PRELOAD and specifying the command as /proc/self/exe, you now have a way to inject code and get it to run.  That’s it, right?  You can now write to /proc/self/exe and overwrite runc!

Not so fast.

The Linux kernel does have a bit of a protection mechanism to prevent you from overwriting the currently running executable.  If you open /proc/self/exe for writing, you get -ETXTBSY.  This error code indicates that the file is busy, where “TXT” refers to the text (code) section of the binary.
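
Here’s a tiny sketch (again, my own) of what that protection looks like from userspace:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    // Try to open our own currently executing binary for writing.
    int fd = open("/proc/self/exe", O_WRONLY);
    if (fd < 0) {
        // Expected: the open fails and errno is ETXTBSY ("Text file busy").
        printf("open failed: %s\n", strerror(errno));
        return 0;
    }
    printf("unexpectedly got write access\n");
    close(fd);
    return 1;
}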

You know from earlier that execve(2) is a mechanism to replace the currently running executable with another, which means that the original executable isn’t in use anymore.  So instead of just having a single library that you load with LD_PRELOAD, you also must have another executable that can do the dirty work for you, which you can execve(2).

Normally, doing this would still be unsuccessful due to file permissions. Executables are typically not world-writable.  But because runc runs as root and does not change users, the new runc process that you started through /proc/self/exe and the helper program that you executed are also run as root.

After you gain write access to the runc file descriptor and you’ve replaced the currently executing program with execve(2), you can replace runc’s content with your own.  The other software on the system continues to start runc as part of its normal operation (for example, creating new containers, stopping containers, or performing exec operations inside containers). Your code has the chance to operate instead of runc.  When it gets run this way, your code runs as root, in the host’s context instead of in the container’s context.

Now you’re done!  You’ve successfully escaped the container and have full root access.

Putting that all together, you get something like the following pseudocode:

Preload pseudocode for preload.so

func constructor(){
    // /proc/self/exe is a virtual file pointing to the currently running
    // executable.  Open it here so that it can be passed to the next
    // process invoked.  It must be opened read-only, or the kernel will fail
    // the open syscall with ETXTBSY.  You cannot gain write access to the text
    // portion of a running executable.
    fd = open("/proc/self/exe", O_RDONLY);
    if (fd < 0) {
        error("could not open /proc/self/exe");
        exit(1);
    }
    // /proc/self/fd/%d is a virtual file representing the open file descriptor
    // to /proc/self/exe, which you opened earlier.
    filename = sprintf("/proc/self/fd/%d", fd);
    // execl is a call that executes a new executable, replacing the
    // currently running process and preserving aspects like the process ID.
    // Execute the "rewrite" binary, passing it arguments representing the
    // path of the open file descriptor. Because you did not pass O_CLOEXEC when
    // opening the file, the file descriptor remains open in the replacement
    // program and retains the same descriptor.
    execl("/rewrite", "/rewrite", filename, NULL);
    // execl never returns, except on an error
    error("couldn't execl");
}

Pseudocode for the rewrite program

// rewrite is your cooperating malicious program that takes an argument
// representing a file descriptor path, reopens it as read-write, and
// replaces the contents.  rewrite expects that it is unable to open
// the file on the first try, as the kernel has not closed it yet
func main(argc, *argv[]) {
    fd = 0;
    printf("Running\n");
    for(tries = 0; tries < 10000; tries++) {
        // argv[1] is the argument that contains the path to the virtual file
        // of the read-only file descriptor
        fd = open(argv[1], O_RDWR|O_TRUNC);
        if( fd >= 0 ) {
            printf("open succeeded\n");
            break;
        } else {
            if(errno != ETXTBSY) {
                // You expect a lot of ETXTBSY, so only print when you get something else
                error("open");
            }
        }
    }
    if (fd < 0) {
        error("exhausted all open attempts");
        exit(1);
    }
    dprintf(fd, "CVE-2019-5736\n");
    printf("wrote over runc!\n");
    fflush(stdout);
}

The above code was written by Noah Meyerhans, iliana weller, and Samuel Karp.

How does the patch work?

If you try the same approach with a patched runc, you instead see that opening the file with O_RDWR is denied.  This means that the patch is working!

The runc patch operates by taking advantage of some Linux kernel features introduced in kernel 3.17, specifically a syscall called memfd_create(2).  This syscall creates a temporary memory-backed file and a file descriptor that can be used to access the file.  The file has some special semantics: it is automatically removed when the last reference to it is dropped, and because it lives in memory, that just equates to freeing the memory.  It also supports another useful feature: file sealing.  File sealing allows the file to be made immutable, even to processes that are running as root.

The runc patch changes the behavior of runc so that it creates a copy of itself in one of these temporary file descriptors, and then seals it.  The next time a process launches (via fork(2)) or a process is replaced (via execve(2)), /proc/self/exe will be this sealed, memory-backed file descriptor.  When your rewrite program attempts to modify it, the Linux kernel prevents it as it’s a sealed file.
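
In isolation, those kernel features look roughly like the following sketch.  This is my own illustration of memfd_create(2) and file sealing (assuming a libc recent enough to expose memfd_create), not the actual runc patch code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    // Create an anonymous, memory-backed file that allows sealing.
    int fd = memfd_create("runc-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0) {
        perror("memfd_create");
        return 1;
    }

    // The real patch copies the runc binary into this file descriptor.
    // Write a placeholder instead.
    write(fd, "placeholder\n", 12);

    // Seal the file: no more writes, no resizing, and no removing the
    // seals themselves, even for root.
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) != 0) {
        perror("fcntl(F_ADD_SEALS)");
        return 1;
    }

    // Any further write through this descriptor (or one reopened via
    // /proc/self/fd) now fails with EPERM.
    if (write(fd, "x", 1) < 0) {
        perror("write after sealing");
    }
    return 0;
}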

Could I have avoided being vulnerable?

Yes, a few different mechanisms were available before the patch that provided mitigation for this vulnerability.  The one that I mentioned earlier is user namespaces. Mapping to a different user namespace inside the container would mean that normal Linux file permissions would effectively prevent runc from becoming writable because the compromised process inside the container is not running as the real root user.

Another mechanism, which is used by Google Container-Optimized OS, is to have the host’s root filesystem mounted as read-only.  A read-only mount of the runc binary itself would also prevent the runc binary from becoming writable.

SELinux, when correctly configured, may also prevent this vulnerability.

A different approach to preventing this vulnerability is to treat the Linux kernel as belonging to a single tenant, and spend your effort securing the kernel through another layer of separation.  This is typically accomplished using a hypervisor or a virtual machine.

Amazon invests heavily in this type of boundary. Amazon EC2 instances, AWS Lambda functions, and AWS Fargate tasks are secured from each other using techniques like these.  Amazon EC2 bare-metal instances leverage the next-generation Nitro platform that allows AWS to offer secure, bare-metal compute with a hardware root-of-trust.  Along with traditional hypervisors, the Firecracker virtual machine manager can be used to implement this technique with function- and container-like workloads.

Further reading

The original researchers who discovered this vulnerability have published their own post, CVE-2019-5736: Escape from Docker and Kubernetes containers to root on host. They describe how they discovered the vulnerability, as well as several other attempts.

Acknowledgements

I’d like to thank the original researchers who discovered the vulnerability for reporting responsibly and the OCI maintainers (and Aleksa Sarai specifically) who coordinated the disclosure. Thanks to Linux distribution maintainers and cloud providers who made updated packages available quickly.  I’d also like to thank the Amazonians who made it possible for AWS customers to be protected:

  • AWS Security who ran the whole process
  • Clare Liguori, Onur Filiz, Noah Meyerhans, iliana weller, and Tom Kirchner who performed analysis and validation of the patch
  • The Amazon Linux team (and iliana weller specifically) who backported the patch and built new Docker RPMs
  • The Amazon ECS, Amazon EKS, and AWS Fargate teams for making patched infrastructure available quickly
  • And all of the other AWS teams who put in extra effort to protect AWS customers

How to run AWS CloudHSM workloads on Docker containers

Post Syndicated from Mohamed AboElKheir original https://aws.amazon.com/blogs/security/how-to-run-aws-cloudhsm-workloads-on-docker-containers/

AWS CloudHSM is a cloud-based hardware security module (HSM) that enables you to generate and use your own encryption keys on the AWS Cloud. With CloudHSM, you can manage your own encryption keys using FIPS 140-2 Level 3 validated HSMs. Your HSMs are part of a CloudHSM cluster. CloudHSM automatically manages synchronization, high availability, and failover within a cluster.

CloudHSM is part of the AWS Cryptography suite of services, which also includes AWS Key Management Service (KMS) and AWS Certificate Manager Private Certificate Authority (ACM PCA). KMS and ACM PCA are fully managed services that are easy to use and integrate. You’ll generally use AWS CloudHSM only if your workload needs a single-tenant HSM under your own control, or if you need cryptographic algorithms that aren’t available in the fully-managed alternatives.

CloudHSM offers several options for you to connect your application to your HSMs, including PKCS#11, Java Cryptography Extensions (JCE), or Microsoft CryptoNG (CNG). Regardless of which library you choose, you’ll use the CloudHSM client to connect to all HSMs in your cluster. The CloudHSM client runs as a daemon, locally on the same Amazon Elastic Compute Cloud (EC2) instance or server as your applications.

The deployment process is straightforward if you’re running your application directly on your compute resource. However, if you want to deploy applications using the HSMs in containers, you’ll need to make some adjustments to the installation and execution of your application and the CloudHSM components it depends on. Docker containers don’t typically include access to an init process like systemd or upstart. This means that you can’t start the CloudHSM client service from within the container using the general instructions provided by CloudHSM. You also can’t run the CloudHSM client service remotely and connect to it from the containers, as the client daemon listens to your application using a local Unix Domain Socket. You cannot connect to this socket remotely from outside the EC2 instance network namespace.

This blog post discusses the workaround that you’ll need in order to configure your container and start the client daemon so that you can utilize CloudHSM-based applications with containers. Specifically, in this post, I’ll show you how to run the CloudHSM client daemon from within a Docker container without needing to start the service. This enables you to use Docker to develop, deploy and run applications using the CloudHSM software libraries, and it also gives you the ability to manage and orchestrate workloads using tools and services like Amazon Elastic Container Service (Amazon ECS), Kubernetes, Amazon Elastic Container Service for Kubernetes (Amazon EKS), and Jenkins.

Solution overview

My solution shows you how to create a proof-of-concept sample Docker container that is configured to run the CloudHSM client daemon. When the daemon is up and running, it runs the AESGCMEncryptDecryptRunner Java class, available on the AWS CloudHSM Java JCE samples repo. This class uses CloudHSM to generate an AES key, then it uses the key to encrypt and decrypt randomly generated data.

Note: In my example, you must manually enter the crypto user (CU) credentials as environment variables when running the container. For any production workload, you’ll need to carefully consider how to provide, secure, and automate the handling and distribution of your HSM credentials. You should work with your security or compliance officer to ensure that you’re using an appropriate method of securing HSM login credentials for your application and security needs.

Figure 1: Architectural diagram

Prerequisites

To implement my solution, I recommend that you have basic knowledge of the following:

  • CloudHSM
  • Docker
  • Java

Here’s what you’ll need to follow along with my example:

  1. An active CloudHSM cluster with at least one active HSM. You can follow the Getting Started Guide to create and initialize a CloudHSM cluster. (Note that for any production cluster, you should have at least two active HSMs spread across Availability Zones.)
  2. An Amazon Linux 2 EC2 instance in the same Amazon Virtual Private Cloud in which you created your CloudHSM cluster. The EC2 instance must have the CloudHSM cluster security group attached—this security group is automatically created during the cluster initialization and is used to control access to the HSMs. You can learn about attaching security groups to allow EC2 instances to connect to your HSMs in our online documentation.
  3. A CloudHSM crypto user (CU) account created on your HSM. You can create a CU by following these user guide steps.

Solution details

  1. On your Amazon Linux EC2 instance, install Docker:
    
            # sudo yum -y install docker
            

  2. Start the docker service:
    
            # sudo service docker start
            

  3. Create a new directory and step into it. In my example, I use a directory named “cloudhsm_container.” You’ll use the new directory to configure the Docker image.
    
            # mkdir cloudhsm_container
            # cd cloudhsm_container           
            

  4. Copy the CloudHSM cluster’s CA certificate (customerCA.crt) to the directory you just created. You can find the CA certificate on any working CloudHSM client instance under the path /opt/cloudhsm/etc/customerCA.crt. This certificate is created during initialization of the CloudHSM Cluster and is needed to connect to the CloudHSM cluster.
  5. In your new directory, create a new file with the name run_sample.sh that includes the contents below. The script starts the CloudHSM client daemon, waits until the daemon process is running and ready, and then runs the Java class that is used to generate an AES key to encrypt and decrypt your data.
    
            #! /bin/bash
    
            # start cloudhsm client
            echo -n "* Starting CloudHSM client ... "
            /opt/cloudhsm/bin/cloudhsm_client /opt/cloudhsm/etc/cloudhsm_client.cfg &> /tmp/cloudhsm_client_start.log &
            
            # wait for startup
            while true
            do
                if grep 'libevmulti_init: Ready !' /tmp/cloudhsm_client_start.log &> /dev/null
                then
                    echo "[OK]"
                    break
                fi
                sleep 0.5
            done
            echo -e "\n* CloudHSM client started successfully ... \n"
            
            # start application
            echo -e "\n* Running application ... \n"
            
            java -ea -Djava.library.path=/opt/cloudhsm/lib/ -jar target/assembly/aesgcm-runner.jar --method environment
            
            echo -e "\n* Application completed successfully ... \n"                      
            

  6. In the new directory, create another new file and name it Dockerfile (with no extension). This file will specify that the Docker image is built with the following components:
    • The AWS CloudHSM client package.
    • The AWS CloudHSM Java JCE package.
    • OpenJDK 1.8. This is needed to compile and run the Java classes and JAR files.
    • Maven, a build automation tool that is needed to assist with building the Java classes and JAR files.
    • The AWS CloudHSM Java JCE samples that will be downloaded and built.
  7. Cut and paste the contents below into Dockerfile.

    Note: Make sure to replace the HSM_IP line with the IP of an HSM in your CloudHSM cluster. You can get your HSM IPs from the CloudHSM console, or by running the describe-clusters AWS CLI command.

    
            # Use the amazon linux image
            FROM amazonlinux:2
            
            # Install CloudHSM client
            RUN yum install -y https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-client-latest.el7.x86_64.rpm
            
            # Install CloudHSM Java library
            RUN yum install -y https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-client-jce-latest.el7.x86_64.rpm
            
            # Install Java, Maven, wget, unzip and ncurses-compat-libs
            RUN yum install -y java maven wget unzip ncurses-compat-libs
            
            # Create a work dir
            WORKDIR /app
            
            # Download sample code
            RUN wget https://github.com/aws-samples/aws-cloudhsm-jce-examples/archive/master.zip
            
            # unzip sample code
            RUN unzip master.zip
            
            # Change to the create directory
            WORKDIR aws-cloudhsm-jce-examples-master
            
            # Build JAR files
            RUN mvn validate && mvn clean package
            
            # Set HSM IP as an environmental variable
            ENV HSM_IP <insert the IP address of an active CloudHSM instance here>
            
            # Configure cloudhsm-client
            COPY customerCA.crt /opt/cloudhsm/etc/
            RUN /opt/cloudhsm/bin/configure -a $HSM_IP
            
            # Copy the run_sample.sh script
            COPY run_sample.sh .
            
            # Run the script
            CMD ["bash","run_sample.sh"]                        
            

  8. Now you’re ready to build the Docker image. Use the following command, with the name jce_sample_client. This command will let you use the Dockerfile you created in step 6 to create the image.
    
            # sudo docker build -t jce_sample_client .
            

  9. To run a Docker container from the Docker image you just created, use the following command. Make sure to replace the user and password with your actual CU username and password. (If you need help setting up your CU credentials, see prerequisite 3. For more information on how to provide CU credentials to the AWS CloudHSM Java JCE Library, refer to the steps in the CloudHSM user guide.)
    
            # sudo docker run --env HSM_PARTITION=PARTITION_1 \
            --env HSM_USER=<user> \
            --env HSM_PASSWORD=<password> \
            jce_sample_client
            

    If successful, the output should look like this:

    
            * Starting cloudhsm-client ... [OK]
            
            * cloudhsm-client started successfully ...
            
            * Running application ...
            
            ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors 
            to the console.
            70132FAC146BFA41697E164500000000
            Successful decryption
                SDK Version: 2.03
            
            * Application completed successfully ...          
            

Conclusion

My solution provides an example of how to run CloudHSM workloads on Docker containers. You can use it as a reference to implement your cryptographic application in a way that benefits from the high availability and load balancing built in to AWS CloudHSM without compromising on the flexibility that Docker provides for developing, deploying, and running applications. If you have comments about this post, submit them in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Mohamed AboElKheir

Mohamed AboElKheir joined AWS in September 2017 as a Security CSE (Cloud Support Engineer) based in Cape Town. He is a subject matter expert for CloudHSM and is always enthusiastic about assisting CloudHSM customers with advanced issues and use cases. Mohamed is passionate about InfoSec, specifically cryptography, penetration testing (he’s OSCP certified), application security, and cloud security (he’s AWS Security Specialty certified).

Scheduling GPUs for deep learning tasks on Amazon ECS

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/scheduling-gpus-for-deep-learning-tasks-on-amazon-ecs/

This post is contributed by Brent Langston – Sr. Developer Advocate, Amazon Container Services

Last week, AWS announced enhanced Amazon Elastic Container Service (Amazon ECS) support for GPU-enabled EC2 instances. This means that now GPUs are first class resources that can be requested in your task definition, and scheduled on your cluster by ECS.

Previously, to schedule a GPU workload, you had to maintain your own custom configured AMI, with a custom configured Docker runtime. You also had to use custom vCPU logic as a stand-in for assigning your GPU workloads to GPU instances. Even when all that was in place, there was still no pinning of a GPU to a task. One task might consume more GPU resources than it should. This could cause other tasks to not have a GPU available.

Now, AWS maintains an ECS-optimized AMI that includes the correct NVIDIA drivers and Docker customizations. You can use this AMI to provision your GPU workloads. With this enhancement, GPUs can also be requested directly in the task definition. Like allocating CPU or RAM to a task, now you can explicitly request a number of GPUs to be allocated to your task. The scheduler looks for matching resources on the cluster to place those tasks. The GPUs are pinned to the task for as long as the task is running, and can’t be allocated to any other tasks.

I thought I’d see how easy it is to deploy GPU workloads to my ECS cluster. I’m working in the US-EAST-2 (Ohio) region, from my AWS Cloud9 IDE, so these commands work for Amazon Linux. Feel free to adapt to your environment as necessary.

If you’d like to run this example yourself, you can find all the code in this GitHub repo. If you run this example in your own account, be aware of the instance pricing, and clean up your resources when your experiment is complete.

Clone the repo using the following command:

git clone https://github.com/brentley/tensorflow-container.git

Setup

You need the latest version of the AWS CLI (for this post, I used 1.16.98):

echo "export PATH=$HOME/.local/bin:$HOME/bin:$PATH" >> ~/.bash_profile
source ~/.bash_profile
pip install --user -U awscli

Provision an ECS cluster with two C5 instances and two P3 instances:

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --capabilities CAPABILITY_IAM                            

While AWS CloudFormation is provisioning resources, examine the template used to build your infrastructure. Open `cluster-cpu-gpu.yml`, and you see that you are provisioning a test VPC with two c5.2xlarge instances, and two p3.2xlarge instances. This gives you one NVIDIA Tesla V100 GPU per instance, for a total of two GPUs to run training tasks.

I adapted the TensorFlow benchmark Docker container to create a training workload. I use this container to compare the GPU scheduling and runtime.

When the CloudFormation stack is deployed, register a task definition with the ECS service:

aws ecs register-task-definition --cli-input-json file://gpu-1-taskdef.json

To request GPU resources in the task definition, the only change needed is to include a GPU resource requirement in the container definition:

            "resourceRequirements": [
                {
                    "type": "GPU",
                    "value": "1"
                }
            ],

Including this resource requirement ensures that the ECS scheduler allocates the task to an instance with a free GPU resource.

Launch a single-GPU training workload

Now you’re ready to launch the first GPU workload.

export cluster=$(aws cloudformation describe-stacks --stack-name tensorflow-test --query 'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' --output text)
echo $cluster
aws ecs run-task --cluster $cluster --task-definition tensorflow-1-gpu

When you launch the task, the output shows the `gpuIds` values that are assigned to the task. This GPU is pinned to this task, and can’t be shared with any other tasks. If all GPUs are allocated, you can’t schedule additional GPU tasks until a running task with a GPU completes. That frees the GPU to be scheduled again.

When you look at the log output in Amazon CloudWatch Logs, you see that the container discovered one GPU: `/gpu0` and the training benchmark trained at a rate of 321.16 images/sec.

With your two p3.2xlarge nodes in the cluster, you are limited to two concurrent single GPU based workloads. To scale horizontally, you could add additional p3.2xlarge nodes. This would limit your workloads to a single GPU each.  To scale vertically, you could bump up the instance type,  which would allow you to assign multiple GPUs to a single task.  Now, let’s see how fast your TensorFlow container can train when assigned multiple GPUs.

Launch a multiple-GPU training workload

To begin, replace the p3.2xlarge instances with p3.16xlarge instances. This gives your cluster two instances that each have eight GPUs, for a total of 16 GPUs that can be allocated.

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --parameter-overrides GPUInstanceType=p3.16xlarge --capabilities CAPABILITY_IAM

When the CloudFormation deploy is complete, register two more task definitions to launch your benchmark container requesting more GPUs:

aws ecs register-task-definition --cli-input-json file://gpu-4-taskdef.json  
aws ecs register-task-definition --cli-input-json file://gpu-8-taskdef.json 

Next, launch two TensorFlow benchmark containers, one requesting four GPUs, and one requesting eight GPUs:

aws ecs run-task --cluster $cluster --task-definition tensorflow-4-gpu
aws ecs run-task --cluster $cluster --task-definition tensorflow-8-gpu

With each task request, GPUs are allocated: four in the first request, and eight in the second request. Again, these GPUs are pinned to the task, and not usable by any other task until these tasks are complete.

Check the log output in CloudWatch Logs:

On the “devices” lines, you can see that the container discovered and used four (or eight) GPUs. Also, the total images/sec improved to 1297.41 with four GPUs, and 1707.23 with eight GPUs.

Because you can pin single or multiple GPUs to a task, running advanced GPU based training tasks on Amazon ECS is easier than ever!

Cleanup

To clean up your running resources, delete the CloudFormation stack:

aws cloudformation delete-stack --stack-name tensorflow-test

Conclusion

For more information, see Working with GPUs on Amazon ECS.

If you want to keep up on the latest container info from AWS, please follow me on Twitter and tweet any questions! @brentContained

Amazon ECS Task Placement

Post Syndicated from tiffany jernigan (@tiffanyfayj) original https://aws.amazon.com/blogs/compute/amazon-ecs-task-placement/

Intro

Amazon Elastic Container Service (ECS) is a highly scalable, high-performance container orchestration service that allows you to easily run and scale containerized applications on AWS. This post covers how Amazon Elastic Container Service (Amazon ECS) runs containers in a cluster. Topics include why AWS built the task placement engine, the different strategies and constraints available to decide where and how containers are run, and things to consider when picking placement strategies.

If you are not familiar with the relationship between ECS and Amazon EC2 or its components, see the Building Blocks of Amazon ECS post.

Task Placement

When a task is launched in a cluster, a decision has to be made to choose which container instance should run that task. Conversely, when scaling down a service, a decision has to be made to choose the specific task to be terminated.

By default, ECS uses the following placement strategies:

  • When you run tasks with the RunTask API action, tasks are placed randomly in a cluster.
  • When you launch and terminate tasks with the CreateService API action, the service scheduler spreads the tasks across the Availability Zones (and the instances within the zones) in a cluster.

Before December 2016, tasks could only be placed using these default strategies. To achieve custom task placement, you had to make the decision yourself, such as by writing your own scheduler and calling the StartTask API action. Even then, when you manually constrained the placement of your grouping of containers, you could only place based on CPU, memory, and ports. Additionally, while creating your own scheduler can be powerful, there’s a tradeoff with complexity.

AWS built the task placement engine, which removes the need for you to build, run, and manage your own scheduling and placement services. There are several new features that provide you with more control over how applications run across clusters through custom attributes.

You can think of this flow as a funnel with filters for your instances. Constraints must be obeyed. If an instance doesn’t fit, it isn’t used. Strategies are then used to sort the rest of the instances by preference to determine which are the “best.”

For every instantiation of your task, it runs through every step. Calling run-task with a count of n is effectively calling run-task n times (create-service also works the same way).

Figure: The task placement flow (cluster constraints, placement constraints, placement strategies)

Example

Here’s how to use these placement features. In this example, you use the AWS CLI run-task command. For the last couple of filters, I show how to use them with placement flags, but you can just as easily include them in your task definition file instead. This can all be done in the console as well. Start with the cluster shown earlier:

Figure: Container instances in the example cluster

aws ecs run-task --task-definition nouvelleApp \
--placement-constraints type="memberOf",expression="attribute:ecs.instance-type == t2.small" \
--placement-strategy type="binpack",field="memory" \
--count 8

Cluster constraints

In the first step, eliminate all the instances that don’t have the required resources, based on what you defined in the JSON task definition or provided as overrides to RunTask.

Not enough CPU? Not enough memory? A port is needed, but it is already in use on that instance? Then the instance is eliminated from the set of valid candidates.

aws ecs run-task --task-definition nouvelleApp

Placement constraints

In the second step, keep only the instances that satisfy the attribute or task group constraints. Yes, this means that you can indicate what instance to use for a task (for example, to make sure that CPU-intensive jobs are scheduled on the right type of instance, or in which Availability Zone).

You can also create any custom tags of your choosing. The green tasks on the green instances, the blue tasks on the blue instances! You can also use the Cluster Query Language to write expressions to check for multiple attributes. In the next section, I cover how to write and use the attributes and expressions.

--placement-constraints type="memberOf",expression="attribute:ecs.instance-type == t2.micro"

Placement strategies

In the third step, filter on the following supported task placement strategies:

  • random
  • binpack
  • spread

By default, tasks are randomly placed with RunTask or spread across Availability Zones with CreateService. Spread is typically used to achieve high availability by making sure that multiple copies of a task are scheduled across multiple instances based on attributes such as Availability Zones.

Conversely, binpack places tasks together to be as cost-efficient as possible. Later in this post, you’ll see how these placement strategies work, as well as how to chain them together and why you may want to do so.

--placement-strategy type="binpack",field="memory"

Task copies

This isn’t part of the filter, but instead, the count flag is used to indicate how many copies (n) of a given task to run. Effectively, it tells ECS to re-run this workflow n times. By default, the count is set to 1, so run-task is executed one time. For services, the desired-count flag is used.

--count 8

Attributes, task groups, and expressions

For task placement, you can use instance fields, such as attributes, as well as task groups. These can be used in expressions for task placement constraints, or instance fields can be used standalone for task placement strategies. Here’s a quick overview of attributes, task groups, and expressions before you go any further.

Instance: Fields

Because you are using these fields with respect to instances in task placement, the instance: prefix is optional; a field name or an attribute can be written in either of the following ways.

instance:<field>
<field>

Field names

The currently supported field names are as follows:

ec2InstanceId
agentConnected

Attributes

There are also instance attributes, which are prefixed with attribute:.  Again, instance: is optional:

attribute:<attribute-name>

Built-in attributes

The following are some of the provided attributes:

ecs.ami-id
ecs.availability-zone
ecs.instance-type
ecs.os-type
ecs.subnet-id
ecs.vpc-id

Custom attributes

Well, what if you don’t see an attribute that you want? This is where custom attributes come in handy! Want to differentiate between test and prod? What about blue versus green?

aws ecs put-attributes \
--attributes name=color,value=blue,targetId=<your-container-instance-arn>

Task groups

In addition to placing tasks based on attributes, you can use task groups. Every task is assigned a group ID that you can reference in placement. For both tasks and services, a default ID is given, or you can choose your own. Perhaps you want to run version 2 of a service but only on instances with version 1.

task:group

Expressions

Alright, so you have some attributes and task groups… now what? Well, AWS created the Cluster Query Language to make it easy to create expressions for task placement constraints. These attributes and task groups are used with the available comparison operators, which may look familiar if you’ve used Boolean operators before. Some of these operators can be written in multiple ways, such as “!” or “not”.

For instance, to create an expression using a single attribute to select only t2.micro instances, use the ecs.instance-type attribute and the string equality comparator as follows:

attribute:ecs.instance-type == t2.micro

For t2.micro and t2.nano instances, you have a few options. You could use the same syntax as earlier with the or comparator:

attribute:ecs.instance-type == t2.micro or attribute:ecs.instance-type == t2.nano

Another way is to use the in comparator with an argument list:

attribute:ecs.instance-type in [t2.micro, t2.nano]

To include all t2 instances, use a wildcard and the pattern match operator instead of listing out each one:

attribute:ecs.instance-type =~ t2.*

Task group comparisons work the same way. The following snippet selects any instance upon which the task group “database” is running:

task:group == database

To select only task groups that are not “database,” combine expressions:

not(task:group == database)

You can use these expressions to filter your instances:

aws ecs list-container-instances \
--filter "attribute:ecs.instance-type != t2.micro"
aws ecs list-container-instances \
--filter "attribute:color == blue"
aws ecs list-container-instances \
--filter "task:group == database"

These expressions and attributes, respectively, are also used for task placement constraints and strategies, which I cover in the next few sections.

Constraints

Now look at placement constraints. When determining task placement, there may be certain EC2 instances to include or exclude from running containers. For example, you may want to place tasks only on GPU instance types.

Task placement constraints let you define where your containers should run across your cluster. ECS currently supports two types of placement constraints: distinctInstance and memberOf. By default, ECS spreads tasks across Availability Zones and instances.

  "placementConstraints": [ 
      { 
         "expression": "string",
         "type": "string"
      }
   ],

Distinct Instance

The distinctInstance constraint makes it possible to ensure that every container is started on a unique instance in your cluster. The distinctInstance constraint never places multiple copies of a task on a single instance, even if you request more running tasks than available instances.

For example, if you place five copies of a task, each placement filters out the instances that are already running a copy of the task.

aws ecs run-task --task-definition nouvelleApp \
--count 5 --placement-constraints type="distinctInstance"

Member of

The memberOf constraint describes a set of instances on which your tasks should run. It works with anything you could define as an attribute or a task group, and it takes an expression written in the Cluster Query Language.

For example, if you have a small application and just want it to run on t2.micro instances:

aws ecs run-task --task-definition nouvelleApp \
--count 5 \
--placement-constraints type="memberOf",expression="attribute:ecs.instance-type == t2.micro"

You can create expressions using the Cluster Query Language to check for multiple attributes. Here’s how you can weed out all instances in the us-west-2c Availability Zone as well as instances that aren’t of type t2.nano or t2.micro:

aws ecs run-task --task-definition nouvelleApp \
--count 5 \
--placement-constraints type="memberOf",expression="attribute:ecs.availability-zone != us-west-2c and (attribute:ecs.instance-type == t2.nano or attribute:ecs.instance-type == t2.micro)"

Member of affinity

You can also use constraints to place all tasks with the same task group on the same instance (affinity):

aws ecs run-task --task-definition nouvelleApp \
--count 5 --group webserver \
--placement-constraints type=memberOf,expression="task:group == webserver"

Or you can ensure that instances never have more than one task in the same group (anti-affinity):

aws ecs run-task --task-definition nouvelleApp \
--count 5 --group webserver \
--placement-constraints type="memberOf",expression="not(task:group == webserver)"

Strategies

Now look at placement strategies. Placement strategies are used to choose among the instances that satisfy your constraints. ECS supports three task placement strategies:

  • random
  • binpack
  • spread

Random is how RunTask places tasks by default and is fairly straightforward (it doesn’t require further parameters). The two other strategies, binpack and spread, take opposite actions. Binpack places tasks on as few instances as possible, helping to optimize resource utilization, while spread places tasks evenly across your cluster to help maximize availability. By default, the service scheduler uses spread with the ecs.availability-zone attribute to place tasks.

   "placementStrategy": [ 
      { 
         "field": "string",
         "type": "string"
      }
   ],

Random

 Random places tasks on instances at random. This still honors the other constraints that you specified, implicitly or explicitly. Specifically, it still makes sure that tasks are scheduled on instances with enough resources to run them.

aws ecs run-task --task-definition nouvelleApp \
--count 5 \
--placement-strategy type="random"

Bin packing

The binpack strategy tries to fit your workloads in as few instances as possible. It gets its name from the bin packing problem where the goal is to fit objects of various sizes in the smallest number of bins. It is well suited to scenarios for minimizing the number of instances in your cluster, perhaps for cost savings, and lends itself well to automatic scaling for elastic workloads, to shut down instances that are not in use.

When you use the binpack strategy, you must also indicate whether you are trying to make optimal use of your instances’ CPU or memory. This is done by passing an extra field parameter, which tells the task placement engine which resource to use to evaluate how “full” your “bins” are. It then chooses the instance with the least available CPU or memory (depending on which you pick). If multiple instances are tied for the least remaining amount, it chooses among them randomly.

aws ecs run-task --task-definition nouvelleApp \
--count 8 --placement-strategy type="binpack",field="cpu"

aws ecs run-task --task-definition nouvelleApp \
--count 8 --placement-strategy type="binpack",field="memory"

Spread

The spread strategy, contrary to the binpack strategy, tries to put your tasks on as many different instances as possible. It is typically used to achieve high availability and mitigate risks, by making sure that you don’t put all your task-eggs in the same instance-baskets. Spread across Availability Zones, therefore, is the default placement strategy used for services.

When using the spread strategy, you must also indicate a field parameter. It is used to indicate the “bins” that you are considering. The accepted values are instanceId (or its synonym host) to balance tasks across all instances, or attribute key-value pairs such as attribute:ecs.availability-zone to balance tasks across zones. There are several AWS attributes that start with the “ecs” prefix, but you can also define your own custom attributes.

aws ecs run-task --task-definition nouvelleApp \
--count 8 \
--placement-strategy type="spread",field="attribute:ecs.availability-zone"

Chaining placement strategies

Placement binpack spread

Now that you’ve seen how to use task placement strategies, you can also chain multiple task placement strategies with their respective attributes together. You can have up to five strategy rules per service. Perhaps you want to spread tasks across Availability Zones and binpack:

aws ecs run-task --task-definition nouvelleApp \
--count 8 \
--placement-strategy type="spread",field="attribute:ecs.availability-zone" type="binpack",field="memory"

Use cases

Here are some use cases for task placement so you can see how they can be solved by combining attributes, expressions, constraints, and strategies.

Task creation

Mariya is fairly new to using containers and especially container orchestrators. She wants to try ECS and has a simple application that she first wants to get running on a single node. (Solution: Use the RunTask API.)

aws ecs run-task --task-definition nouvelleApp

Scaling

After trying this, Mariya wants to scale her application to run 10 containers across any available nodes in her cluster. (Solution: This means she needs to run a task using either random or spread placement strategies.)

aws ecs run-task --task-definition nouvelleApp \
--count 10 \
--placement-strategy type="random"

Availability

Mariya then realizes that if she wants her tasks to automatically restart themselves if they fail, or if she wants more than 10 instantiations of her task running, she needs to create a service. (Solution: Create a service.)

aws ecs create-service --task-definition nouvelleApp \
--desired-count 300 --placement-strategy type="random"

Christopher wants to achieve high availability by distributing his tasks amongst all the instances in his cluster so he minimizes impact if any one host goes down. (Solution: To do this he uses spread placement over host name.)

aws ecs run-task --task-definition nouvelleApp \
--count 9 \
--placement-strategy type="spread",field="host"

Ming-ya wants to run a monitoring container on each instance in her cluster. To help her do this, she creates a service with a high desired count and a distinctInstance placement constraint. The ECS service scheduler ensures that each instance in the cluster runs this task (up to the desired count).

aws ecs create-service --service-name monitoring \
--task-definition monitor \
--desired-count 500 \
--placement-constraints type="distinctInstance"

Availability and Task Groups

Alex wants to run a fleet of webservers. For performance reasons, they want each webserver to have local access to a caching process that was written by another team. They define their webserver as one task and the caching server as a second task. When they launch their webserver task, they use a placement constraint so that the tasks are only placed on instances that are already hosting the cache task. (Solution: Use placement constraints with a task group.)

aws ecs run-task --task-definition cache \
--group caching --count 9 \
--placement-constraints type="distinctInstance"

aws ecs run-task --task-definition webserver \
--count 9 \
--placement-constraints type="distinctInstance" type="memberOf",expression="task:group == caching"

Availability and resource optimization

Jake wants to achieve high availability, but he has a limited budget and needs to optimize all the resources he uses. (Solution: Take a balanced approach of spreading over Availability Zones and binpacking on memory within a zone.)

aws ecs run-task --task-definition nouvelleApp \
--count 9 \
--placement-strategy type="spread",field="attribute:ecs.availability-zone" type="binpack",field="memory"

Instance type selection

Aditya has a GPU workload that he wants to run in containers on ECS. He needs to ensure that only GPU-enabled instances are used for this workload. (Solution: Create a service and use a placement constraint so that tasks run only on GPU-enabled instance types, such as g2 or p2, available in the cluster.)

aws ecs create-service --service-name workload \
--task-definition GPU --desired-count 30 \
--placement-constraints type="memberOf",expression="attribute:ecs.instance-type =~ g2* or attribute:ecs.instance-type =~ p2*"

Conclusion

You’ve now looked at task placement at a high level, as well as:

  • Attributes, task groups, and expressions
  • Constraints
  • Strategies
  • Example use cases

To dive deeper into any of these aspects, check out Task Placement. Also, feel free to ask any questions!

@tiffanyfayj

 

Tagging container image repositories on Amazon ECR

Post Syndicated from Brent Langston original https://aws.amazon.com/blogs/compute/tagging-container-image-repositories-on-amazon-ecr/

Starting today, you can add tags to your Amazon Elastic Container Registry (Amazon ECR) resources. This new feature enables better grouping of ECR repositories, better searching and filtering in the console, and better cost allocation. In this post, I show you how to create a tagging strategy.

You might have many ECR repositories and want to start assigning tags to each of them. Two strategies come to mind almost immediately:

  • You could have repositories to host your development Docker images, and keep different repositories for hosting production images.
  • You could group repositories together according to the organization of the development teams.

Or, you can follow both strategies. Here’s a typical ecommerce application as an example. The services are organized as follows:

  • Accounts team
    • users
    • password
    • email
    • 2fa
  • Inventory team
    • catalog
    • pricing
    • backorders
  • Cart team
    • contents
    • shipping
    • coupons

In this example company, there are 10 services to manage. Realistically, these services would easily number into the hundreds or thousands for many ecommerce websites. You would likely have one development set of repositories for each service, and another production set.

Tag the repositories by team and by service level: development or production. Here’s one example, for the Accounts team repos in development.
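If you prefer to script this, a minimal sketch with the AWS CLI might look like the following. The repository name, tag keys (Team, Stage), and values are illustrative, and <aws_account_id> is a placeholder for your account:

# Tag one of the Accounts team's development repositories
aws ecr tag-resource \
--resource-arn arn:aws:ecr:us-east-1:<aws_account_id>:repository/users \
--tags Key=Team,Value=Accounts Key=Stage,Value=dev

# Confirm the tags on the repository
aws ecr list-tags-for-resource \
--resource-arn arn:aws:ecr:us-east-1:<aws_account_id>:repository/users

Repeat the tag-resource call for each repository, changing the tag values to match the owning team and environment.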

You can also use tags in an IAM policy to allow your dev teams access to the development version of their repositories. For example, you can restrict the ECS and EKS instances to their service level, and allow the SRE team access to all repositories.

Here is an example IAM policy that restricts access to the Accounts team:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
            "ecr:GetRepositoryPolicy",
            "ecr:DescribeRepositories",
            "ecr:ListImages",
            "ecr:DescribeImages",
            "ecr:BatchGetImage",
            "ecr:InitiateLayerUpload",
            "ecr:UploadLayerPart",
            "ecr:CompleteLayerUpload",
            "ecr:PutImage"
        ],
        "Resource": "*",
        "Condition": {"StringLike": {"aws:RequestTag/Team": "Accounts"}}
        
    }]
}

After you configure these tags appropriately for your repos and set the IAM policy for specific teams, developers can push and pull from any repo tagged for their team. They can even access future repos that are added with their team’s tag. They do not have access to push or pull from a different team’s repo.

You can also use tags to track costs and review them in the Cost Allocation report. This helps you understand how much you’re spending in development versus production, and how much each service group costs in each environment. Add your billing tags to your repositories.

Summary

With today’s feature launch for adding tags to ECR resources, you can now apply policies and analyze costs by grouping ECR repositories together in many different ways. For more information, see Tagging Your Amazon ECS Resources. For the public roadmap for container releases, see the containers-roadmap on GitHub.

I hope you find this helpful. If you have other creative and useful tagging schemes, for ECR or ECS, please share in the comments.

— Brent

Introducing AWS App Mesh – service mesh for microservices on AWS

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/introducing-aws-app-mesh-service-mesh-for-microservices-on-aws/

AWS App Mesh is a service mesh that allows you to easily monitor and control communications across microservices applications on AWS. You can use App Mesh with microservices running on Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Container Service for Kubernetes (Amazon EKS), and Kubernetes running on Amazon EC2.

Today, App Mesh is available as a public preview. In the coming months, we plan to add new functionality and integrations.

Why App Mesh?

Many of our customers are building applications with microservices architectures, breaking applications into many separate, smaller pieces of software that are independently deployed and operated. Microservices help to increase the availability and scalability of an application by allowing each component to scale independently based on demand. Each microservice interacts with the other microservices through an API.

When you start building more than a few microservices within an application, it becomes difficult to identify and isolate issues. These can include high latencies, error rates, or error codes across the application. There is no dynamic way to route network traffic when there are failures or when new containers need to be deployed.

You can address these problems by adding custom code and libraries into each microservice and using open source tools that manage communications for each microservice. However, these solutions can be hard to install, difficult to update across teams, and complex to manage for availability and resiliency.

AWS App Mesh implements a new architectural pattern that helps solve many of these challenges and provides a consistent, dynamic way to manage the communications between microservices. With App Mesh, the logic for monitoring and controlling communications between microservices is implemented as a proxy that runs alongside each microservice, instead of being built into the microservice code. The proxy handles all of the network traffic into and out of the microservice and provides consistency for visibility, traffic control, and security capabilities to all of your microservices.

Use App Mesh to model how all of your microservices connect. App Mesh automatically computes and sends the appropriate configuration information to each microservice proxy. This gives you standardized, easy-to-use visibility and traffic controls across your entire application.  App Mesh uses Envoy, an open source proxy. That makes it compatible with a wide range of AWS partner and open source tools for monitoring microservices.

Using App Mesh, you can export observability data to multiple AWS and third-party tools, including Amazon CloudWatch, AWS X-Ray, or any third-party monitoring and tracing tool that integrates with Envoy. You can configure new traffic routing controls to enable dynamic blue/green canary deployments for your services.

Getting started

Here’s a sample application with two services, where service A receives traffic from the internet and uses service B for some backend processing. You want to route traffic dynamically between services B and B’, a new version of B deployed to act as the canary.

First, create a mesh, a namespace that groups related microservices that must interact.

Next, create virtual nodes to represent services in the mesh. A virtual node can represent a microservice or a specific microservice version. In this example, service A and B participate in the mesh and you manage the traffic to service B using App Mesh.

Now, deploy your services with the required Envoy proxy and with a mapping to the node in the mesh.

After you have defined your virtual nodes, you can define how the traffic flows between your microservices. To do this, define a virtual router and routes for communications between microservices.

A virtual router handles traffic for your microservices. After you create a virtual router, you create routes to direct traffic appropriately. These routes include the connection requests that the route should accept, where they should go, and the weighted amount of traffic to send. All of these changes to adjust traffic between services are computed and sent dynamically to the appropriate proxies by App Mesh to execute your deployment.
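As a rough sketch of what the weighted routing step could look like with the AWS CLI (the mesh, router, route, and virtual node names here are placeholders, and the exact flags and spec shape may differ while App Mesh is in preview):

# Send roughly 90% of traffic to service B and 10% to the canary B'
aws appmesh create-route \
--mesh-name demo-mesh \
--virtual-router-name serviceB-router \
--route-name serviceB-route \
--spec '{
  "httpRoute": {
    "match": { "prefix": "/" },
    "action": {
      "weightedTargets": [
        { "virtualNode": "serviceB", "weight": 9 },
        { "virtualNode": "serviceB-canary", "weight": 1 }
      ]
    }
  }
}'

Adjusting the weights later shifts traffic between B and B’ without redeploying either service.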

You now have a virtual router set up that accepts all traffic from virtual node A, sending it to the existing version of service B as well as some traffic to the new version, B’.

Exporting metrics, logs, and traces

One of the benefits of placing a proxy in front of every microservice is that you can automatically capture metrics, logs, and traces about the communication between your services. App Mesh enables you to easily collect and export this data to the tools of your choice. Envoy is already integrated with several tools like Prometheus and Datadog.

During the preview, we are adding support for AWS services such as Amazon CloudWatch and AWS X-Ray. We have a lot more integrations planned as well.

Available now

AWS App Mesh is available as a public preview and you can start using it today in the Northern Virginia, Ohio, Oregon, and Ireland AWS Regions. During the preview, we plan to add new features and want to hear your feedback. You can check out our GitHub repository for examples and our roadmap.

— Nate

Scanning Docker Images for Vulnerabilities using Clair, Amazon ECS, ECR, and AWS CodePipeline

Post Syndicated from tiffany jernigan (@tiffanyfayj) original https://aws.amazon.com/blogs/compute/scanning-docker-images-for-vulnerabilities-using-clair-amazon-ecs-ecr-aws-codepipeline/

Post by Vikrama Adethyaa, Solution Architect and Tiffany Jernigan, Developer Advocate

 

Containers are an increasingly important way for you to package and deploy your applications. They are lightweight and provide a consistent, portable software environment for applications to easily run and scale anywhere.

A container is launched from a container image, an executable package that includes everything needed to run an application: the application code, configuration files, runtime (for example, Java, Python, etc.), libraries, and environment variables.

A container image is built up from a series of layers. For a Docker image, each layer in the image represents an instruction in the image’s Dockerfile. A parent image is the image on which your image is built. It refers to the contents of the FROM directive in the Dockerfile. Most Dockerfiles start from a parent image, and often the parent image was downloaded from a public registry.

It is incredibly difficult and time-consuming to manually track all the files, packages, libraries, and so on, included in an image along with the vulnerabilities that they may possess. Having a security breach is one of the costliest things an organization can endure. It takes years to build up a reputation and only seconds to tear it down.

One way to prevent breaches is to regularly scan your images and compare the dependencies to a known list of common vulnerabilities and exposures (CVEs). Public CVE lists contain an identification number, description, and at least one public reference for known cybersecurity vulnerabilities. The automatic detection of vulnerabilities helps increase awareness and best security practices across developer and operations teams. It encourages action to patch and address the vulnerabilities.

This post walks you through the process of setting up an automated vulnerability scanning pipeline. You use AWS CodePipeline to scan your container images for known security vulnerabilities and deploy the container only if the vulnerabilities are within the defined threshold.

This solution uses CoreOS Clair for static analysis of vulnerabilities in container images. Clair is an API-driven analysis engine that inspects containers layer-by-layer for known security flaws. Clair scans each container layer and provides a notification of vulnerabilities that may be a threat, based on the CVE database and similar data feeds from Red Hat, Ubuntu, and Debian.

Deploying Clair

Here’s how to install Clair on AWS. The following diagram shows the high-level architecture of Clair.

Clair uses PostgreSQL, so use Aurora PostgreSQL to host the Clair database. You deploy Clair as an ECS service with the Fargate launch type behind an Application Load Balancer. The Clair container is deployed in a private subnet behind the Application Load Balancer that is hosted in the public subnets. The private subnets must have a route to the internet using the NAT gateway, as Clair fetches the latest vulnerability information from multiple online sources.

Prerequisites

Ensure that the following are installed or configured on your workstation before you deploy Clair:

  • Docker
  • Git
  • AWS CLI installed
  • AWS CLI is configured with your access key ID and secret access key, and the default region as us-east-1

Download the AWS CloudFormation template for deploying Clair

To help you quickly deploy Clair on AWS and set up CodePipeline with automatic vulnerability detection, use AWS CloudFormation templates that can be downloaded from the aws-codepipeline-docker-vulnerability-scan GitHub repository. The repository also includes a simple, containerized NGINX website for testing your pipeline.

# Clone the GitHub repository
git clone https://github.com/aws-samples/aws-codepipeline-docker-vulnerability-scan.git

cd aws-codepipeline-docker-vulnerability-scan

VPC requirements

We recommend a VPC with the following specification for deploying CoreOS Clair:

  • Two public subnets
  • Two private subnets
  • NAT gateways to allow internet access for services in private subnets

You can create such a VPC using the AWS CloudFormation template networking-template.yaml that is included in the sample code you cloned from GitHub.

# Create the VPC
aws cloudformation create-stack \
--stack-name coreos-clair-vpc-stack \
--template-body file://networking-template.yaml

# Verify that stack creation is complete
aws cloudformation wait stack-create-complete \
--stack-name coreos-clair-vpc-stack

# Get stack outputs
aws cloudformation describe-stacks \
--stack-name coreos-clair-vpc-stack \
--query 'Stacks[].Outputs[]'

Build the Clair Docker image

First, create an Amazon Elastic Container Registry (Amazon ECR) repository to host your Clair Docker image. Then, build the Clair Docker image on your workstation and push it to the ECR repository that you created.

# Create the ECR repository
# Note the URI and ARN of the ECR Repository
aws ecr create-repository --repository-name coreos-clair

# Build the Docker image
docker build -t <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/coreos-clair:latest ./coreos-clair

# Push the Docker image to ECR
aws ecr get-login --no-include-email | bash
docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/coreos-clair:latest

Deploy Clair using AWS CloudFormation

Now that the Clair Docker image has been built and pushed to ECR, deploy Clair as an ECS service with the Fargate launch type. The following AWS CloudFormation stack creates an ECS cluster named clair-demo-cluster and deploys the Clair service.

# Create the AWS CloudFormation stack
# <ECRRepositoryUri> - CoreOS Clair ECR repository URI without an image tag
# Example - <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/coreos-clair

aws cloudformation create-stack \
--stack-name coreos-clair-stack \
--template-body file://coreos-clair/clair-template.yaml \
--capabilities CAPABILITY_IAM \
--parameters \
ParameterKey="VpcId",ParameterValue="<VpcId>" \
ParameterKey="PublicSubnets",ParameterValue=\"<PublicSubnet01-ID>,<PublicSubnet02-ID>\" \
ParameterKey="PrivateSubnets",ParameterValue=\"<PrivateSubnet01-ID>,<PrivateSubnet02-ID>\" \
ParameterKey="ECRRepositoryUri",ParameterValue="<ECRRepositoryUri>"

# Verify that stack creation is complete
aws cloudformation wait stack-create-complete \
--stack-name coreos-clair-stack

# Get stack outputs
# Note the ClairAlbDnsName
aws cloudformation describe-stacks \
--stack-name coreos-clair-stack \
--query 'Stacks[].Outputs[]'

Deploying the sample website

Deploy a simple static website running on NGINX as a container. An AWS CloudFormation template is included in the sample code that you cloned from GitHub.

Create a CodeCommit repository for the NGINX website

You create an AWS CodeCommit repository to host the sample NGINX website code. This repository is the source of the pipeline that you create later. Before you proceed with the following steps, ensure SSH authentication to CodeCommit.

# Create the CodeCommit repository
# Note the cloneUrlSsh value
aws codecommit create-repository --repository-name my-nginx-website
 
# Clone the empty CodeCommit repository
cd ../
git clone <cloneUrlSsh>

# Copy the contents of nginx-website to my-nginx-website
cp -R aws-codepipeline-docker-vulnerability-scan/nginx-website/ my-nginx-website/

# Commit the changes
cd my-nginx-website/
git add *
git commit -m "Initial commit"
git push

Build the NGINX Docker image

Create an ECR repository to host your NGINX website Docker image. Build the image on your workstation using the file Dockerfile-amznlinux, where Amazon Linux is the parent image. After the image is built, push it to the ECR repository that you created.

# Create an ECR repository
# Note the URI and ARN of the ECR repository
aws ecr create-repository --repository-name nginx-website

# Build the Docker image
docker build -f Dockerfile-amznlinux -t <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/nginx-website:latest .

# Push the Docker image to ECR
docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/nginx-website:latest

Deploy the NGINX website using AWS CloudFormation

Now deploy the NGINX website. The following stack deploys the NGINX website onto the same ECS cluster (clair-demo-cluster) as Clair.

# Create the AWS CloudFormation stack
# <ECRRepositoryUri> - Nginx-Website ECR Repository URI without Image tag
# Example: <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/nginx-website

cd ../aws-codepipeline-docker-vulnerability-scan/

aws cloudformation create-stack \
--stack-name nginx-website-stack \
--template-body file://nginx-website/nginx-website-template.yaml \
--capabilities CAPABILITY_IAM \
--parameters \
ParameterKey="VpcId",ParameterValue="<VpcId>" \
ParameterKey="PublicSubnets",ParameterValue=\"<PublicSubnet01-ID>,<PublicSubnet02-ID>\" \
ParameterKey="PrivateSubnets",ParameterValue=\"<PrivateSubnet01-ID>,<PrivateSubnet02-ID>\" \
ParameterKey="ECRRepositoryUri",ParameterValue="<ECRRepositoryUri>"

# Verify that stack creation is complete
aws cloudformation wait stack-create-complete \
--stack-name nginx-website-stack

# Get stack outputs
aws cloudformation describe-stacks \
--stack-name nginx-website-stack \
--query 'Stacks[].Outputs[]'

Note the AWS CloudFormation stack outputs. The stack output contains the Application Load Balancer URL for the NGINX website and the ECS service name of the NGINX website. You need the ECS service name for the pipeline.

Building the pipeline

In this section, you build a pipeline to automate vulnerability scanning for the nginx-website Docker image builds. Every time that a code change is made, the Docker image is rebuilt and scanned for vulnerabilities. The container is deployed onto ECS only if the vulnerabilities are within the defined threshold. For more information, see Tutorial: Continuous Deployment with AWS CodePipeline.

The sample code includes an AWS CloudFormation template to create the pipeline. The buildspec.yml file is used by AWS CodeBuild to build the nginx-website Docker image and scan the image using Clair.

CodeBuild build spec

A build spec is a collection of build commands and related settings, in YAML format, that AWS CodeBuild uses to run a build. You can include a build spec in the root directory of your application source code, or you can define a build spec when you create a build project.

In this sample app, you include the build spec in the root directory of your sample application source code. The buildspec.yml file is located in the /aws-codepipeline-docker-vulnerability-scan/nginx-website folder.

Use Klar, a simple tool to analyze images stored in a private or public Docker registry for security vulnerabilities using Clair. Klar serves as a client which coordinates the image checks between ECR and Clair.

In the buildspec.yml file, you set the variable CLAIR_OUTPUT=Critical. CLAIR_OUTPUT defines the severity level threshold: vulnerabilities with severity levels equal to or higher than this threshold are reported. The supported levels are:

  • Unknown
  • Negligible
  • Low
  • Medium
  • High
  • Critical
  • Defcon1

You can configure Klar to your requirements by setting the variables as defined in https://github.com/optiopay/klar.

# Set the following variables as CodeBuild project environment variables
# ECR_REPOSITORY_URI
# CLAIR_URL

version: 0.2
phases:
  pre_build:
    commands:
      - echo Fetching ECR Login
      - ECR_LOGIN=$(aws ecr get-login --region $AWS_DEFAULT_REGION --no-include-email)
      - echo Logging in to Amazon ECR...
      - $ECR_LOGIN
      - IMAGE_TAG=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - echo Downloading Clair client Klar-2.1.1
      - wget https://github.com/optiopay/klar/releases/download/v2.1.1/klar-2.1.1-linux-amd64
      - mv ./klar-2.1.1-linux-amd64 ./klar
      - chmod +x ./klar
      - PASSWORD=`echo $ECR_LOGIN | cut -d' ' -f6`
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker build -t $ECR_REPOSITORY_URI:latest .
      - docker tag $ECR_REPOSITORY_URI:latest $ECR_REPOSITORY_URI:$IMAGE_TAG
  post_build:
    commands:
      - bash -c "if [ /"$CODEBUILD_BUILD_SUCCEEDING/" == /"0/" ]; then exit 1; fi"
      - echo Build completed on `date`
      - echo Pushing the Docker images...
      - docker push $ECR_REPOSITORY_URI:latest
      - docker push $ECR_REPOSITORY_URI:$IMAGE_TAG
      - echo Running Clair scan on the Docker Image
      - DOCKER_USER=AWS DOCKER_PASSWORD=${PASSWORD} CLAIR_ADDR=$CLAIR_URL CLAIR_OUTPUT=Critical ./klar $ECR_REPOSITORY_URI
      - echo Writing image definitions file...
      - printf '[{"name":"MyWebsite","imageUri":"%s"}]' $ECR_REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json
artifacts:
  files: imagedefinitions.json

The build spec does the following:

Pre-build stage:

  • Log in to ECR.
  • Download the Clair client Klar.

Build stage:

  • Build the Docker image and tag it as latest and with the Git commit ID.

Post-build stage:

  • Push the image to your ECR repository with both tags.
  • Trigger Klar to scan the image that you pushed to ECR for security vulnerabilities using Clair.
  • Write a file called imagedefinitions.json in the build root that has your Amazon ECS service’s container name and the image and tag. The deployment stage of your CD pipeline uses this information to create a new revision of your service’s task definition. It then updates the service to use the new task definition. The imagedefinitions.json file is required for the AWS CodeDeploy ECS job worker.

Deploy the pipeline

Deploy the pipeline using the AWS CloudFormation template provided with the sample code. The following template creates the CodeBuild project, CodePipeline pipeline, Amazon CloudWatch Events rule, and necessary IAM permissions.

# Deploy the pipeline
 
# Replace the following variables 
# WebsiteECRRepositoryARN – NGINX website ECR repository ARN
# WebsiteECRRepositoryURI – NGINX website ECR repository URI
# ClairAlbDnsName - Output variable from coreos-clair-stack
# EcsServiceName – Output variable from nginx-website-stack

aws cloudformation create-stack \
--stack-name nginx-website-codepipeline-stack \
--template-body file://clair-codepipeline-template.yaml \
--capabilities CAPABILITY_IAM \
--disable-rollback \
--parameters \
ParameterKey="EcrRepositoryArn",ParameterValue="<WebsiteECRRepositoryARN>" \
ParameterKey="EcrRepositoryUri",ParameterValue="<WebsiteECRRepositoryURI>" \
ParameterKey="ClairAlbDnsName",ParameterValue="<ClairAlbDnsName>" \
ParameterKey="EcsServiceName",ParameterValue="<WebsiteECSServiceName>"

# Verify that stack creation is complete
aws cloudformation wait stack-create-complete \
--stack-name nginx-website-codepipeline-stack

The pipeline is triggered after the AWS CloudFormation stack creation is complete. You can log in to the AWS Management Console to monitor the status of the pipeline. The vulnerability scan information is available in CloudWatch Logs.

You can also modify the CLAIR_OUTPUT value from Critical to High in the buildspec.yml file in your my-nginx-website repository and then check the status of the build.

Summary

I’ve described how to deploy Clair on AWS and set up a release pipeline for the automated vulnerability scanning of container images. The Clair instance can be used as a centralized Docker image vulnerability scanner and used by other CodeBuild projects. To meet your organization’s security requirements, define your vulnerability threshold in Klar by setting the variables, as defined in https://github.com/optiopay/klar.

Introducing private registry authentication support for AWS Fargate

Post Syndicated from tiffany jernigan (@tiffanyfayj) original https://aws.amazon.com/blogs/compute/introducing-private-registry-authentication-support-for-aws-fargate/

Private registry authentication support for Amazon Elastic Container Service (Amazon ECS) is now available with the AWS Fargate launch type! Now, in addition to Amazon Elastic Container Registry (Amazon ECR), you can use any private registry or repository of your choice for both EC2 and Fargate launch types.

For ECS to pull from a private repository, it needs a secret in AWS Secrets Manager with your registry credentials, an ECS task execution IAM role in AWS Identity and Access Management (IAM) with a policy granting access to the secret, and a task with the secret and task execution IAM role ARNs in the task definition.

Diagram of ECS Private Registry Authentication Architecture

Here’s how to use ECS with a private repository on Docker Hub via the AWS Management Console.

Registry

If you don’t already have a private repository (or account), you can create a free repo now. To follow along, run the following commands in a terminal to pull an image, get the image ID, and push it to your new repository:

docker pull tiffanyfay/space
docker images tiffanyfay/space --format {{.ID}}
docker tag <image-id> <your-username/repository-name>:latest
docker login
docker push <your-username/repository-name>

Secrets Manager

In the Secrets Manager console, store a new secret with your Docker Hub credentials, which is used to access your private repository.

By default, Secrets Manager creates an encryption key, DefaultEncryptionKey, on your behalf. You can instead use an existing key or add a new one with AWS Key Management Service (AWS KMS), if you would prefer.

Choose Other type of secrets and add secret keys and values for username and password.

Next, create a name, such as dockerhub, and description for your secret.

Because the keys correspond to your Docker Hub credentials, leave rotation disabled.

On the next page, you can review your settings and store your secret. Open your new secret to see the details. Write down the Secret ARN value and keep it handy, as it is used in the next step and later, in your task definition.
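If you’d rather script this step, a minimal sketch with the AWS CLI follows. The secret name matches the walkthrough, and the username and password values are placeholders:

# Store Docker Hub credentials as a secret and note the ARN in the output
aws secretsmanager create-secret \
--name dockerhub \
--description "Docker Hub credentials for ECS private registry authentication" \
--secret-string '{"username":"<your-dockerhub-username>","password":"<your-dockerhub-password>"}'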

IAM

Now that you have a secret, you need to provide Fargate permissions to read it. This is done via a task execution IAM role.

In the IAM console, choose Policies, Create policy. Provide Secrets Manager with read access for secretsmanager:GetSecretValue, with your secret’s ARN as the resource.

Name your policy dockerhubsecret.

If you chose to use your own encryption key, you also need to create a policy with kms:Decrypt permissions for KMS.
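Here’s a sketch of creating the same policy from the AWS CLI; the policy name matches the console walkthrough and <secret-ARN> is the ARN you noted earlier:

aws iam create-policy \
--policy-name dockerhubsecret \
--policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "secretsmanager:GetSecretValue",
    "Resource": "<secret-ARN>"
  }]
}'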

Next, choose Role to create an IAM role, which is used as your task execution role. Choose AWS service, Elastic Container Service, and Elastic Container Service Task.

Search for your dockerhubsecret policy and attach it to the role.

Lastly, give the role a name, such as ecsExecutionRoleDockerHub, and create it. Copy the role ARN value. Depending on how you create your task definition, you may need it.

ECS

While the mechanism to authenticate private registries is supported on both EC2 and Fargate launch types, for this example we will be launching a task on Fargate.

Before you can create a task, you need an ECS cluster, VPC, and subnets. If you don’t already have them, in the ECS console, choose Clusters, Get Started. Keep track of the cluster name, VPC ID, and subnet IDs, as you use them soon.

It’s time to create your task definition, which is used to create your task (grouping of up to ten containers that run on the same host). This is where you need your Secrets Manager ARN and IAM role name.

Choose Task Definitions, Create new Task Definition, and select the Fargate launch type. You can then configure your task definition via the wizard, or scroll down, choose Configure via JSON, and paste the following task definition after replacing the fields in angle brackets. This task definition also works with the EC2 launch type.

{
    "family": "space-td",
    "containerDefinitions": [
        {
            "name": "space",
            "image": "<your-username/repository-name>",
            "portMappings": [
                {
                    "protocol": "tcp",
                    "containerPort": 80
                }
            ],
            "cpu": 0,
            "repositoryCredentials": {
                "credentialsParameter": "<secret-ARN>"
            }
        }
    ],
    "memory": "512",
    "cpu": "256",
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "networkMode": "awsvpc",
    "executionRoleArn": "<execution-role-ARN>"
}

If you use the wizard, give your task a name, such as space-td, and specify your task execution IAM role (ecsExecutionRoleDockerHub), a task size of 0.5 GB of memory, and 0.25 vCPU.

Next, choose Container Definitions, Add container. Give the container a name, specify your image <your-username/repository-name>, check the box for private registry authentication, and add your Secrets Manager ARN and a container port of 80. Choose Add.

After you create your task definition, choose Actions, Run Task. Specify the Fargate launch type, your cluster, the cluster VPC, subnets, and a security group with inbound permissions for your container ports (the default one provides access to port 80). Enable auto-assigning a public IP address.
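If you prefer the CLI, running the task might look roughly like the following; the cluster name, subnet ID, and security group ID are placeholders for the values from your own setup:

aws ecs run-task \
--cluster <your-cluster-name> \
--task-definition space-td \
--launch-type FARGATE \
--count 1 \
--network-configuration "awsvpcConfiguration={subnets=[<subnet-id>],securityGroups=[<security-group-id>],assignPublicIp=ENABLED}"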

Open the task from its ID to see the details:

When the Last status field is RUNNING, under Network, copy the public IP address and paste it in a browser.

If you pushed tiffanyfay/space to your repository, you should see the following:

I hope this post has helped you. If you have any questions, feel free to reach out!

-tiffany

Special thanks to Yuling Zhou, Deepak Dayama, Derek Petersen, Varun Iyer, Adnan Khan and several others for their insights in this blog.

tiffany jernigan

tiffany jernigan

@tiffanyfayj
Tiffany is a developer advocate at Amazon for containers on AWS. Previously she worked at Docker and Intel in software engineering and as a hardware engineer after graduating from Georgia Tech in Electrical Engineering. In the majority of her free time she dabbles in photography and spends time with family and friends. You can find her on twitter/ig as tiffanyfayj.

Hosting ASP.NET Core applications in Amazon ECS using AWS Fargate

Post Syndicated from Sundar Narasiman original https://aws.amazon.com/blogs/compute/hosting-asp-net-core-applications-in-amazon-ecs-using-aws-fargate/

There is an increasing amount of customer interest in hosting microservices-based applications using Amazon Elastic Container Service (ECS), largely due to the benefits offered by AWS Fargate.

AWS Fargate is a compute engine for containers that allows you to run containers without needing to provision, manage, or scale any Amazon EC2 compute infrastructure. Fargate works with Amazon ECS and can run microservices developed in many programming languages or application frameworks. This includes Java, .NET Core, Python, Node.js, Go, or Ruby on Rails. Nowadays, enterprises that are building microservices applications using .NET are using .NET core because of the cross-platform support (the ability to run in Linux).

In this post, I cover how to host a cross-platform ASP.NET core application using AWS Fargate.

Reference architecture

A good reference architecture for AWS Fargate application deployment should cover the VPC, Subnets, Load Balancer, Internet Gateway, Elastic Network Interface (ENI), AWS Fargate Task, Network ACLs, and Security Groups. The architectural choices for VPC Networking, Load Balancing, and Container Networking are also important.

There are a couple of networking approaches for deploying containers in Amazon ECS:

  • Deploy containers in the public VPC Subnet with direct Internet access
  • Deploy containers in the private VPC Subnet without direct Internet access

Because the ASP.NET Core application is going to serve traffic from the Internet, we will deploy containers in the Public VPC Subnet with direct Internet access.

When it comes to sending traffic to containers through the Load Balancer, the following options are available:

  • A public Load Balancer that accepts traffic from the Internet and routes it to the container through the AWS Fargate Task’s Elastic Network Interface (ENI).
  • A private, Internal Load Balancer that only accepts traffic from other containers in the cluster

Because the ASP.NET Core application container lives in the web tier, go with a public Load Balancer. The public Load Balancer accepts traffic from the Internet and routes it to the container through the AWS Fargate Task’s Elastic Network Interface (ENI).

Based on these considerations, the reference architecture for deploying to AWS Fargate should look like this diagram:

This solution deploys containers in a public Subnet (inside a VPC). The AWS Fargate Task and the two containers are hosted with direct access to the internet. They are also accessible to clients, using the public Load Balancer.

Walkthrough

To implement this architecture, we will do the following:

  1. Containerize the ASP.NET core application.
  2. Configure the reverse-proxy server.
  3. Containerize the NGINX reverse-proxy server.
  4. Create the Docker Compose file.
  5. Push container images to Amazon ECR.
  6. Create the ECS cluster.
  7. Create an Application Load Balancer.
  8. Create an AWS Fargate Task definition.
  9. Create the Amazon ECS service.

Code examples

The code examples, Dockerfile definition, Docker Compose file, and ECS task definition for this solution are available in the amazon-ecs-fargate-aspnetcore GitHub repository.

Pre-requisites

The development environment needs to have the following prerequisites:

  • Mac OS latest version (or) Windows 10 with latest updates (or) Ubuntu 16.0.4 or higher
  • .NET core 2.0 or higher
  • Docker latest version
  • AWS CLI
  • Amazon ECS CLI (ecs-cli)

Containerize the ASP.NET Core application

The first step in this journey is to containerize the ASP.NET Core application.

If you are using Visual Studio 2017 or later with the latest updates in Windows, you can add container support to the solution. Open the context (right-click) menu for the existing project and add Docker support.

If you are developing in Linux or Mac OS, you must explicitly add a Dockerfile.

The Dockerfile definition should look like the following, irrespective of the operating system used for development.

FROM microsoft/aspnetcore:2.0
WORKDIR /mymvcweb
COPY bin/Release/netcoreapp2.0/publish . 
ENV ASPNETCORE_URLS http://+:5000
EXPOSE 5000
ENTRYPOINT ["dotnet", "mymvcweb.dll"]

This Dockerfile definition creates an application container based on the microsoft/aspnetcore:2.0 base image. It publishes the contents of the bin/Release folder to a specified work directory, starts the default Kestrel web server and listens on port 5000 to serve web traffic.

By default, ASP.NET core uses Kestrel as the web server. Kestrel is a lightweight HTTP server and is great for serving dynamic content from ASP.NET core. However, for capabilities such as serving static content, caching requests, compressing requests, and terminating SSL from the HTTP server, a dedicated reverse-proxy server like NGINX is required.

Configure the reverse-proxy server

NGINX can act as both an HTTP server and a reverse-proxy server. NGINX is widely adopted because of its asynchronous, event-driven architecture, which allows it to serve thousands of concurrent requests with a low memory footprint.

In this solution, deploy a NGINX (reverse-proxy server) container in front of the application (ASP.NET core) container, defined in the AWS Fargate Task.

The reverse-proxy configuration file nginx.conf should be defined as follows:

worker_processes 4;
 
events { worker_connections 1024; }
 
http {
    sendfile on;
 
    upstream app_servers {
        server 127.0.0.1:5000;
    }
 
    server {
        listen 80;
 
        location / {
            proxy_pass         http://app_servers;
            proxy_redirect     off;
            proxy_set_header   Host $host;
            proxy_set_header   X-Real-IP $remote_addr;
            proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header   X-Forwarded-Host $server_name;
        }
    }
}

The NGINX container is set to listen on port 80 and is configured to forward requests to the application container listening on port 5000. In the local development environment, the server entry in the upstream app_servers block of nginx.conf must be set to mymvcweb:5000 (the Docker Compose service name).

Containerize the NGINX reverse-proxy server

Create a Dockerfile definition to containerize the NGINX reverse-proxy server. It should look like the following:

FROM nginx
COPY nginx.conf /etc/nginx/nginx.conf

Create the Docker Compose file

Next, use docker-compose to define these two containers as microservices in the local development environment. The Docker Compose file should look like the following:

version: '2'
services:
  mymvcweb:
    build:
      context: ./mymvcweb
      dockerfile: Dockerfile
    expose:
      - "5000"
  reverseproxy:
    build:
      context: ./reverseproxy
      dockerfile: Dockerfile
    ports:
      - "80:80"
    links:
      - mymvcweb

These two containers can be built and tested by issuing the following docker-compose commands:

docker-compose build
docker-compose up

Open http://localhost:80 in the browser and it should render the default view of index.cshtml. Whenever there is a change to the application code or container definition, the docker-compose cache should be cleaned to pick up the latest changes. To do this, run the following commands:

docker-compose stop
docker-compose rm
docker rmi <container-image-id>

Push container images to Amazon ECR

Next, push the container images from the local environment to Amazon Elastic Container Registry (ECR) so that the container images are available in Amazon ECR before you create the AWS Fargate cluster.

Before you deploy this application to ECS, the server entry in the upstream app_servers block of nginx.conf must be set to 127.0.0.1:5000. This enables communication with the upstream application container listening on port 5000.

The first step to push the container images to ECR is to fetch the docker login command with the required security tokens. Run the following command:

aws ecr get-login --no-include-email --region us-east-1

It should return you a Docker login command with a security token. Copy the command and tokens and run it.

The second step is to tag the local container image with the remote ECR repository. Run the following command:

docker tag aspnetcorefargate_mymvcweb:latest <yourawsaccountnumber>.dkr.ecr.us-east-1.amazonaws.com/mymvcweb:latest

The third step is to push the tagged image to the remote ECR registry. Run the following command:

docker push <yourawsaccountnumber>.dkr.ecr.us-east-1.amazonaws.com/mymvcweb:latest

The above steps are repeated for the NGINX container as well. Now you have the container images available in ECR.

Create the Amazon ECS cluster

The Amazon ECS cluster is a logical grouping for AWS Fargate and Amazon ECS tasks. The cluster remains an administrative boundary for running every application.

In the AWS Management Console, navigate to Create Cluster and select Networking only.
Because you create the Amazon ECS service with the Fargate launch type, the ECS cluster is only a logical boundary; you don’t need to create ECS container instances for it. You only need to create the cluster along with the required networking constructs, such as the VPC and subnets.

Name the cluster and select Creation of new VPC for this cluster.

Leave the rest of the fields as their default values. You now have a VPC with two public subnets.
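If you script your infrastructure instead of using the console wizard, creating the cluster itself is a single call (the cluster name matches the one used later in this walkthrough); you would still create the VPC and subnets separately, for example with CloudFormation:

aws ecs create-cluster --cluster-name aspcorefargatecluster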

Create an Application Load Balancer

Next, create an Application Load Balancer, as defined in the reference architecture. The Application Load Balancer is required to load balance across multiple AWS Fargate tasks.

In the EC2 console, navigate to Create Load Balancer. Name your Load Balancer as aspnetcorefargatealb.

For Scheme, select internet-facing. For IP address type, choose ipv4. The Load Balancer listens on port 80 (HTTP). The Load Balancer’s Security Group should also allow traffic on port 80 (HTTP) from the internet.

While configuring the routing for the Load Balancer, for Target type, choose ip. For Protocol, choose HTTP. For Path, enter / (forward slash).

For more information, see Creating an Application Load Balancer.
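As a rough CLI sketch of the same setup, assuming you already have the VPC, public subnets, and a security group that allows inbound port 80 (the IDs and the target group name are placeholders):

# Internet-facing Application Load Balancer listening on port 80
aws elbv2 create-load-balancer \
--name aspnetcorefargatealb \
--scheme internet-facing \
--type application \
--ip-address-type ipv4 \
--subnets <public-subnet-1> <public-subnet-2> \
--security-groups <alb-security-group>

# Target group with target type "ip", as required for Fargate tasks
aws elbv2 create-target-group \
--name aspnetcorefargatetg \
--protocol HTTP \
--port 80 \
--target-type ip \
--vpc-id <vpc-id>

# Listener that forwards HTTP traffic to the target group
aws elbv2 create-listener \
--load-balancer-arn <alb-arn> \
--protocol HTTP \
--port 80 \
--default-actions Type=forward,TargetGroupArn=<target-group-arn>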

Create an AWS Fargate Task definition

The AWS Fargate Task definition is an important resource that acts as a blueprint for the AWS Fargate Task. The Task definition defines parameters such as:

  • Container image URL
  • CPU
  • Memory
  • IAM execution role
  • Host port
  • Container port
  • Log configurations
  • Container networking mode
  • Task type
  • Mount point
  • Volume

A Fargate Task is a running instance of a Task definition. Each Task represents a microservice. Tasks can be managed and independently scaled using an AWS Fargate service, which is explained in the upcoming sections.

In the console, choose Task Definitions, Create new Task Definition. For more information, see Creating a Task Definition.

Use the following AWS Fargate Task definition, which is based on the reference architecture defined for this walkthrough. Replace <awsaccount> with your own account.

{
  "executionRoleArn": "arn:aws:iam::<awsaccount>:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/aspnetcorefargatetask",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 80,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": 1024,
      "volumesFrom": [],
      "image": "<awsaccount>.dkr.ecr.us-east-1.amazonaws.com/reverseproxy: latest",
      "disableNetworking": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "privileged": null,
      "name": "reverseproxy"
    },
    {
      "dnsSearchDomains": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/aspnetcorefargatetask",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 5000,
          "protocol": "tcp",
          "containerPort": 5000
        }
      ],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": 1024,
      "volumesFrom": [],
      "image": "<awsaccount>.dkr.ecr.us-east-1.amazonaws.com/mymvcweb:latest",
      "disableNetworking": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "privileged": null,
      "name": "mymvcweb"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": "arn:aws:iam::<awsaccount>:role/aspnetecstaskroles",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:<awsaccount>:task-definition/aspnetcorefargatetask:1",
  "family": "aspnetcorefargatetask",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "revision": 1,
  "status": "ACTIVE",
  "volumes": []
}

The above Task definition contains two containers, the ASP.NET core and the NGINX reverse-proxy server. Currently, awsvpc is the only networking mode supported for AWS Fargate Tasks. When an AWS Fargate Task is launched, the ECS container network plugin assigns a dedicated Elastic Network Interface (ENI) for the Tasks. This ENI does not share the global default network namespace with ECS instances.

You also specify the subnets for placing tasks across ECS instances. This means that the Subnet Security Group is also applicable to the ENI for the respective Tasks. This enables communication between two AWS Fargate Tasks, or other resources within the VPC. Because of the awsvpc network mode, calls from AWS Fargate Tasks do not go through the eth0 Docker bridge.
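If you save the JSON above to a file, you can register it from the CLI as well. Fields such as taskDefinitionArn, revision, status, compatibilities, and requiresAttributes are returned by the service rather than accepted as input, so remove them from the file first (the file name here is illustrative):

aws ecs register-task-definition \
--cli-input-json file://aspnetcorefargatetask.json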

Create the Amazon ECS service

An Amazon ECS service manages long-running AWS Fargate tasks on your behalf. The desired state of the application, such as the number of tasks to keep running, can be defined using the service. For more information, see Create a service.

In the console, choose Task Definitions and select the task definition that you just created.

On the Task Definition [name] page, select the revision of the task definition from which to create your service.

Review the task definition, and choose Actions, Create Service. For Launch type, choose FARGATE. Enter values for the rest of the fields:

  • Platform version: LATEST
  • Cluster: aspcorefargatecluster (or the cluster name you chose)
  • Service name: aspcorefargatesvc (or another name of your choice)
  • Number of tasks: 2
  • Minimum healthy percent: 50
  • Maximum percent: 200

On the Configure networking page, select the required VPC and subnets required for running the tasks.

Register the Application Load Balancer (ALB) that you created. The ECS scheduler has built-in intelligence, which makes it seamless to work with Application Load Balancer (ALB).

Then, configure Service Auto Scaling. Even though this is an optional feature, I recommend enabling service-level scaling. It addresses the key tenets of how a microservice should behave at runtime. For more information, see (Optional) Configuring Your Service to Use Service Auto Scaling.

I’m defining the minimum number of tasks as 2, the desired number of tasks as 2, and the maximum number of tasks as 3.

Complete the Amazon ECS Service creation.
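The equivalent CLI call would look roughly like the following. The subnet, security group, and target group ARN values are placeholders, and the container name and port must match the reverseproxy container in the task definition:

aws ecs create-service \
--cluster aspcorefargatecluster \
--service-name aspcorefargatesvc \
--task-definition aspnetcorefargatetask:1 \
--desired-count 2 \
--launch-type FARGATE \
--platform-version LATEST \
--deployment-configuration minimumHealthyPercent=50,maximumPercent=200 \
--network-configuration "awsvpcConfiguration={subnets=[<subnet-id-1>,<subnet-id-2>],securityGroups=[<security-group-id>],assignPublicIp=ENABLED}" \
--load-balancers targetGroupArn=<target-group-arn>,containerName=reverseproxy,containerPort=80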

When the Amazon ECS Task gets placed, the ECS scheduler registers the Task as a target for the Load Balancer.

When the Task is healthy and passes the Load Balancer health checks, it is reflected in the healthy host count.

Access the DNS ‘A’ record of the Load Balancer in the browser. The ASP.NET core application should render successfully.

Conclusion

In this post, we took an existing ASP.NET Core application, containerized it, and hosted it in Amazon ECS as a microservice using the AWS Fargate compute engine. AWS Fargate gives you a way to run containers directly, without managing any EC2 instances, while still giving you full control over how the task is defined, including task networking and resources.

If you have questions or suggestions, please comment below.

Sundararajan Narasiman is an AWS Partner Solutions Architect

Run your Kubernetes Workloads on Amazon EC2 Spot Instances with Amazon EKS

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/run-your-kubernetes-workloads-on-amazon-ec2-spot-instances-with-amazon-eks/

Contributed by Madhuri Peri, Sr. EC2 Spot Specialist SA, and Shawn OConnor, AWS Enterprise Solutions Architect

Many organizations today are using containers to package source code and dependencies into lightweight, immutable artifacts that can be deployed reliably to any environment.

Kubernetes (K8s) is an open-source framework for automated scheduling and management of containerized workloads. In addition to master nodes, a K8s cluster is made up of worker nodes where containers are scheduled and run.

Amazon Elastic Container Service for Kubernetes (Amazon EKS) is a managed service that removes the need to manage the installation, scaling, or administration of master nodes and the etcd distributed key-value store. It provides a highly available and secure K8s control plane.

This post demonstrates how to use Spot Instances as K8s worker nodes, and shows the areas of provisioning, automatic scaling, and handling interruptions (termination) of K8s worker nodes across your cluster.

What this post does not cover

This post focuses primarily on EC2 instance scaling. This post also assumes a default interruption mode of terminate for EC2 instances, though there are other interruption types, stop and hibernate. For stateless K8s sessions, I recommend choosing the interruption mode of terminate.

Spot Instances

Amazon EC2 Spot Instances are spare EC2 capacity that offer discounts of 70-90% over On-Demand prices. The Spot price is determined by long-term trends in supply and demand and the amount of On-Demand capacity on a particular instance size, family, Availability Zone, and AWS Region.

If the available On-Demand capacity of a particular instance type is depleted, the Spot Instance is sent an interruption notice two minutes ahead of time so that it can gracefully wrap things up. I recommend a diversified fleet of instances, with multiple instance types, created by Spot Fleets or EC2 Fleets.

You can use Spot Instances for various fault-tolerant and flexible applications. In a workload that uses container orchestration and management platforms like EKS or Amazon Elastic Container Service (Amazon ECS), the schedulers have built-in mechanisms to identify any pods or containers on these interrupted EC2 instances. The interrupted pods or containers are then replaced on other EC2 instances in the cluster.

Solution architecture

There are three goals to accomplish with this solution:

  1.  The cluster must scale automatically to match the demands of an application.
  2. Optimize for cost by using Spot Instances.
  3. The cluster must be resilient to Spot Instance interruptions.

These goals are accomplished with the following components:

  • Cluster Autoscaler (open source): scales EC2 instances in or out. Deployed as a K8s DaemonSet on the On-Demand Instances.
  • Auto Scaling group (AWS): provisions Spot or On-Demand Instances. Deployed via CloudFormation.
  • Spot Instance interrupt handler (open source): sets K8s nodes to a drain state when the Spot Instance is interrupted. Deployed as a K8s DaemonSet on all K8s nodes with the label lifecycle=Ec2Spot.

Here’s a diagram of the solution architecture.

There are a few important things to note in this architecture:

  • Cluster Autoscaler is being used to control all scaling activities, with changes to the MinSize and DesiredCapacity parameters of the Auto Scaling group. This separation of duties ensures that there are no race conditions.
  • The Auto Scaling groups are used purely to replace any lost instances automatically (for example, terminations or interruptions) and maintain the desired number of instances. There are no scaling policies attached to the groups.
  • Auto Scaling, at the time of this post, supports a single instance type per group. As noted in Jeff Barr’s post EC2 Fleet – Manage Thousands of On-Demand and Spot Instances with One Request, Auto Scaling groups will support mixed instance types in H2 2018. At that point, multiple groups will no longer be required and can be collapsed into a single group specifying all instance types.

Here’s a further breakdown on the components.

Cluster Autoscaler

Automatic scaling in K8s comes in two forms:

  • Horizontal Pod Autoscaler scales the pods in a deployment or replica set. It is implemented as a K8s API resource and a controller. The controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. It obtains the metrics from either the resource metrics API (for per-pod resource metrics), or the custom metrics API (for all other metrics).
  • Cluster Autoscaler scales the worker nodes available for pods to be placed. Cluster Autoscaler is the focus for this post.

Cluster Autoscaler is the default K8s component for scaling the number of worker nodes in a cluster. It automatically increases the size of an Auto Scaling group so that pending pods have a place to run, and it attempts to remove idle nodes, that is, nodes with no running pods.

When a pod cannot be scheduled due to lack of available resources, Cluster Autoscaler determines that the cluster must scale up. Expander interfaces allow you to apply different pod placement strategies. Currently, the following strategies are supported:

  • Random – Randomly select an available node group.
  • Most Pods – Selects the node group that can schedule the largest number of pending pods. This can be used to balance the load across groups of nodes.
  • Least Waste – This is commonly referred to as ‘bin packing.’ It selects the node group that would have the least idle CPU or memory after the scale-up. This helps to reduce the total node footprint, and it is the strategy used in this post.

Although Cluster Autoscaler is the de facto standard for automatic scaling in K8s, it is not part of the main release. Deploy it like any other pod in the kube-system namespace, alongside the other management pods. By default, those management pods prevent the cluster from scaling down the nodes they run on. Override this behavior by passing the --skip-nodes-with-system-pods=false flag to Cluster Autoscaler.
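To make that configuration concrete, here is a sketch of the kind of command and flags the Cluster Autoscaler container typically runs with. The Auto Scaling group names and min/max sizes are placeholders; the manifest used later in this walkthrough is the authoritative source.

./cluster-autoscaler \
  --cloud-provider=aws \
  --expander=least-waste \
  --skip-nodes-with-system-pods=false \
  --nodes=1:5:Spot-NodeGroup1 \
  --nodes=1:5:Spot-NodeGroup2 \
  --nodes=1:3:OD-NodeGroup \
  --v=4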

But how do you reliably control scale-down operations so that you do not remove the pods that you need? This is accomplished using a pod disruption budget (PDB). A PDB limits the number of replicated pods that can be down at a given time. Create a PDB to ensure that you always have at least one Cluster Autoscaler pod running.
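For example, such a PDB can be created with a single kubectl command. This is only a sketch; the label selector below is an assumption, and the walkthrough later uses the cluster-autoscaler-pdb.yaml manifest from the repo instead.

kubectl create poddisruptionbudget cluster-autoscaler \
  --namespace=kube-system \
  --selector=app=cluster-autoscaler \
  --min-available=1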

In summary, Cluster Autoscaler does not remove nodes under the following scenarios:

  • Pods with a restrictive PDB.
  • Pods running in the kube-system namespace that are not run on the node by default, or that do not have a PDB.
  • Pods not backed by a controller object (not created by a deployment, replica set, job, stateful-set, and so on).
  • Pods running with local storage.
  • Pods running that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, and so on).

Auto Scaling Group

With Spot Instances, each instance type in each Availability Zone is a pool with its own Spot price based on the available capacity. A recommended best practice when working with Spot Instances is to use a diversified fleet of instances with multiple instance types, as created by Spot Fleet or EC2 Fleet. These APIs fulfill the specified TargetCapacity across the instance types by launching the required number of Spot Instances and, optionally, On-Demand Instances.

Unfortunately, Cluster Autoscaler does not support Spot Fleets at this time. You need a different strategy to provide diversification. Cluster Autoscaler for AWS provides integration with Auto Scaling groups. It enables users to choose from four different options of deployment:

  • One Auto Scaling group
  • Multiple Auto Scaling groups
  • Auto-Discovery
  • Master Node setup

For this post, you use the Multi-ASG deployment option. For Cluster Autoscaler and other cluster administration and management pods that run on EKS worker nodes, create a small Auto Scaling group using On-Demand Instances. This ensures that the health of the cluster is not impacted by Spot interruptions.

In K8s, label selectors are used to control where pods are placed. Use the K8s node label selector to place the appropriate pods on Spot or On-Demand Instances.
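As an illustration, the following minimal pod spec pins a pod to On-Demand nodes with a nodeSelector. It assumes the worker node CloudFormation template labels On-Demand nodes with lifecycle=OnDemand, mirroring the lifecycle=Ec2Spot label used for Spot nodes; check the template for the exact values.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: on-demand-only-example
spec:
  nodeSelector:
    lifecycle: OnDemand
  containers:
  - name: app
    image: nginx:1.13.9-alpine   # placeholder image for the example
EOF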

Interrupt handler

The last component to consider handles how the cluster responds to the interruption of a Spot Instance. The workflow can be summarized as:

  • Identify that a Spot Instance is being reclaimed.
  • Use the 2-minute notification window to gracefully prepare the node for termination.
  • Taint the node and cordon it off to prevent new pods from being placed.
  • Drain connections on the running pods.
  • To maintain desired capacity, replace the pods on remaining nodes.

Spot interruption notices are surfaced in two ways: through the EC2 instance metadata service and as Amazon CloudWatch Events.

For this post, you use a K8s DaemonSet, which means running one pod per node. The pod periodically polls the EC2 metadata service for a Spot termination notice. If a termination notice is received (HTTP status 200), the pod drains the node so that its workloads are gracefully stopped and rescheduled on other nodes before the 2-minute grace period expires. This approach is based on an existing project at the kube-spot-termination-notice-handler GitHub repo.
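Conceptually, the poll loop inside each handler pod looks something like the following shell sketch (node-name discovery is simplified here). The real handler in the repo is more involved; this only shows the metadata endpoint and the drain step.

NODE_NAME=$(hostname)
while true; do
  # The endpoint returns HTTP 200 with a timestamp once a termination notice is issued
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
    http://169.254.169.254/latest/meta-data/spot/termination-time)
  if [ "$STATUS" -eq 200 ]; then
    # Cordon and drain this node so the pods are rescheduled elsewhere within the 2-minute window
    kubectl drain "$NODE_NAME" --ignore-daemonsets --force --grace-period=90
    break
  fi
  sleep 5
done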

Walkthrough

Here’s the suggested workflow for this solution:

  1. Provision the worker nodes with EC2 instances using CloudFormation templates.
  2. Deploy the K8s Cluster Autoscaler pods as a DaemonSet, with a PDB.
  3. Deploy the Spot Instance interrupt handler pods as a DaemonSet.
  4. Deploy the sample application.

Prerequisites

You should have the following resources or configurations before starting this walkthrough:

  • An EKS cluster master endpoint
  • An EKS service role ARN
  • Subnet IDs and the control plane security group values
  • EKS master cluster certificates
  • Configuration of kubectl against the master EKS endpoint

For more information, see Amazon EKS – Now Generally Available and Deploy a Kubernetes Application with Amazon Elastic Container Service for Kubernetes.

When you describe the EKS cluster, you get a response like the following sample output:

    "cluster": {
        "name": " DemoSpotClusterScale",
        "arn": "arn:aws:eks:us-west-2: 0123456789012:cluster/ DemoSpotClusterScale",
        "createdAt": 1528317531.751,
        "version": "1.10",
        "endpoint": "https://B960845ED5E21A3439ABB5E12F09CE88.sk1.us-west-2.eks.amazonaws.com",
        "roleArn": "arn:aws:iam::0123456789012:role/eksServiceRoleGA",
        "resourcesVpcConfig": {
            "subnetIds": [
                "subnet-3326464a",
                "subnet-c2b93b89",
                "subnet-13225b49"
            ],
            "securityGroupIds": [
                "sg-7fd0b70e"
            ],
            "vpcId": "vpc-c7c8c4be"
        },
        "status": "ACTIVE",
        "certificateAuthority": {
            "data": "<Your ca data here>"
        }
    }
}

I use the cluster name DemoSpotClusterScale throughout this post. Replace that with your cluster name in the following commands.
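The output above comes from describing the cluster with the AWS CLI, for example:

$ aws eks describe-cluster --name DemoSpotClusterScale --region us-west-2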

Get started

git clone https://github.com/awslabs/ec2-spot-labs.git

cd ec2-spot-labs/ec2-spot-eks-solution

Provision the worker nodes

Add worker nodes to your cluster so that you can deploy your applications. Worker nodes can be either Spot or On-Demand Instances. In this example, use Spot Instances for worker nodes.

You can use this customized AWS CloudFormation template to create the Auto Scaling groups described earlier. This template also labels the node with a lifecycle key value indicating whether it is an On-Demand or Spot Instance node.

The template deploys Auto Scaling groups dedicated to the following instance types:

  • Spot Instances, m4.large, across three Availability Zones.
  • Spot Instances, t2.medium, across three Availability Zones.
  • On-Demand Instances, across three Availability Zones.
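One way to launch the template is with the AWS CLI, roughly as follows. The stack name, template file name, and parameter keys here are placeholders; use the ones defined in the template from the cloned repo.

aws cloudformation create-stack \
  --stack-name eks-spot-worker-nodes \
  --template-body file://eks-workers-template.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters ParameterKey=ClusterName,ParameterValue=DemoSpotClusterScale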

Make sure that you apply the aws-auth-cm.yaml file with the appropriate NodeInstanceRole value, as provisioned by the CloudFormation template. Find this parameter on the Resources tab.

kubectl apply -f aws-auth-cm.yaml

If the kubectl get nodes command worked as documented, then you are ready to proceed to the next section.
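You can also confirm that the worker nodes joined with the expected lifecycle labels, which later steps rely on:

$ kubectl get nodes --show-labels | grep lifecycle
$ kubectl get nodes -l lifecycle=Ec2Spot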

Deploying Cluster Autoscaler and PDB

  1. Download the manifest file cluster-autoscaler-ds.yaml. There are six K8s resources that enable the cluster-autoscaler add-on to work in the EKS environment:
    • Service account
    • Cluster role
    • Role
    • Cluster role binding
    • Role binding
    • Two Auto Scaling groups created by the CloudFormation template for Spot and On-Demand Instances

    You also see the cluster-autoscaler command with configured parameters.

  2. Edit the cluster-autoscaler-ds.yaml file to replace the [OD-NodeGroup-Name], [Spot-NodeGroup1-Name], and [Spot-NodeGroup2-Name] placeholders (lines 141-143 of that file) with the Auto Scaling group names created by your worker node CloudFormation template. Then deploy the manifest:
    $ kubectl create -f cluster-autoscaler/cluster-autoscaler-ds.yaml

  3. Monitor the deployment:
    $ kubectl logs cluster-autoscaler-<podgeneratedID> --namespace=kube-system

  4. Download and deploy the Cluster Autoscaler PDB:
    $ kubectl create -f cluster-autoscaler/cluster-autoscaler-pdb.yaml

Deploy the Spot Instance interrupt handler

Each K8s EC2 node being launched must have the label lifecycle=Ec2Spot applied through the kubelet --node-labels flag, as in the following example. This line is an excerpt from the CloudFormation template:

"sed -i s,MAX_PODS,", !Join [ "", [ "'", { "Fn::FindInMap": [ MaxPodsPerNode, { Ref: SpotNode2InstanceType }, MaxPods ] }, " --node-labels ", "lifecycle=Ec2Spot" , "'" ] ], ",g /etc/systemd/system/kubelet.service", "\n",

The Docker image contains the instance metadata poll script, as shown in entrypoint.sh. Publish this image to your own repository, such as Amazon ECR. A sample image is also available on Docker Hub.

Deploy the Spot interrupt handler pod using the provided spec. This sets up the DaemonSet only on the instances that have a K8s label of lifecycle=Ec2Spot.

kubectl apply -f spot-termination-handler/deploy-k8-pod/spot-interrupt-handler.yaml

When the Spot Instance is interrupted, this pod catches the interruption and vacates the pods.

Deploy the sample application and test scaling up and down

Deploy a sample application with three replicas. Create a new manifest file named greeter-sample.yaml from the code below, or download it from the cloned repo (sample/greeter-sample.yaml).

You are using node affinity to prefer deployment on Spot Instances. If no node with the Ec2Spot label is available, the manifest allows the application to run elsewhere.
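As a hedged sketch of what that manifest can look like, the deployment below prefers Spot nodes through preferredDuringSchedulingIgnoredDuringExecution node affinity. The container image is a placeholder; the greeter-sample.yaml in the repo is the authoritative version.

cat <<'EOF' > greeter-sample.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: greeter-sample
spec:
  replicas: 3
  selector:
    matchLabels:
      app: greeter-sample
  template:
    metadata:
      labels:
        app: greeter-sample
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: lifecycle
                operator: In
                values:
                - Ec2Spot
      containers:
      - name: greeter-sample
        image: nginx:1.13.9-alpine   # placeholder image for illustration
        ports:
        - containerPort: 80
EOF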

$ kubectl create -f sample/greeter-sample.yaml

Scale up, and watch Cluster Autoscaler manage the Auto Scaling groups. Verify that Cluster Autoscaler is working by scaling up the sample service beyond the current limits of the cluster.

$ kubectl scale --replicas=50 deployment/greeter-sample

Check the AWS Management Console to confirm that the Auto Scaling groups are scaling up to meet demand. This may take a few minutes. You can also follow along with the pod deployment from the command line. You should see the pods transition from pending to running as nodes are scaled up.

$ kubectl get pods -o wide --watch

Scale down, and watch Cluster Autoscaler manage the Auto Scaling groups:

$ kubectl scale --replicas=1 deployment/greeter-sample

Check the K8s logs to watch the terminations occur:

$ kubectl logs deployment/cluster-autoscaler-<podgeneratedID> --namespace=kube-system

Conclusion

In this post, I showed you how to use Spot Instances with K8s workloads, by provisioning, scaling, and managing terminations effectively in EKS clusters to leverage both cost and scale optimizations. Happy coding!

Building, deploying, and operating containerized applications with AWS Fargate

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/building-deploying-and-operating-containerized-applications-with-aws-fargate/

This post was contributed by Jason Umiker, AWS Solutions Architect.

Whether it’s helping facilitate a journey to microservices or deploying existing tools more easily and repeatably, many customers are moving toward containerized infrastructure and workflows. AWS provides many of the services and mechanisms to help you with that.

In this post, I show you how to use Amazon ECS and AWS Fargate, as well as AWS CodeBuild and AWS CodePipeline, for an end-to-end CI/CD container solution.

What is Amazon ECS?

Amazon Elastic Container Service (ECS) helps schedule and orchestrate containers across a fleet of servers. It involves installing an agent on each container host that takes instructions from the ECS control plane and relays them to the local Docker daemon on each one. ECS makes this easy by providing an ECS-optimized Amazon Machine Image (AMI) that can be launched automatically from the ECS console or CLI, and that you can also use to launch container hosts yourself.

It is up to you to choose the appropriate instance types, sizes, and quantity for your cluster fleet. You should have the capacity to deploy and scale workloads as well as to spread them across enough failure domains for high availability. Features like Auto Scaling groups help with that.

Also, while AWS provides Amazon Linux and Windows AMIs pre-configured for ECS, you are responsible for ongoing maintenance of the OS, which includes patching and security. Items that require regular patching or updating in this model are the OS, Docker, the ECS agent, and of course the contents of the container images.

Two of the key ECS concepts are Tasks and Services. A task is one or more containers that are to be scheduled together by ECS. A service is like an Auto Scaling group for tasks. It defines the quantity of tasks to run across the cluster, where they should be running (for example, across multiple Availability Zones), automatically associates them with a load balancer, and horizontally scales based on metrics that you define, like CPU or memory utilization.

What is Fargate?

AWS Fargate is a new compute engine for Amazon ECS that runs containers without requiring you to deploy or manage the underlying Amazon EC2 instances. With Fargate, you specify an image to deploy and the amount of CPU and memory it requires. Fargate handles the updating and securing of the underlying Linux OS, Docker daemon, and ECS agent as well as all the infrastructure capacity management and scaling.

How to use Fargate?

Fargate is exposed as a launch type for ECS. It uses an ECS task and service definition that is similar to the traditional EC2 launch mode, with a few minor differences. It is easy to move tasks and services back and forth between launch types. The differences include:

  • Using the awsvpc network mode
  • Specifying the CPU and memory requirements for the task in the definition

The best way to learn how to use Fargate is to walk through the process and see it in action.

Walkthrough: Deploying a service with Fargate in the console

At the time of publication, Fargate for ECS is available in the N. Virginia, Ohio, Oregon, and Ireland AWS regions. This walkthrough works in any AWS region where Fargate is available.

If you’d prefer to use a CloudFormation template, this one covers Steps 1-4. After launching this template you can skip ahead to Explore Running Service after Step 4.

Step 1 – Create an ECS cluster

An ECS cluster is a logical construct for running groups of containers known as tasks. Clusters can also be used to segregate different environments or teams from each other. In the traditional EC2 launch mode, there are specific EC2 instances associated with and managed by each ECS cluster, but this is transparent to the customer with Fargate.

  1. Open the ECS console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. Choose Clusters, Create Cluster.
  3. Choose Networking only, Next step.
  4. For Cluster name, enter “Fargate”. If you don’t already have a VPC to use, select the Create VPC check box and accept the defaults as well. Choose Create.

Step 2 – Create a task definition, CloudWatch log group, and task execution role

A task is a collection of one or more containers that is the smallest deployable unit of your application. A task definition is a JSON document that serves as the blueprint for ECS to know how to deploy and run your tasks.

The console makes it easier to create this definition by exposing all the parameters graphically. In addition, the console creates two dependencies:

  • The Amazon CloudWatch log group to store the aggregated logs from the task
  • The task execution IAM role that gives Fargate the permissions to run the task
  1. In the left navigation pane, choose Task Definitions, Create new task definition.
  2. Under Select launch type compatibility, choose FARGATE, Next step.
  3. For Task Definition Name, enter NGINX.
  4. If you had an IAM role for your task, you would enter it in Task Role but you don’t need one for this example.
  5. The Network Mode is automatically set to awsvpc for Fargate
  6. Under Task size, for Task memory, choose 0.5 GB. For Task CPU, enter 0.25.
  7. Choose Add container.
  8. For Container name, enter NGINX.
  9. For Image, put nginx:1.13.9-alpine.
  10. For Port mappings type 80 into Container port.
  11. Choose Add, Create.
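The same task definition can also be registered with the AWS CLI. The sketch below mirrors the console values above (0.25 vCPU is expressed as 256 CPU units and 0.5 GB as 512 MB); the execution role ARN is a placeholder for the ecsTaskExecutionRole that the console creates for you.

aws ecs register-task-definition \
  --family NGINX \
  --requires-compatibilities FARGATE \
  --network-mode awsvpc \
  --cpu 256 \
  --memory 512 \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --container-definitions '[{"name":"NGINX","image":"nginx:1.13.9-alpine","portMappings":[{"containerPort":80,"protocol":"tcp"}]}]'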

Step 3 – Create an Application Load Balancer

Sending incoming traffic through a load balancer is often a key piece of making an application both scalable and highly available. It can balance the traffic between multiple tasks, as well as ensure that traffic is only sent to healthy tasks. You can have the service manage the addition or removal of tasks from an Application Load Balancer as they come and go but that must be specified when the service is created. It’s a dependency that you create first.

  1. Open the EC2 console.
  2. In the left navigation pane, choose Load Balancers, Create Load Balancer.
  3. Under Application Load Balancer, choose Create.
  4. For Name, put NGINX.
  5. Choose the appropriate VPC (10.0.0.0/16 if you let ECS create it for you).
  6. For Availability Zones, select both and choose Next: Configure Security Settings.
  7. Choose Next: Configure Security Groups.
  8. For Assign a security group, choose Create a new security group. Choose Next: Configure Routing.
  9. For Name, enter NGINX. For Target type, choose ip.
  10. Choose Next: Register Targets, Next: Review, Create.
  11. Select the new load balancer and note its DNS name (this is the public address for the service).

Step 4 – Create an ECS service using Fargate

A service in ECS using Fargate serves a similar purpose to an Auto Scaling group in EC2. It ensures that the needed number of tasks are running both for scaling as well as spreading the tasks over multiple Availability Zones for high availability. A service creates and destroys tasks as part of its role and can optionally add or remove them from an Application Load Balancer as targets as it does so.

  1. Open the ECS console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. In the left navigation pane, choose Task Definitions.
  3. Select the NGINX task definition that you created and choose Actions, Create Service.
  4. For Launch Type, select Fargate.
  5. For Service name, enter NGINX.
  6. For Number of tasks, enter 1.
  7. Choose Next step.
  8. Under Subnets, choose both of the options.
  9. For Load balancer type, choose Application Load Balancer. It should then default to the NGINX version that you created earlier.
  10. Choose Add to load balancer.
  11. For Target group name, choose NGINX.
  12. Under DNS records for service discovery, for TTL, enter 60.
  13. Click Next step, Next step, and Create Service.
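For reference, the equivalent AWS CLI call looks roughly like this; the subnet, security group, and target group values are placeholders for the resources created earlier.

aws ecs create-service \
  --cluster Fargate \
  --service-name NGINX \
  --task-definition NGINX \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-aaaa1111,subnet-bbbb2222],securityGroups=[sg-0123abcd],assignPublicIp=ENABLED}' \
  --load-balancers 'targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/NGINX/0123456789abcdef,containerName=NGINX,containerPort=80'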

Explore the running service

At this point, you have a running NGINX service using Fargate. You can now explore what you have running and how it works. You can also ask it to scale up to two tasks across two Availability Zones in the console.

Go into the service and see details about the associated load balancer, tasks, events, metrics, and logs:

Scale the service from one task to multiple tasks:

  • Choose Update.
  • For Number of tasks, enter 2.
  • Choose Next step, Next step, Next step then Update Service.
  • Watch the event that is logged and the new additional task both appear.

On the service Details tab, open the NGINX Target Group Name link and see the registered IP targets spread across the two Availability Zones.

Go to the DNS name for the Application Load Balancer in your browser and see the default NGINX page. Get the value from the Load Balancers dashboard in the EC2 console.

Walkthrough: Adding a CI/CD pipeline to your service

Now, I’m going to show you how to set up a CI/CD pipeline around this service. It watches a GitHub repo for changes and rebuilds the container with CodeBuild based on the buildspec.yml file and Dockerfile in the repo. If that build is successful, it then updates your Fargate service to deploy the new image.

If you’d prefer to use a CloudFormation Template, this one covers the creation of the dependencies so that the console will pre-fill these (CodeBuild Project and IAM Roles) during the creation of the CodePipeline in the steps below.

Step 1 – Create an ECR repository for the rebuilt container image

An ECR repository is a place to store your container images in a secure and reliable manner. Scaling and self-healing of Fargate tasks requires these images to be always available to be pulled when required. This is an important part of a container platform.

  1. Open the ECS console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. In the left navigation pane, under Amazon ECR, choose Repositories, Get started.
  3. For Repository name, put NGINX and choose Next step.

Step 2 – Fork the nginx-codebuild example into your own GitHub account

I have created an example project that takes the Dockerfile and config files for the official NGINX Docker Hub image and adds a buildspec.yml file to tell CodeBuild how to build the container and push it to your new ECR registry on completion. You can fork it into your own GitHub account for this CI/CD demo.

  1. Go to https://github.com/jasonumiker/nginx-codebuild.
  2. In the upper right corner, choose Fork.
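Conceptually, the build that the buildspec.yml drives boils down to commands like the following (a sketch only; the buildspec.yml in the forked repo is authoritative). AWS_ACCOUNT_ID and IMAGE_REPO_NAME are the environment variables you configure on the CodeBuild project later in this walkthrough, AWS_DEFAULT_REGION is provided by CodeBuild, and images.json is the artifact CodePipeline uses to update the service.

# Log in to ECR, build and push the image, then emit the artifact CodePipeline expects
$(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)
REPO_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME
docker build -t $REPO_URI:latest .
docker push $REPO_URI:latest
printf '[{"name":"NGINX","imageUri":"%s"}]' "$REPO_URI:latest" > images.json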

Step 3 – Create the pipeline and associated IAM roles

You have two complementary AWS services for building a CI/CD pipeline for your containers. CodeBuild executes the build jobs and CodePipeline kicks off those builds when it notices that the source GitHub or CodeCommit repo changes. If successful, CodePipeline then deploys the new container image to Fargate.

The CodePipeline console can create the associated CodeBuild project, in addition to other dependencies such as the required IAM roles.

  1. Open the CodePipeline console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  2. Choose Get started.
  3. For Pipeline name, enter NGINX and choose Next step.
  4. For Source provider, choose GitHub.
  5. Choose Connect to GitHub and log in.
    • For Repository, choose your forked nginx-codebuild repo. For Branch, enter master. Choose Next step.
  6. For Build provider, choose AWS CodeBuild.
  7. Select Create a new build project.
  8. For Project name, enter NGINX.
  9. For Operating system, choose Ubuntu. For Runtime, choose Docker. For Version, select the latest version.
  10. Expand Advanced and set the following environment variables:
    • AWS_ACCOUNT_ID with a value of the account number
    • IMAGE_REPO_NAME with a value of NGINX (or whatever ECR name that you used)
  11. Choose Save build project, Next step.
  12. For Deployment provider, choose Amazon ECS.
  13. For Cluster name, enter Fargate.
  14. For Service name, choose NGINX.
  15. For Image filename, enter images.json.
  16. Choose Next step.
  17. Choose Create role, Allow, Next step, and then choose Create pipeline.
  18. Open the IAM console and ensure that Fargate is available in the selected Region (for example, N. Virginia).
  19. In the left navigation pane, choose Roles.
  20. Choose the code-build-nginx-service-role that was just created and choose Attach policy.
  21. For Policy type, choose AmazonEC2ContainerRegistryPowerUser and choose Attach policy.

Step 4 – Start the pipeline

You now have CodePipeline watching the GitHub repo for changes. It kicks off a CodeBuild build job on a change and, if the build is successful, creates a new deployment of the Fargate service with the new image.

Make a change to the source repo (even just adding a new dummy file) and then commit it and push it to master on your GitHub fork. This automatically kicks off the pipeline to build and deploy the change.

Conclusion

As you’ve seen, Fargate is fast and easy to set up, integrates well with the rest of the AWS platform, and saves you from much of the heavy lifting of running containers reliably at scale.

While it is useful to create things in the console to understand them better, we suggest automating them with infrastructure-as-code patterns, such as AWS CloudFormation, to ensure that they are repeatable and that any changes can be managed. There are some example templates in this post to help you get started.

In addition, adding things like unit and integration testing, blue/green and/or manual approval gates into CodePipeline are often a good idea before deploying patterns like this to production in many organizations. Some additional examples to look at next include:

AWS Online Tech Talks – June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2018/

AWS Online Tech Talks – June 2018

Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

 

Analytics & Big Data

June 18, 2018 | 11:00 AM – 11:45 AM PT – Get Started with Real-Time Streaming Data in Under 5 Minutes – Learn how to use Amazon Kinesis to capture, store, and analyze streaming data in real-time including IoT device data, VPC flow logs, and clickstream data.
June 20, 2018 | 11:00 AM – 11:45 AM PT – Insights For Everyone – Deploying Data across your Organization – Learn how to deploy data at scale using AWS Analytics and QuickSight’s new reader role and usage based pricing.

 

AWS re:Invent
June 13, 2018 | 05:00 PM – 05:30 PM PT – Episode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.
Compute

June 25, 2018 | 01:00 PM – 01:45 PM PT – Accelerating Containerized Workloads with Amazon EC2 Spot Instances – Learn how to efficiently deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances.

June 26, 2018 | 01:00 PM – 01:45 PM PT – Ensuring Your Windows Server Workloads Are Well-Architected – Get the benefits, best practices and tools on running your Microsoft Workloads on AWS leveraging a well-architected approach.

 

Containers
June 25, 2018 | 09:00 AM – 09:45 AM PT – Running Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS, including how to set up masters, networking, security, and auto-scaling for your cluster.

 

Databases

June 18, 2018 | 01:00 PM – 01:45 PM PT – Oracle to Amazon Aurora Migration, Step by Step – Learn how to migrate your Oracle database to Amazon Aurora.
DevOps

June 20, 2018 | 09:00 AM – 09:45 AM PT – Set Up a CI/CD Pipeline for Deploying Containers Using the AWS Developer Tools – Learn how to set up a CI/CD pipeline for deploying containers using the AWS Developer Tools.

 

Enterprise & Hybrid
June 18, 2018 | 09:00 AM – 09:45 AM PT – De-risking Enterprise Migration with AWS Managed Services – Learn how enterprise customers are de-risking cloud adoption with AWS Managed Services.

June 19, 2018 | 11:00 AM – 11:45 AM PT – Launch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the set up of best practice baselines when setting up new AWS environments.

June 21, 2018 | 11:00 AM – 11:45 AM PT – Leading Your Team Through a Cloud Transformation – Learn how you can help lead your organization through a cloud transformation.

June 21, 2018 | 01:00 PM – 01:45 PM PT – Enabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.

June 28, 2018 | 01:00 PM – 01:45 PM PT – Fireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.
IoT

June 27, 2018 | 11:00 AM – 11:45 AM PT – AWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.

 

Machine Learning

June 19, 2018 | 09:00 AM – 09:45 AM PT – Integrating Amazon SageMaker into your Enterprise – Learn how to integrate Amazon SageMaker and other AWS Services within an Enterprise environment.

June 21, 2018 | 09:00 AM – 09:45 AM PT – Building Text Analytics Applications on AWS using Amazon Comprehend – Learn how you can unlock the value of your unstructured data with NLP-based text analytics.

 

Management Tools

June 20, 2018 | 01:00 PM – 01:45 PM PT – Optimizing Application Performance and Costs with Auto Scaling – Learn how selecting the right scaling option can help optimize application performance and costs.

 

Mobile
June 25, 2018 | 11:00 AM – 11:45 AM PT – Drive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.

 

Security, Identity & Compliance

June 26, 2018 | 09:00 AM – 09:45 AM PT – Understanding AWS Secrets Manager – Learn how AWS Secrets Manager helps you rotate and manage access to secrets centrally.
June 28, 2018 | 09:00 AM – 09:45 AM PT – Using Amazon Inspector to Discover Potential Security Issues – See how Amazon Inspector can be used to discover security issues of your instances.

 

Serverless

June 19, 2018 | 01:00 PM – 01:45 PM PT – Productionize Serverless Application Building and Deployments with AWS SAM – Learn expert tips and techniques for building and deploying serverless applications at scale with AWS SAM.

 

Storage

June 26, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connect your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PT – Changing the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PT – Big Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.

Amazon SageMaker Updates – Tokyo Region, CloudFormation, Chainer, and GreenGrass ML

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/sagemaker-tokyo-summit-2018/

Today, at the AWS Summit in Tokyo, we announced a number of updates and new features for Amazon SageMaker. Starting today, SageMaker is available in Asia Pacific (Tokyo)! SageMaker also now supports CloudFormation. A new machine learning framework, Chainer, is now available in the SageMaker Python SDK, in addition to MXNet and TensorFlow. Finally, support for running Chainer models on several devices was added to AWS Greengrass Machine Learning.

Amazon SageMaker Chainer Estimator


Chainer is a popular, flexible, and intuitive deep learning framework. Chainer networks work on a “Define-by-Run” scheme, where the network topology is defined dynamically via forward computation. This is in contrast to many other frameworks which work on a “Define-and-Run” scheme where the topology of the network is defined separately from the data. A lot of developers enjoy the Chainer scheme since it allows them to write their networks with native python constructs and tools.

Luckily, using Chainer with SageMaker is just as easy as using a TensorFlow or MXNet estimator. In fact, it might even be a bit easier since it’s likely you can take your existing scripts and use them to train on SageMaker with very few modifications. With TensorFlow or MXNet, users have to implement a train function with a particular signature. With Chainer, your scripts can be a little bit more portable as you can simply read from a few environment variables like SM_MODEL_DIR, SM_NUM_GPUS, and others. We can wrap our existing script in an if __name__ == '__main__': guard and invoke it locally or on SageMaker.


import argparse
import os

if __name__ =='__main__':

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--learning-rate', type=float, default=0.05)

    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])

    args, _ = parser.parse_known_args()

    # ... load from args.train and args.test, train a model, write model to args.model_dir.

Then, we can run that script locally or use the SageMaker Python SDK to launch it on some GPU instances in SageMaker. The hyperparameters are passed to the script as command-line arguments, and the environment variables above are populated automatically. When we call fit, the input channels we pass are exposed to the script through the SM_CHANNEL_* environment variables.


from sagemaker.chainer.estimator import Chainer
# Create my estimator
chainer_estimator = Chainer(
    entry_point='example.py',
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'epochs': 10, 'batch-size': 64}
)
# Train my estimator
chainer_estimator.fit({'train': train_input, 'test': test_input})

# Deploy my estimator to a SageMaker Endpoint and get a Predictor
predictor = chainer_estimator.deploy(
    instance_type="ml.m4.xlarge",
    initial_instance_count=1
)

Now, instead of bringing your own Docker container for training and hosting with Chainer, you can just maintain your script. You can see the full sagemaker-chainer-containers on GitHub. One of my favorite features of the new container is built-in ChainerMN for easy multi-node distribution of your Chainer training jobs.

There’s a lot more documentation and information available in both the README and the example notebooks.

AWS GreenGrass ML with Chainer

AWS GreenGrass ML now includes a pre-built Chainer package for all devices powered by Intel Atom, NVIDIA Jetson TX2, and Raspberry Pi. So, now GreenGrass ML provides pre-built packages for TensorFlow, Apache MXNet, and Chainer! You can train your models on SageMaker and then easily deploy them to any GreenGrass-enabled device using GreenGrass ML.

JAWS UG

I want to give a quick shout out to all of our wonderful and inspirational friends in the JAWS UG who attended the AWS Summit in Tokyo today. I’ve very much enjoyed seeing your pictures of the summit. Thanks for making Japan an amazing place for AWS developers! I can’t wait to visit again and meet with all of you.

Randall

[$] Easier container security with entitlements

Post Syndicated from corbet original https://lwn.net/Articles/755238/rss

During KubeCon + CloudNativeCon Europe 2018, Justin Cormack and Nassim Eddequiouaq presented a proposal to simplify the setting of security parameters for containerized applications. Containers depend on a large set of intricate security primitives that can have weird interactions. Because they are so hard to use, people often just turn the whole thing off. The goal of the proposal is to make those controls easier to understand and use; it is partly inspired by mobile apps on iOS and Android platforms, an idea that trickled back into Microsoft and Apple desktops. The time seems ripe to improve the field of container security, which is in desperate need of simpler controls.

Kata Containers 1.0

Post Syndicated from ris original https://lwn.net/Articles/755230/rss

Kata Containers 1.0 has been released. “This first release of Kata Containers completes the merger of Intel’s Clear Containers and Hyper’s runV technologies, and delivers an OCI compatible runtime with seamless integration for container ecosystem technologies like Docker and Kubernetes.”

[$] Securing the container image supply chain

Post Syndicated from corbet original https://lwn.net/Articles/754443/rss

“Security is hard” is a tautology, especially in the fast-moving world of container orchestration. We have previously covered various aspects of Linux container security through, for example, the Clear Containers implementation or the broader question of Kubernetes and security, but those are mostly concerned with container isolation; they do not address the question of trusting a container’s contents. What is a container running? Who built it and when? Even assuming we have good programmers and solid isolation layers, propagating that good code around a Kubernetes cluster and making strong assertions on the integrity of that supply chain is far from trivial. The 2018 KubeCon + CloudNativeCon Europe event featured some projects that could eventually solve that problem.

[$] Updates in container isolation

Post Syndicated from corbet original https://lwn.net/Articles/754433/rss

At KubeCon + CloudNativeCon Europe 2018, several talks explored the topic of container isolation and security. The last year saw the release of Kata Containers which, combined with the CRI-O project, provided strong isolation guarantees for containers using a hypervisor. During the conference, Google released its own hypervisor called gVisor, adding yet another possible solution for this problem. Those new developments prompted the community to work on integrating the concept of “secure containers” (or “sandboxed containers”) deeper into Kubernetes. This work is now coming to fruition; it prompts us to look again at how Kubernetes tries to keep the bad guys from wreaking havoc once they break into a container.

[$] Autoscaling for Kubernetes workloads

Post Syndicated from corbet original https://lwn.net/Articles/754153/rss

Technologies like containers, clusters, and Kubernetes offer the prospect of rapidly scaling the available computing resources to match variable demands placed on the system. Actually implementing that scaling can be a challenge, though. During KubeCon + CloudNativeCon Europe 2018, Frederic Branczyk from CoreOS (now part of Red Hat) held a packed session to introduce a standard and officially recommended way to scale workloads automatically in Kubernetes clusters.