Tag Archives: Amazon EC2

Executing Ansible playbooks in your Amazon EC2 Image Builder pipeline

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/executing-ansible-playbooks-in-your-amazon-ec2-image-builder-pipeline/

This post is contributed by Andrew Pearce – Sr. Systems Dev Engineer, AWS

Since launching Amazon EC2 Image Builder, many customers say they want to re-use existing investments in configuration management technologies (such as Ansible, Chef, or Puppet) with Image Builder pipelines.

In this blog, I walk through creating a document that can execute an Ansible playbook in an Image Builder pipeline. I then use the document to create an Image Builder component that can be added to an Image Builder image recipe. In addition to this blog, you can find further samples in the amazon-ec2-image-builder-samples GitHub repository.

Ansible playbook requirements

There are likely some changes required so your existing Ansible playbooks can work in Image Builder.

When executing Ansible within Image Builder, the playbook must be configured to execute using the localhost. In a traditional Ansible environment, the control server executes playbooks against remote hosts using a protocol like SSH or WinRM. In Image Builder, the host executing the playbook is also the host that must be configured.

There are a number of ways to accomplish this in Ansible. In this example, I set the value of hosts to 127.0.0.1, gather_facts to false, and connection to local. These values set execution at the localhost, prevent gathering Ansible facts about remote hosts, and force local execution.

In this walk through, I use the following sample playbook. This installs Apache, configures a default webpage, and ensures that Apache is enabled and running.

---
- name: Apache Hello World
  hosts: 127.0.0.1
  gather_facts: false
  connection: local
  tasks:
    - name: Install Apache
      yum:
        name: httpd
        state: latest
    - name: Create a default page
      shell: echo "<h1>Hello world from EC2 Image Builder and Ansible</h1>" > /var/www/html/index.html
    - name: Enable Apache
      service: name=httpd enabled=yes state=started

Creating the Ansible document

This example uses the build phase to install Ansible, download the playbook from an S3 bucket, execute the playbook and cleanup the system. It also ensures that Apache is returning the correct content in both the validate and test phases.

I use Amazon Linux 2 as the target operating system, and assume that the playbook is already uploaded to my S3 bucket.

Document design diagram

Document Design Diagram

To start, I must specify the top-level properties for the document. I specify a name and description to describe the document’s purpose, and a schemaVersion, which at the time of writing is 1.0. I also include a build phase, as I want this document to be used when building the image.

name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
phases:
  - name: build
    steps:

The first required step is to install Ansible. With Amazon Linux 2, I can install Ansible using Amazon Linux Extras, so I use the ExecuteBash action to perform the work.

     - name: InstallAnsible
        action: ExecuteBash
        inputs:
          commands:
           - sudo amazon-linux-extras install -y ansible2

Once Ansible is installed, I must download the playbook for execution. I use the S3Download action to download the playbook and store it in a temporary location. Note that the S3Download action accepts a list of inputs, so a single S3Download step could download multiple files.

     - name: DownloadPlaybook
        action: S3Download
        inputs:
          - source: 's3://mybucket/my-playbook.yml'
            destination: '/tmp/my-playbook.yml'

Now Ansible is installed and the playbook is downloaded. I use the ExecuteBinary action to invoke Ansible and perform the work described in the playbook. This step uses a feature of the component management application called chaining.

Chaining allows you to reference the input or output values of a step in subsequent steps. In the following example, {{build.DownloadPlaybook.inputs[0].destination}} passes the string of the first “destination” property value in the DownloadPlaybook step, which executes in the build phase. The “[0]” indicates the first value in the array (position zero).

For more information about how chaining works, review the Component Manager documentation. In this example, I refer to the S3Download destination path to avoid mistyping the value, which causes the build to fail.

      - name: InvokeAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{build.DownloadPlaybook.inputs[0].destination}}'

Finally, I clean up the system, so I use another ExecuteBash action combined with chaining to delete the downloaded playbook.

      - name: DeletePlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{build.DownloadPlaybook.inputs[0].destination}}'

Now, I have installed Ansible, downloaded the playbook, executed it, and cleaned up the system. The next step is to add a validate phase where I use curl to ensure that Apache responds with the desired content. Including a validate phase allows you to identify build issues more quickly. This fails the build before Image Builder creates an AMI and moves to the test phase.

In the validate phase, I use grep to confirm that Apache returned a value containing my required string. If the command fails, a non-zero exit code is returned, indicating a failure. This failure is returned to Image Builder and the image build is marked as “Failed”.

  - name: validate
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"

After the validate phase executes, the EC2 instance is stopped and an Amazon Machine Image (AMI) is created. Image Builder moves to the test stage where a new EC2 instance is launched using the AMI that was created in the previous step. During this stage, the test phase of a document is executed.

To ensure that Apache is still functioning as required, I add a test phase to the document using the same curl command as the validate phase. This ensures that Apache is started and still returning the correct content.

  - name: test
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"

Here is the complete document:

name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: InstallAnsible
        action: ExecuteBash
        inputs:
          commands:
           - sudo amazon-linux-extras install -y ansible2
      - name: DownloadPlaybook
        action: S3Download
        inputs:
          - source: 's3://mybucket/playbooks/my-playbook.yml'
            destination: '/tmp/my-playbook.yml'
      - name: InvokeAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{build.DownloadPlaybook.inputs[0].destination}}'
      - name: DeletePlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{build.DownloadPlaybook.inputs[0].destination}}'
  - name: validate
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"
  - name: test
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"

I now have a document that installs Ansible, downloads a playbook from an S3 bucket, and invokes Ansible to perform the work described in the playbook. I am also validating the installation was successful by ensuring Apache is returning the desired content.

Next, I use this document to create a component in Image Builder and assign the component to an image recipe. The image recipe could then be used when creating an image pipeline. The blog post Automate OS Image Build Pipelines with EC2 Image Builder provides a tutorial demonstrating how to create an image pipeline with the EC2 Image Builder console. For more information, review Getting started with EC2 Image Builder.

Conclusion

In this blog, I walk through creating a document that can execute Ansible playbooks for use in Image Builder. I define the required changes for Ansible playbooks and when each phase of the document would execute. I also note the amazon-ec2-image-builder-samples GitHub repository, which provides sample documents for executing Ansible playbooks and Chef recipes.

We continue to use this repository for publishing sample content. We hope Image Builder is providing value and making it easier for you to build virtual machine (VM) images. As always, keep sending us your feedback!

Best practices for handling EC2 Spot Instance interruptions

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/

This post is contributed by Scott Horsfield – Sr. Specialist Solutions Architect, EC2 Spot

Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand Instance prices. The only difference between an On-Demand Instance and a Spot Instance is that a Spot Instance can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back. Fortunately, there is a lot of spare capacity available, and handling interruptions in order to build resilient workloads with EC2 Spot is simple and straightforward. In this post, I introduce and walk through several best practices that help you design your applications to be fault tolerant and resilient to interruptions.

By using EC2 Spot Instances, customers can access additional compute capacity between 70%-90% off of On-Demand Instance pricing. This allows customers to run highly optimized and massively scalable workloads that would not otherwise be possible. These benefits make interruptions an acceptable trade-off for many workloads.

When you follow the best practices, the impact of interruptions is insignificant because interruptions are infrequent and don’t affect the availability of your application. At the time of writing this blog post, less than 5% of Spot Instances are interrupted by EC2 before being terminated intentionally by a customer, because they are automatically handled through integrations with AWS services.

Many applications can run on EC2 Spot Instances with no modifications, especially applications that already adhere to modern software development best practices such as microservice architecture. With microservice architectures, individual components are stateless, scalable, and fault tolerant by design. Other applications may require some additional automation to properly handle being interrupted. Examples of these applications include those that must gracefully remove themselves from a cluster before termination to avoid impact to availability or performance of the workload.

In this post, I cover best practices around handing interruptions so that you too can access the massive scale and steep discounts that Spot Instances can provide.

Architect for fault tolerance and reliability

Ensuring your applications are architected to follow modern software development best practices is the first step in successfully adopting Spot Instances. Software development best practices, such as graceful degradation, externalizing state, and microservice architecture already enforce the same best practices that can allow your workloads to be resilient to Spot interruptions.

A great place to start when designing new workloads, or evaluating existing workloads for fault tolerance and reliability, is the Well-Architected Framework. The Well-Architected Framework is designed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. Based on five pillars — Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization — the Framework provides a consistent approach for customers to evaluate architectures, and implement designs that will scale over time.

The Well-Architected Tool is a free tool available in the AWS Management Console. Using this tool, you can create self-assessments to identify and correct gaps in your current architecture that might affect your toleration to Spot interruptions.

Use Spot integrated services

Some AWS services that you might already use, such as Amazon ECS, Amazon EKS, AWS Batch, and AWS Elastic Beanstalk have built-in support for managing the lifecycle of EC2 Spot Instances. These services do so through integration with EC2 Auto Scaling. This service allows you to build fleets of compute using On-Demand Instance and Spot Instances with a mixture of instance types and launch strategies. Through these services, tasks such as replacing interrupted instances are handled automatically for you. For many fault-tolerant workloads, simply replacing the instance upon interruption is enough to ensure the reliability of your service. For advanced use cases, EC2 Auto Scaling groups provide notifications, auto-replacement of interrupted instances, lifecycle hooks, and weights. These more advanced features allow for more control by the user over the composition and lifecycle of your compute infrastructure.

Spot-integrated services automate processes for handling interruptions. This allows you to stay focused on building new features and capabilities, and avoid the additional cost that custom automation may accrue over time.

Most examples in this post use EC2 Auto Scaling groups to demonstrate best practices because of the built-in integration with other AWS services. Also, Auto Scaling brings flexibility when building scalable fault-tolerant applications with EC2 Spot Instances. However, these same best practices can also be applied to other services such as Amazon EMR, EC2 fleet, and your own custom automation and frameworks.

Choose the right Spot Allocation Strategy

It is a best practice to use the capacity-optimized Spot Allocation Strategy when configuring your EC2 Auto Scaling group. This allows Auto Scaling groups to launch instances from Spot Instance pools with the most available capacity. Since Spot Instances can be interrupted when EC2 needs the capacity back, launching instances optimized for available capacity is a key best practice for reducing the possibility of interruptions.

You can simply select the Spot allocation strategy when creating new Auto Scaling groups, or modifying an existing Auto Scaling group from the console.

capacity-optimized allocation

Choose instance types across multiple instance sizes, families, and Availability Zones

It is important that the Auto Scaling group has a diverse set of options to choose from so that services launch instances optimally based on capacity. You do this by configuring the Auto Scaling group to launch instances of multiple sizes, and families, across multiple Availability Zones. Each instance type of a particular size, family, and Availability Zone in each Region is a separate Spot capacity pool. When you provide the Auto Scaling group and the capacity-optimized Spot Allocation Strategy a diverse set of Spot capacity pools, your instances are launched from the deepest pools available.

When you create a new Auto Scaling group, or modify an existing Auto Scaling group, you can specify a primary instance type and secondary instance types. The Auto Scaling group also provides recommended options. The following image shows what this configuration looks like in the console.

instance type selection

When using the recommended options, the Auto Scaling group is automatically configured with a diverse list of instance types across multiple instance families, generations, or sizes. We recommend leaving a many of these instance types in place as possible.

Handle interruption notices

For many workloads, the replacement of interrupted instances from a diverse set of instance choices is enough to maintain the reliability of your application. In other cases, you can gracefully decommission an application on an instance that is being interrupted. You can do this by knowing that an instance is going to be interrupted and responding through automation to react to the interruption. The good news is, there are several ways you can capture an interruption warning, which is published two minutes before EC2 reclaims the instance.

Inside an instance

The Instance Metadata Service is a secure endpoint that you can query for information about an instance directly from the instance. When a Spot Instance interruption occurs, you can retrieve data about the interruption through this service. This can be useful if you must perform an action on the instance before the instance is terminated, such as gracefully stopping a process, and blocking further processing from a queue. Keep in mind that any actions you automate must be completed within two minutes.

To query this service, first retrieve an access token, and then pass this token to the Instance Metadata Service to authenticate your request. Information about interruptions can be accessed through http://169.254.169.254/latest/meta-data/spot/instance-action. This URI returns a 404 response code when the instance is not marked for interruption. The following code demonstrates how you could query the Instance Metadata Service to detect an interruption.

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" –v http://169.254.169.254/latest/meta-data/spot/instance-action

If the instance is marked for interruption, you receive a 200 response code. You also receive a JSON formatted response that includes the action that is taken upon interruption (terminate, stop or hibernate) and a time when that action will be taken (ie: the expiration of your 2-minute warning period).

{"action": "terminate", "time": "2020-05-18T08:22:00Z"}

The following sample script demonstrates this pattern.

#!/bin/bash

TOKEN=`curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`

while sleep 5; do

    HTTP_CODE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s -w %{http_code} -o /dev/null http://169.254.169.254/latest/meta-data/spot/instance-action)

    if [[ "$HTTP_CODE" -eq 401 ]] ; then
        echo 'Refreshing Authentication Token'
        TOKEN=`curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30"`
    elif [[ "$HTTP_CODE" -eq 200 ]] ; then
        # Insert Your Code to Handle Interruption Here
    else
        echo 'Not Interrupted'
    fi

done

Developing automation with the Amazon EC2 Metadata Mock

When developing automation for handling Spot Interruptions within an EC2 Instance, it’s useful to mock Spot Interruptions to test your automation. The Amazon EC2 Metadata Mock (AEMM) project allows for running a mock endpoint of the EC2 Instance Metadata Service, and simulating Spot interruptions. You can use AEMM to serve a mock endpoint locally as you develop your automation, or within your test environment to support automated testing.

1. Download the latest Amazon EC2 Metadata Mock binary from the project repository.

2. Run the Amazon EC2 Metadata Mock configured to mock Spot Interruptions.

amazon-ec2-metadata-mock spotitn --instance-action terminate --imdsv2 

3. Test the mock endpoint with curl.

TOKEN=`curl -X PUT "http://localhost:1338/latest/api/token" -H "X-aws-ec2-  metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" –v http://localhost:1338/latest/meta-data/spot/instance-action

With default configuration, you receive a response for a simulated Spot Interruption with a time set to two minutes after the request was received.

{"action": "terminate", "time": "2020-05-13T08:22:00Z"}

Outside an instance

When a Spot Interruption occurs, a Spot instance interruption notice event is generated. You can create rules using Amazon CloudWatch Events or Amazon EventBridge to capture these events, and trigger a response such as invoking a Lambda Function. This can be useful if you need to take action outside of the instance to respond to an interruption, such as graceful removal of an interrupted instance from a load balancer to allow in-flight requests to complete, or draining containers running on the instance.

The generated event contains useful information such as the instance that is interrupted, in addition to the action (terminate, stop, hibernate) that is taken when that instance is interrupted. The following example event demonstrates a typical EC2 Spot Instance Interruption Warning.

{
    "version": "0",
    "id": "12345678-1234-1234-1234-123456789012",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "account": "123456789012",
    "time": "yyyy-mm-ddThh:mm:ssZ",
    "region": "us-east-2",
    "resources": ["arn:aws:ec2:us-east-2:123456789012:instance/i-1234567890abcdef0"],
    "detail": {
        "instance-id": "i-1234567890abcdef0",
        "instance-action": "action"
    }
}

Capturing this event is as simple as creating a new event rule that matches the pattern of the event, where the source is aws.ec2 and the detail-type is EC2 Spot Interruption Warning.

aws events put-rule --name "EC2SpotInterruptionWarning" \
    --event-pattern "{\"source\":[\"aws.ec2\"],\"detail-type\":[\"EC2 Spot Interruption Warning\"]}" \  
    --role-arn "arn:aws:iam::123456789012:role/MyRoleForThisRule"

When responding to these events, it’s common to retrieve additional details about the instance that is being interrupted, such as Tags. When triggering additional API calls in response to Spot Interruption Warnings, keep in mind that AWS APIs implement Request Throttling. So, ensure that you are implementing exponential backoffs and retries as needed and using paginated API calls.

For example, if multiple EC2 Spot Instances were interrupted, you could aggregate the instance-ids by temporarily storing them in DynamoDB and then combine the instance-ids into a single DescribeInstances API call. This allows you to retrieve details about multiple instances rather than implementing individual DescribeInstances API calls that may exceed API limits and result in throttling.

The following script demonstrates describing multiple EC2 Instances with a reusable paginator that can accept API-specific arguments. Additional examples can be found here.

import boto3

ec2 = boto3.client('ec2')

def paginate(method, **kwargs):
    client = method.__self__
    paginator = client.get_paginator(method.__name__)
    for page in paginator.paginate(**kwargs).result_key_iters():
        for item in page:
            yield item

def describe_instances(instance_ids):
    described_instances = []

    response = paginate(ec2.describe_instances, InstanceIds=instance_ids)

    for item in response:
        described_instances.append(item)

    return described_instances

if __name__ == "__main__":

    instance_ids = ['instance1','instance2', '...']

    print(describe_instances(instance_ids))

Container services

When running containerized workloads, it’s common to want to let the container orchestrator know when a node is going to be interrupted. This allows for the node to be marked so that the orchestrator knows to stop placing new containers on an interrupted node, and drain running containers off of the interrupted node. Amazon EKS and Amazon ECS are two popular services that manage containers on AWS, and each have simple integrations that allow for handling interruptions.

Amazon EKS/Kubernetes

With Kubernetes workloads, including self-managed clusters and those running on Amazon EKS, use the AWS maintained AWS Node Termination Handler to monitor for Spot Interruptions and make requests to the Kubernetes API to mark the node as non-schedulable. This project runs as a Daemonset on your Kubernetes nodes. In addition to handling Spot Interruptions, it can also be configured to handle Scheduled Maintenance Events.

You can use kubectl to apply the termination handler with default options to your cluster with the following command.

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.5.0/all-resources.yaml

Amazon ECS

With Amazon ECS workloads you can enable Spot Instance draining by passing a configuration parameter to the ECS container agent. Once enabled, when a container instance is marked for interruption, ECS receives the Spot Instance interruption notice and places the instance in DRAINING status. This prevents new tasks from being scheduled for placement on the container instance. If there are container instances in the cluster that are available, replacement service tasks are started on those container instances to maintain your desired number of running tasks.

You can enable this setting by configuring the user data on your container instances to apply this the setting at boot time.

#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=MyCluster
ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
ECS_CONTAINER_STOP_TIMEOUT=120s
EOF

Inside a Container Running on Amazon ECS

When an ECS Container Instance is interrupted, and the instance is marked as DRAINING, running tasks are stopped on the instance. When these tasks are stopped, a SIGTERM signal is sent to the running task, and ECS waits up to 2 minutes before forcefully stopping the task, resulting in a SIGKILL signal sent to the running container. This 2 minute window is configurable through a stopTimeout container timeout option, or through ECS Agent Configuration, as shown in the prior code, giving you flexibility within your container to handle the interruption. If you set this value to be greater than 120 seconds, it will not prevent your instance from being interrupted after the 2 minute warning. So, I recommend setting to be less than or equal to 120 seconds.

You can capture the SIGTERM signal within your containerized applications. This allows you to perform actions such as preventing the processing of new work, checkpointing the progress of a batch job, or gracefully exiting the application to complete tasks such as ensuring database connections are properly closed.

The following example Golang application shows how you can capture a SIGTERM signal, perform cleanup actions, and gracefully exit an application running within a container.

package main

import (
    "os"
    "os/signal"
    "syscall"
    "time"
    "log"
)

func cleanup() {
    log.Println("Cleaning Up and Exiting")
    time.Sleep(10 * time.Second)
    os.Exit(0)
}

func do_work() {
    log.Println("Working ...")
    time.Sleep(10 * time.Second)
}

func main() {
    log.Println("Application Started")

    signalChannel := make(chan os.Signal)

    interrupted := false
    for interrupted == false {

        signal.Notify(signalChannel, os.Interrupt, syscall.SIGTERM)
        go func() {
            sig := <-signalChannel
            switch sig {
                case syscall.SIGTERM:
                    log.Println("SIGTERM Captured - Calling Cleanup")
                    interrupted = true
                    cleanup()
            }
        }()

        do_work()
    }

}

EC2 Spot Instances are a great fit for containerized workloads because containers can flexibly run on instances from multiple instance families, generations, and sizes. Whether you’re using Amazon ECS, Amazon EKS, or your own self-hosted orchestrator, use the patterns and tools discussed in this section to handle interruptions. Existing projects such as the AWS Node Termination Handler are thoroughly tested and vetted by multiple customers, and remove the need for investment in your own customized automation.

Implement checkpointing

For some workloads, interruptions can be very costly. An example of this type of workload is if the application needs to restart work on a replacement instance, especially if your applications needs to restart work from the beginning. This is common in batch workloads, sustained load testing scenarios, and machine learning training.

To lower the cost of interruption, investigate patterns for implementing checkpointing within your application. Checkpointing means that as work is completed within your application, progress is persisted externally. So, if work is interrupted, it can be restarted from where it left off rather than from the beginning. This is similar to saving your progress in a video game and restarting from the last save point vs. the start of the level. When working with an interruptible service such as EC2 Spot, resuming progress from the latest checkpoint can significantly reduce the cost of being interrupted, and reduce any delays that interruptions could otherwise incur.

For some workloads, enabling checkpointing may be as simple as setting a configuration option to save progress externally (Amazon S3 is often a great place to store checkpointing data). For example, when running an object detection model training job on Amazon SageMaker with Managed Spot Training, enabling checkpointing is as simple as passing the right configuration to the training job.

The following example demonstrates how to configure a SageMaker Estimator to train an object detection model with checkpointing enabled. You can access the complete example notebook here.

od_model = sagemaker.estimator.Estimator(
    image_name=training_image,
    base_job_name='SPOT-object-detection-checkpoint',
    role=role, 
    train_instance_count=1, 
    train_instance_type='ml.p3.2xlarge',
    train_volume_size = 50,
    train_use_spot_instances=True,
    train_max_wait=3600,
    train_max_run=3600,
    input_mode= 'File',
    checkpoint_s3_uri="s3://{}/{}".format(s3_checkpoint_path, uuid.uuid4()),
    output_path=s3_output_location,
    sagemaker_session=sess)

For other workloads, enabling checkpointing may require extending your custom framework, or a public frameworks to persist data externally, and then reload that data when an instance is replaced and work is resumed.

For example, MXNet, a popular open source library for deep-learning, has a mechanism for executing Custom Callbacks during the execution of deep learning jobs. Custom Callbacks can be invoked during the completion of each training epoch, and can perform custom actions such as saving checkpointing data to S3. Your deep-learning training container could then be configured to look for any saved checkpointing data once restarted. Then load that training data locally so that training can be continued from the latest checkpoint.

By enabling checkpointing, you ensure that your workload is resilient to any interruptions that occur. Checkpointing can help to minimize data loss and the impact of an interruption on your workload by saving state and recovering from a saved state. When using existing frameworks, or developing your own framework, look for ways to integrate checkpointing so that your workloads have the flexibility to run on EC2 Spot Instances.

Track the right metrics

The best practices discussed in this blog can help you build reliable and scalable architectures that are resilient to Spot Interruptions. Architecting for fault tolerance is the best way you can ensure that your applications are resilient to interruptions, and is where you should focus most of your energy. Monitoring your services is important because it helps ensure that best practices are being effectively implemented. It’s critical to pick metrics that reflect availability and reliability, and not get caught up tracking metrics that do not directly correlate to service availability and reliability.

Some customers choose to track Spot Interruptions, which can be useful when done for the right reasons, such as evaluating tolerance to interruptions or conducing testing. However, frequency of interruption, or number of interruptions of a particular instance type, are examples of metrics that do not directly reflect the availability or reliability of your applications. Since Spot interruptions can fluctuate dynamically based on overall Spot Instance availability and demand for On-Demand Instances, tracking interruptions often results in misleading conclusions.

Rather than tracking interruptions, look to track metrics that reflect the true reliability and availability of your service including:

Conclusion

The best practices we’ve discussed, when followed, can help you run your stateless, scalable, and fault tolerant workloads at significant savings with EC2 Spot Instances. Architect your applications to be fault tolerant and resilient to failure by using the Well-Architected Framework. Lean on Spot Integrated Services such as EC2 Auto Scaling, AWS Batch, Amazon ECS, and Amazon EKS to handle the provisioning and replacement of interrupted instances. Be flexible with your instance selections by choosing instance types across multiple families, sizes, and Availability Zones. Use the capacity-optimized allocation strategy to launch instances from the Spot Instance pools with the most available capacity. Lastly, track the right metrics that represent the true availability of your application. These best practices help you safely optimize and scale your workloads with EC2 Spot Instances.

What is a cyber range and how do you build one on AWS?

Post Syndicated from John Formento, Jr. original https://aws.amazon.com/blogs/security/what-is-cyber-range-how-do-you-build-one-aws/

In this post, we provide advice on how you can build a current cyber range using AWS services.

Conducting security incident simulations is a valuable exercise for organizations. As described in the AWS Security Incident Response Guide, security incident response simulations (SIRS) are useful tools to improve how an organization handles security events. These simulations can be tabletop sessions, individualized labs, or full team exercises conducted using a cyber range.

A cyber range is an isolated virtual environment used by security engineers, researchers, and enthusiasts to practice their craft and experiment with new techniques. Traditionally, these ranges were developed on premises, but on-prem ranges can be expensive to build and maintain (and do not reflect the new realities of cloud architectures).

In this post, we share concepts for building a cyber range in AWS. First, we cover the networking components of a cyber range, then how to control access to the cyber range. We also explain how to make the exercise realistic and how to integrate your own tools into a cyber range on AWS. As we go through each component, we relate them back to AWS services.

Using AWS to build a cyber range provides flexibility since you only pay for services when in use. They can also be templated to a certain degree, to make creation and teardown easier. This allows you to iterate on your design and build a cyber range that is as identical to your production environment as possible.

Networking

Designing the network architecture is a critical component of building a cyber range. The cyber range must be an isolated environment so you have full control over it and keep the live environment safe. The purpose of the cyber range is to be able to play with various types of malware and malicious code, so keeping it separate from live environments is necessary. That being said, the range should simulate closely or even replicate real-world environments that include applications on the public internet, as well as internal systems and defenses.

There are several services you can use to create an isolated cyber range in AWS:

  • Amazon Virtual Private Cloud (Amazon VPC), which lets you provision a logically isolated section of AWS. This is where you can launch AWS resources in a virtual network that you define.
    • Traffic mirroring is an Amazon VPC feature that you can use to copy network traffic from an elastic network interface of Amazon Elastic Compute Cloud (Amazon EC2) instances.
  • AWS Transit Gateway, which is a service that enables you to connect your Amazon VPCs and your on-premises networks to a single gateway.
  • Interface VPC endpoints, which enable you to privately connect your VPC to supported AWS services and VPC endpoint services powered by AWS PrivateLink. You’re able to do this without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.

Amazon VPC provides the fundamental building blocks to create the isolated software defined network for your cyber range. With a VPC, you have fine grained control over IP CIDR ranges, subnets, and routing. When replicating real-world environments, you want the ability to establish communications between multiple VPCs. By using AWS Transit Gateway, you can add or subtract an environment from your cyber range and route traffic between VPCs. Since the cyber range is isolated from the public internet, you need a way to connect to the various AWS services without leaving the cyber range’s isolated network. VPC endpoints can be created for the AWS services so they can function without an internet connection.

A network TAP (test access point) is an external monitoring device that mirrors traffic passing between network nodes. A network TAP can be hardware or virtual; it sends traffic to security and monitoring tools. Though its unconventional in a typical architecture, routing all traffic through an EC2 Nitro instance enables you to use traffic mirroring to provide the network TAP functionality.

Accessing the system

Due to the isolated nature of a cyber range, administrators and participants cannot rely on typical tools to connect to resources, such as the secure shell (SSH) protocol and Microsoft remote desktop protocol (RDP). However, there are several AWS services that help achieve this goal, depending on the type of access role.

There are typically two types of access roles for a cyber range: an administrator and a participant. The administrator – sometimes called the Black Team – is responsible for designing, building, and maintaining the environment. The participant – often called the Red Team (attacking), Blue Team (defending), or Purple Team (both) – then plays within the simulated environment.

For the administrator role, the following AWS services would be useful:

  • Amazon EC2, which provides scalable computing capacity on AWS.
  • AWS Systems Manager, which gives you visibility into and control of your infrastructure on AWS

The administrator can use Systems Manager to establish SSH, RDP, or run commands manually or on a schedule. Files can be transferred in and out of the environment using a combination of S3 and VPC endpoints. Amazon EC2 can host all the web applications and security tools used by the participants, but are managed by the administrators with Systems Manager.

For the participant role, the most useful AWS service is Amazon WorkSpaces, which is a managed, secure cloud desktop service.

The participants are provided with virtual desktops that are contained in the isolated cyber range. Here, they either initiate an attack on resources in the environment or defend the attack. With Amazon WorkSpaces, participants can use the same operating system environment they are accustomed to in the real-world, while still being fully controlled and isolated within the cyber range.

Realism

A cyber range must be realistic enough to provide a satisfactory experience for its participants. This realism must factor tactics, techniques, procedures, communications, toolsets, and more. Constructing a cyber range on AWS enables the builder full control of the environment. It also means the environment is repeatable and auditable.

It is important to replicate the toolsets that are used in real life. By creating re-usable “Golden” Amazon Machine Images (AMIs) that contain tools and configurations that would typically be installed on machines, a builder can easily slot the appropriate systems into an environment. You can also use AWS PrivateLink – a feature of Amazon VPC – to establish connections from your isolated environment to customer-defined outside tools.

To emulate the tactics of an adversary, you can replicate a functional copy of the internet that is scaled down. Using private hosted zones in Amazon Route 53, you can use a “.” record to control name resolution for the entire “internet”. Alternatively, you can use a Route 53 resolver to forward all requests for a specific domain to your name servers. With these strategies, you can create adversaries that act from known malicious domains to test your defense capabilities. You can also use Amazon VPC route tables to send all traffic to a centralized web server under your control. This web server can respond as variety of websites to emulate the internet.

Logging and monitoring

It is important for responders to exercise using the same tools and techniques that they would use in the real world. AWS VPC traffic mirroring is an effective way to utilize many IDS products. Our documentation provides guidance for using popular open source tool such as Zeek and Suricata. Additionally, responders can leverage any network monitoring tools that support VXLAN.

You can interact with the tools outside of the VPC by using AWS PrivateLink. PrivateLink provides secure connectivity to resources outside of a network, without the need for firewall rules or route tables. PrivateLink also enables integration with many AWS Partner offerings.

You can use Amazon CloudWatch Logs to aggregate operating system logs, application logs, and logs from AWS resources. You can also easily share CloudWatch Logs with a dedicated security logging account.

In addition, if there are third party tools that you currently use, you can leverage the AWS Marketplace to easily procure the specific tools and install within your AWS account.

Summary

In this post, I covered what a cyber range is, the value of using one, and how to think about creating one using AWS services. AWS provides a great platform to build a cost effective cyber range that can be used to bolster your security practices.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

John Formento, Jr.

John is a Solutions Architect at Amazon Web Services. He helps large enterprises achieve their goals by architecting secure and scalable solutions on the AWS Cloud. John holds 5 AWS certifications including AWS Certified Solutions Architect – Professional and DevOps Engineer – Professional.

Author

Adam Cerini

Adam is a Senior Solutions Architect with Amazon Web Services. He focuses on helping Public Sector customers architect scalable, secure, and cost effective systems. Adam holds 5 AWS certifications including AWS Certified Solutions Architect – Professional and AWS Certified Security – Specialist.

Serverless Architecture for a Web Scraping Solution

Post Syndicated from Dzidas Martinaitis original https://aws.amazon.com/blogs/architecture/serverless-architecture-for-a-web-scraping-solution/

If you are interested in serverless architecture, you may have read many contradictory articles and wonder if serverless architectures are cost effective or expensive. I would like to clear the air around the issue of effectiveness through an analysis of a web scraping solution. The use case is fairly simple: at certain times during the day, I want to run a Python script and scrape a website. The execution of the script takes less than 15 minutes. This is an important consideration, which we will come back to later. The project can be considered as a standard extract, transform, load process without a user interface and can be packed into a self-containing function or a library.

Subsequently, we need an environment to execute the script. We have at least two options to consider: on-premises (such as on your local machine, a Raspberry Pi server at home, a virtual machine in a data center, and so on) or you can deploy it to the cloud. At first glance, the former option may feel more appealing — you have the infrastructure available free of charge, why not to use it? The main concern of an on-premises hosted solution is the reliability — can you assure its availability in case of a power outage or a hardware or network failure? Additionally, does your local infrastructure support continuous integration and continuous deployment (CI/CD) tools to eliminate any manual intervention? With these two constraints in mind, I will continue the analysis of the solutions in the cloud rather than on-premises.

Let’s start with the pricing of three cloud-based scenarios and go into details below.

Pricing table of three cloud-based scenarios

*The AWS Lambda free usage tier includes 1M free requests per month and 400,000 GB-seconds of compute time per month. Review AWS Lambda pricing.

Option #1

The first option, an instance of a virtual machine in AWS (called Amazon Elastic Cloud Compute or EC2), is the most primitive one. However, it definitely does not resemble any serverless architecture, so let’s consider it as a reference point or a baseline. This option is similar to an on-premises solution giving you full control of the instance, but you would need to manually spin an instance, install your environment, set up a scheduler to execute your script at a specific time, and keep it on for 24×7. And don’t forget the security (setting up a VPC, route tables, etc.). Additionally, you will need to monitor the health of the instance and maybe run manual updates.

Option #2

The second option is to containerize the solution and deploy it on Amazon Elastic Container Service (ECS). The biggest advantage to this is platform independence. Having a Docker file (a text document that contains all the commands you could call on the command line to assemble an image) with a copy of your environment and the script enables you to reuse the solution locally—on the AWS platform, or somewhere else. A huge advantage to running it on AWS is that you can integrate with other services, such as AWS CodeCommit, AWS CodeBuild, AWS Batch, etc. You can also benefit from discounted compute resources such as Amazon EC2 Spot instances.

Architecture of CloudWatch, Batch, ECR

The architecture, seen in the diagram above, consists of Amazon CloudWatch, AWS Batch, and Amazon Elastic Container Registry (ECR). CloudWatch allows you to create a trigger (such as starting a job when a code update is committed to a code repository) or a scheduled event (such as executing a script every hour). We want the latter: executing a job based on a schedule. When triggered, AWS Batch will fetch a pre-built Docker image from Amazon ECR and execute it in a predefined environment. AWS Batch is a free-of-charge service and allows you to configure the environment and resources needed for a task execution. It relies on ECS, which manages resources at the execution time. You pay only for the compute resources consumed during the execution of a task.

You may wonder where the pre-built Docker image came from. It was pulled from Amazon ECR, and now you have two options to store your Docker image there:

  • You can build a Docker image locally and upload it to Amazon ECR.
  • You just commit few configuration files (such as Dockerfile, buildspec.yml, etc.) to AWS CodeCommit (a code repository) and build the Docker image on the AWS platform.This option, shown in the image below, allows you to build a full CI/CD pipeline. After updating a script file locally and committing the changes to a code repository on AWS CodeCommit, a CloudWatch event is triggered and AWS CodeBuild builds a new Docker image and commits it to Amazon ECR. When a scheduler starts a new task, it fetches the new image with your updated script file. If you feel like exploring further or you want actually implement this approach please take a look at the example of the project on GitHub.

CodeCommit. CodeBuild, ECR

Option #3

The third option is based on AWS Lambda, which allows you to build a very lean infrastructure on demand, scales continuously, and has generous monthly free tier. The major constraint of Lambda is that the execution time is capped at 15 minutes. If you have a task running longer than 15 minutes, you need to split it into subtasks and run them in parallel, or you can fall back to Option #2.

By default, Lambda gives you access to standard libraries (such as the Python Standard Library). In addition, you can build your own package to support the execution of your function or use Lambda Layers to gain access to external libraries or even external Linux based programs.

Lambda Layer

You can access AWS Lambda via the web console to create a new function, update your Lambda code, or execute it. However, if you go beyond the “Hello World” functionality, you may realize that online development is not sustainable. For example, if you want to access external libraries from your function, you need to archive them locally, upload to Amazon Simple Storage Service (Amazon S3), and link it to your Lambda function.

One way to automate Lambda function development is to use AWS Cloud Development Kit (AWS CDK), which is an open source software development framework to model and provision your cloud application resources using familiar programming languages. Initially, the setup and learning might feel strenuous; however the benefits are worth of it. To give you an example, please take a look at this Python class on GitHub, which creates a Lambda function, a CloudWatch event, IAM policies, and Lambda layers.

In a summary, the AWS CDK allows you to have infrastructure as code, and all changes will be stored in a code repository. For a deployment, AWS CDK builds an AWS CloudFormation template, which is a standard way to model infrastructure on AWS. Additionally, AWS Serverless Application Model (SAM) allows you to test and debug your serverless code locally, meaning that you can indeed create a continuous integration.

See an example of a Lambda-based web scraper on GitHub.

Conclusion

In this blog post, we reviewed two serverless architectures for a web scraper on AWS cloud. Additionally, we have explored the ways to implement a CI/CD pipeline in order to avoid any future manual interventions.

Using Amazon EFS for AWS Lambda in your serverless applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/

Serverless applications are event-driven, using ephemeral compute functions to integrate services and transform data. While AWS Lambda includes a 512-MB temporary file system for your code, this is an ephemeral scratch resource not intended for durable storage.

Amazon EFS is a fully managed, elastic, shared file system designed to be consumed by other AWS services, such as Lambda. With the release of Amazon EFS for Lambda, you can now easily share data across function invocations. You can also read large reference data files, and write function output to a persistent and shared store. There is no additional charge for using file systems from your Lambda function within the same VPC.

EFS for Lambda makes it simpler to use a serverless architecture to implement many common workloads. It opens new capabilities, such as building and importing large code libraries directly into your Lambda functions. Since the code is loaded dynamically, you can also ensure that the latest version of these libraries is always used by every new execution environment. For appending to existing files, EFS is also a preferred option to using Amazon S3.

This blog post shows how to enable EFS for Lambda in your AWS account, and walks through some common use-cases.

Capabilities and behaviors of Lambda with EFS

EFS is built to scale on demand to petabytes of data, growing and shrinking automatically as files are written and deleted. When used with Lambda, your code has low-latency access to a file system where data is persisted after the function terminates.

EFS is a highly reliable NFS-based regional service, with all data stored durably across multiple Availability Zones. It is cost-optimized, due to no provisioning requirements, and no purchase commitments. It uses built-in lifecycle management to optimize between SSD-performance class and an infrequent access class that offer 92% lower cost.

EFS offers two performance modes – general purpose and MaxIO. General purpose is suitable for most Lambda workloads, providing lower operational latency and higher performance for individual files.

You also choose between two throughput modes – bursting and provisioned. The bursting mode uses a credit system to determine when a file system can burst. With bursting, your throughput is calculated based upon the amount of data you are storing. Provisioned throughput is useful when you need more throughout than provided by the bursting mode. Total throughput available is divided across the number of concurrent Lambda invocations.

The Lambda service mounts EFS file systems when the execution environment is prepared. This adds minimal latency when the function is invoked for the first time, often within hundreds of milliseconds. When the execution environment is already warm from previous invocations, the EFS mount is already available.

EFS can be used with Provisioned Concurrency for Lambda. When the reserved capacity is prepared, the Lambda service also configures and mounts EFS file system. Since Provisioned Concurrency executes any initialization code, any libraries or packages consumed from EFS at this point are downloaded. In this use-case, it’s recommended to use provisioned throughout when configuring EFS.

The EFS file system is shared across Lambda functions as it scales up the number of concurrent executions. As files are written by one instance of a Lambda function, all other instances can access and modify this data, depending upon the access point permissions. The EFS file system scales with your Lambda functions, supporting up to 25,000 concurrent connections.

Creating an EFS file system

Configuring EFS for Lambda is straight-forward. I show how to do this in the AWS Management Console but you can also use the AWS CLI, AWS SDK, AWS Serverless Application Model (AWS SAM), and AWS CloudFormation. EFS file systems are always created within a customer VPC, so Lambda functions using the EFS file system must all reside in the same VPC.

To create an EFS file system:

  1. Navigate to the EFS console.
  2. Choose Create File System.
    EFS: Create File System
  3. On the Configure network access page, select your preferred VPC. Only resources within this VPC can access this EFS file system. Accept the default mount targets, and choose Next Step.
  4. On Configure file system settings, you can choose to enable encryption of data at rest. Review this setting, then accept the other defaults and choose Next Step. This uses bursting mode instead of provisioned throughout.
  5. On the Configure client access page, choose Add access point.
    EFS: Add access point
  6. Enter the following parameters. This configuration creates a file system with open read/write permissions – read more about settings to secure your access points. Choose Next Step.EFS: Access points
  7. On the Review and create page, check your settings and choose Create File System.
  8. In the EFS console, you see the new file system and its configuration. Wait until the Mount target state changes to Available before proceeding to the next steps.

Alternatively, you can use CloudFormation to create the EFS access point. With the AWS::EFS::AccessPoint resource, the preceding configuration is defined as follows:

  AccessPointResource:
    Type: 'AWS::EFS::AccessPoint'
    Properties:
      FileSystemId: !Ref FileSystemResource
      PosixUser:
        Uid: "1000"
        Gid: "1000"
      RootDirectory:
        CreationInfo:
          OwnerGid: "1000"
          OwnerUid: "1000"
          Permissions: "0777"
        Path: "/lambda"

For more information, see the example setup template in the code repository.

Working with AWS Cloud9 and Amazon EC2

You can mount EFS access points on Amazon EC2 instances. This can be useful for browsing file systems contents and downloading files from other locations. The EFS console shows customized mount instructions directly under each created file system:

EFS customized mount instructions

The instance must have access to the same security group and reside in the same VPC as the EFS file system. After connecting via SSH to the EC2 instance, you mount the EFS mount target to a directory. You can also mount EFS in AWS Cloud9 instances using the terminal window.

Any files you write into the EFS file system are available to any Lambda functions using the same EFS file system. Similarly, any files written by Lambda functions are available to the EC2 instance.

Sharing large code packages with Lambda

EFS is useful for sharing software packages or binaries that are otherwise too large for Lambda layers. You can copy these to EFS and have Lambda use these packages as if there are installed in the Lambda deployment package.

For example, on EFS you can install Puppeteer, which runs a headless Chromium browser, using the following script run on an EC2 instance or AWS Cloud9 terminal:

  mkdir node && cd node
  npm init -y
  npm i puppeteer --save

Building packages in EC2 for EFS

You can then use this package from a Lambda function connected to this folder in the EFS file system. You include the Puppeteer package with the mount path in the require declaration:

const puppeteer = require ('/mnt/efs/node/node_modules/puppeteer')

In Node.js, to avoid changing declarations manually, you can add the EFS mount path to the Node.js module search path by using app-module-path. Lambda functions support a range of other runtimes, including Python, Java, and Go. Many other runtimes offer similar ways to add the EFS path to the list of default package locations.

There is an important difference between using packages in EFS compared with Lambda layers. When you use Lambda layers to include packages, these are downloaded to an immutable code package. Any changes to the underlying layer do not affect existing functions published using that layer.

Since EFS is a dynamic binding, any changes or upgrades to packages are available immediately to the Lambda function when the execution environment is prepared. This means you can output a build process to an EFS mount, and immediately consume any new versions of the build from a Lambda function.

Configuring AWS Lambda to use EFS

Lambda functions that access EFS must run from within a VPC. Read this guide to learn more about setting up Lambda functions to access resources from a VPC. There are also sample CloudFormation templates you can use to configure private and public VPC access.

The execution role for Lambda function must provide access to the VPC and EFS. For development and testing purposes, this post uses the AWSLambdaVPCAccessExecutionRole and AmazonElasticFileSystemClientFullAccess managed policies in IAM. For production systems, you should use more restrictive policies to control access to EFS resources.

Once your Lambda function is configured to use a VPC, next configure EFS in Lambda:

  1. Navigate to the Lambda console and select your function from the list.
  2. Scroll down to the File system panel, and choose Add file system.
    EFS: Add file system
  3. In the File system configuration:
  • From the EFS file system dropdown, select the required file system. From the Access point dropdown, choose the required EFS access point.
  • In the Local mount path, enter the path your Lambda function uses to access this resource. Enter an absolute path.
  • Choose Save.
    EFS: Add file system

The File system panel now shows the configuration of the EFS mount, and the function is ready to use EFS. Alternatively, you can use an AWS Serverless Application Model (SAM) template to add the EFS configuration to a function resource:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyLambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
	...
      FileSystemConfigs:
      - Arn: arn:aws:elasticfilesystem:us-east-1:xxxxxx:accesspoint/
fsap-123abcdef12abcdef
        LocalMountPath: /mnt/efs

To learn more, see the SAM documentation on this feature.

Example applications

You can view and download these examples from this GitHub repository. To deploy, follow the instructions in the repo’s README.md file.

1. Processing large video files

The first example uses EFS to process a 60-minute MP4 video and create screenshots for each second of the recording. This uses to the FFmpeg Linux package to process the video. After copying the MP4 to the EFS file location, invoke the Lambda function to create a series of JPG frames. This uses the following code to execute FFmpeg and pass the EFS mount path and input file parameters:

const os = require('os')

const inputFile = process.env.INPUT_FILE
const efsPath = process.env.EFS_PATH

const { exec } = require('child_process')

const execPromise = async (command) => {
	console.log(command)
	return new Promise((resolve, reject) => {
		const ls = exec(command, function (error, stdout, stderr) {
		  if (error) {
		    console.log('Error: ', error)
		    reject(error)
		  }
		  console.log('stdout: ', stdout);
		  console.log('stderr: ' ,stderr);
		})
		
		ls.on('exit', function (code) {
		  console.log('Finished: ', code);
		  resolve()
		})
	})
}

// The Lambda handler
exports.handler = async function (eventObject, context) {
	await execPromise(`/opt/bin/ffmpeg -loglevel error -i ${efsPath}/${inputFile} -s 240x135 -vf fps=1 ${efsPath}/%d.jpg`)
}

In this example, the process writes more than 2000 individual JPG files back to the EFS file system during a single invocation:

Console output from sample application

2. Archiving large numbers of files

Using the output from the first application, the second example creates a single archive file from the JPG files. The code uses the Node.js archiver package for processing:

const outputFile = process.env.OUTPUT_FILE
const efsPath = process.env.EFS_PATH

const fs = require('fs')
const archiver = require('archiver')

// The Lambda handler
exports.handler = function (event) {

  const output = fs.createWriteStream(`${efsPath}/${outputFile}`)
  const archive = archiver('zip', {
    zlib: { level: 9 } // Sets the compression level.
  })
  
  output.on('close', function() {
    console.log(archive.pointer() + ' total bytes')
  })
  
  output.on('end', function() {
    console.log('Data has been drained')
  })
  
  archive.pipe(output)  

  // append files from a glob pattern
  archive.glob(`${efsPath}/*.jpg`)
  archive.finalize()
}

After executing this Lambda function, the resulting ZIP file is written back to the EFS file system:

Console output from second sample application.

3. Unzipping archives with a large number of files

The last example shows how to unzip an archive containing many files. This uses the Node.js unzipper package for processing:

const inputFile = process.env.INPUT_FILE
const efsPath = process.env.EFS_PATH
const destinationDir = process.env.DESTINATION_DIR

const fs = require('fs')
const unzipper = require('unzipper')

// The Lambda handler
exports.handler = function (event) {

  fs.createReadStream(`${efsPath}/${inputFile}`)
    .pipe(unzipper.Extract({ path: `${efsPath}/${destinationDir}` }))

}

Once this Lambda function is executed, the archive is unzipped into a destination direction in the EFS file system. This example shows the screenshots unzipped into the frames subdirectory:

Console output from third sample application.

Conclusion

EFS for Lambda allows you to share data across function invocations, read large reference data files, and write function output to a persistent and shared store. After configuring EFS, you provide the Lambda function with an access point ARN, allowing you to read and write to this file system. Lambda securely connects the function instances to the EFS mount targets in the same Availability Zone and subnet.

EFS opens a range of potential new use-cases for Lambda. In this post, I show how this enables you to access large code packages and binaries, and process large numbers of files. You can interact with the file system via EC2 or AWS Cloud9 and pass information to and from your Lambda functions.

EFS for Lambda is supported at launch in APN Partner solutions, including Epsagon, Lumigo, Datadog, HashiCorp Terraform, and Pulumi. To learn more about how to use EFS for Lambda, see the AWS News Blog post and read the documentation.

Introducing Instance Refresh for EC2 Auto Scaling

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/introducing-instance-refresh-for-ec2-auto-scaling/

This post is contributed to by: Ran Sheinberg – Principal EC2 Spot SA, and Isaac Vallhonrat – Sr. EC2 Spot Specialist SA

Today, we are launching Instance Refresh. This is a new feature in EC2 Auto Scaling that enables automatic deployments of instances in Auto Scaling Groups (ASGs), in order to release new application versions or make infrastructure updates.

Amazon EC2 Auto Scaling is used for a wide variety of workload types and applications. EC2 Auto Scaling helps you maintain application availability through a rich feature set. This feature set includes integration into Elastic Load Balancing, automatically replacing unhealthy instances, balancing instances across Availability Zones, provisioning instances across multiple pricing options and instance types, dynamically adding and removing instances, and more.

Many customers use an immutable infrastructure approach. This approach encourages replacing EC2 instances to update the application or configuration, instead of deploying into EC2 instances that are already running. This can be done by baking code and software in golden Amazon Machine Images (AMIs), and rolling out new EC2 Instances that use the new AMI version. Another common pattern for rolling out application updates is changing the package version that the instance pulls when it boots (via updates to instance user data). Or, keeping that pointer static, and pushing a new version to the code repository or another type of artifact (container, package on Amazon S3) to be fetched by an instance when it boots and gets provisioned.

Until today, EC2 Auto Scaling customers used different methods for replacing EC2 instances inside EC2 Auto Scaling groups when a deployment or operating system update was needed. For example, UpdatePolicy within AWS CloudFormation, create_before_destroy lifecycle in Hashicorp Terraform, using AWS CodeDeploy, or even custom scripts that call the EC2 Auto Scaling API.

Customers told us that they want native deployment functionality built into EC2 Auto Scaling to take away the heavy lifting of custom solutions, or deployments that are initiated from outside of Auto Scaling groups.

Introducing Instance Refresh in EC2 Auto Scaling

You can trigger an Instance Refresh using the EC2 Auto Scaling groups Management Console, or use the new StartInstanceRefresh API in AWS CLI or any AWS SDK. All you need to do is specify the percentage of healthy instances to keep in the group while ASG terminates and launches instances. Also specify the warm-up time, which is the time period that ASG waits between groups of instances, that refreshes via Instance Refresh. If your ASG is using Health Checks, the ASG waits for the instances in the group to be healthy before it continues to the next group of instances.

Instance Refresh in action

To get started with Instance Refresh in the AWS Management Console, click on an existing ASG in the EC2 Auto Scaling Management Console. Then click the Instance refresh tab.

When clicking the Start instance refresh button, I am presented with the following options:

start instance refresh

With the default configuration, ASG works to keep 90% of the instances running and does not proceed to the next group of instances if that percentage is not kept. After each group, ASG waits for the newly launched instances to transition into the healthy state, in addition to the 300-second warm-up time to pass before proceeding to the next group of instances.

I can also initiate the same action from the AWS CLI by using the following code:

aws autoscaling start-instance-refresh --auto-scaling-group-name ASG-Instance-Refresh —preferences MinHealthyPercentage=90,InstanceWarmup=300

After initializing the instance refresh process, I can see ongoing instance refreshes in the console:

initialize instance refresh

The following image demonstrates how an active Instance refresh looks in the EC2 Instances console. Moreover, ASG strives to keep the capacity balanced between Availability Zones by terminating and launching instances in different Availability Zones in each group.

active instance refresh

Automate your workflow with Instance Refresh

You can now use this new functionality to create automations that work for your use-case.

To get started quickly, we created an example solution based on AWS Lambda. Visit the solution page on Github and see the deployment instructions.

Here’s an overview of what the solution contains and how it works:

  • An EC2 Auto Scaling group with two instances running
  • An EC2 Image Builder pipeline, set up to build and test an AMI
  • An SNS topic that would get notified when the image build completes
  • A Lambda function that is subscribed to the SNS topic, which gets triggered when the image build completes
  • The Lambda function gets the new AMI ID from the SNS notification, creates a new Launch Template version, and then triggers an Instance Refresh in the ASG, which starts the rolling update of instances.
  • Because you can configure the ASG with LaunchTemplateVersion = $Latest, every new instance that is launched by the Instance Refresh process uses the new AMI from the latest version of the Launch Template.

See the automation flow in the following diagram.

instance refresh automation flow

Conclusion

We hope that the new Instance Refresh functionality in your ASGs allow for a more streamlined approach to launching and updating your application deployments running on EC2. You can now create automations that fit your use case. This allows you to more easily refresh the EC2 instances running in your Auto Scaling groups, when deploying a new version of your application or when you must replace the AMI being used. Visit the user-guide to learn more and get started.

Building for Cost optimization and Resilience for EKS with Spot Instances

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/cost-optimization-and-resilience-eks-with-spot-instances/

This post is contributed by Chris Foote, Sr. EC2 Spot Specialist Solutions Architect

Running your Kubernetes and containerized workloads on Amazon EC2 Spot Instances is a great way to save costs. Kubernetes is a popular open-source container management system that allows you to deploy and manage containerized applications at scale. AWS makes it easy to run Kubernetes with Amazon Elastic Kubernetes Service (EKS) a managed Kubernetes service to run production-grade workloads on AWS. To cost optimize these workloads, run them on Spot Instances. Spot Instances are available at up to a 90% discount compared to On-Demand prices. These instances are best used for various fault-tolerant and instance type flexible applications. Spot Instances and containers are an excellent combination, because containerized applications are often stateless and instance flexible.

In this blog, I illustrate the best practices of using Spot Instances such as diversification, automated interruption handling, and leveraging Auto Scaling groups to acquire capacity. You then adapt these Spot Instance best practices to EKS with the goal of cost optimizing and increasing the resilience of container-based workloads.

Spot Instances Overview

Spot Instances are spare Amazon EC2 capacity that allows customers to save up to 90% over On-Demand prices. Spot capacity is split into pools determined by instance type, Availability Zone (AZ), and AWS Region. The Spot Instance price changes slowly determined by long-term trends in supply and demand of a particular Spot capacity pool, as shown below:

Spot Instance pricing

Prices listed are an example, and may not represent current prices. Spot Instance pricing is illustrated in orange blocks, and On-Demand is illustrated in dark-blue.

When EC2 needs the capacity back, the Spot Instance service arbitrarily sends Spot interruption notifications to instances within the associated Spot capacity pool. This Spot interruption notification lands in both the EC2 instance metadata and Eventbridge. Two minutes after the Spot interruption notification, the instance is reclaimed. You can set up your infrastructure to automate a response to this two-minute notification. Examples include draining containers, draining ELB connections, or post-processing.

Instance flexibility is important when following Spot Instance best practices, because it allows you to provision from many different pools of Spot capacity. Leveraging multiple Spot capacity pools help reduce interruptions depending on your defined Spot Allocation Strategy, and decrease time to provision capacity. Tapping into multiple Spot capacity pools across instance types and AZs, allows you to achieve your desired scale — even for applications that require 500K concurrent cores:

Spot capacity pools = (Availability Zones) * (Instance Types)

If your application is deployed across two AZs and uses only an c5.4xlarge then you are only using (2 * 1 = 2) two Spot capacity pools. To follow Spot Instance best practices, consider using six AZs and allowing your application to use c5.4xlarge, c5d.4xlarge, c5n.4xlarge, and c4.4xlarge. This gives us (6 * 4 = 24) 24 Spot capacity pools, greatly increasing the stability and resilience of your application.

Auto Scaling groups support deploying applications across multiple instance types, and automatically replace instances if they become unhealthy, or terminated due to Spot interruption. To decrease the chance of interruption, use the capacity-optimized Spot allocation strategy. This automatically launches Spot Instances into the most available pools by looking at real-time capacity data, and identifying which are the most available.

Now that I’ve covered Spot best practices, you can apply them to Kubernetes and build an architecture for EKS with Spot Instances.

Solution architecture

The goals of this architecture are as follows:

  • Automatically scaling the worker nodes of Kubernetes clusters to meet the needs of the application
  • Leveraging Spot Instances to cost-optimize workloads on Kubernetes
  • Adapt Spot Instance best practices (like diversification) to EKS and Cluster Autoscaler

You achieve these goals via the following components:

ComponentRoleDetailsDeployment Method
Cluster AutoscalerScales EC2 instances automatically according to pods running in the clusterOpen SourceA DaemonSet via Helm on On-Demand Instances
EC2 Auto Scaling groupProvisions and maintains EC2 instance capacityAWSCloudformation via eksctl
AWS Node Termination HandlerDetects EC2 Spot interruptions and automatically drains nodesOpen SourceA DaemonSet via Helm on Spot Instances

The architecture deploys the EKS worker nodes over three AZs, and leverages three Auto Scaling groups – two for Spot Instances, and one for On-Demand. The Kubernetes Cluster Autoscaler is deployed on On-Demand worker nodes, and the AWS Node Termination Handler is deployed on all worker nodes.

EKS worker node architecture

Additional detail on Kubernetes interaction with Auto Scaling group:

  • Cluster Autoscaler can be used to control scaling activities by changing the DesiredCapacity of the Auto Scaling group, and directly terminating instances. Auto Scaling groups can be used to find capacity, and automatically replace any instances that become unhealthy, or terminated through Spot Instance interruptions.
  • The Cluster Autoscaler can be provisioned as a Deployment of one pod to an instance of the On-Demand Auto Scaling group. The proceeding diagram shows the pod in AZ1, however this may not necessarily be the case.
  • Each node group maps to a single Auto Scaling group. However, Cluster Autoscaler requires all instances within a node group to share the same number of vCPU and amount of RAM. To adhere to Spot Instance best practices and maximize diversification, you use multiple node groups. Each of these node groups is a mixed-instance Auto Scaling group with capacity-optimized Spot allocation strategy.

Kuberentes Node Groups

Autoscaling in Kubernetes Clusters

There are two common ways to scale Kubernetes clusters:

  1. Horizontal Pod Autoscaler (HPA) scales the pods in deployment or a replica set to meet the demand of the application. Scaling policies are based on observed CPU utilization or custom metrics.
  2. Cluster Autoscaler (CA) is a standalone program that adjusts the size of a Kubernetes cluster to meet the current needs. It increases the size of the cluster when there are pods that failed to schedule on any of the current nodes due to insufficient resources. It attempts to remove underutilized nodes, when its pods can run elsewhere.

When a pod cannot be scheduled due to lack of available resources, Cluster Autoscaler determines that the cluster must scale out and increases the size of the node group. When multiple node groups are used, Cluster Autoscaler chooses one based on the Expander configuration. Currently, the following strategies are supported: random, most-pods, least-waste, and priority

You use random placement strategy in this example for the Expander in Cluster Autoscaler. This is the default expander, and arbitrarily chooses a node-group when the cluster must scale out. The random expander maximizes your ability to leverage multiple Spot capacity pools. However, you can evaluate the others and may find another more appropriate for your workload.

Atlassian Escalator:

An alternative to Cluster Autoscaler for batch workloads is Escalator. It is designed for large batch or job-based workloads that cannot be force-drained and moved when the cluster needs to scale down.

Auto Scaling group

Following best practices for Spot Instances means deploying a fleet across a diversified set of instance families, sizes, and AZs. An Auto Scaling group is one of the best mechanisms to accomplish this. Auto Scaling groups automatically replace Spot Instances that have been terminated due to a Spot interruption with an instance from another capacity pool.

Auto Scaling groups support launching capacity from multiple instances types, and using multiple purchase options. For the example in this blog post, I’m maintaining a 1:4 vCPU to memory ratio for all instances chosen. For your application, there may be a different set of requirements. The instances chosen for the two Auto Scaling groups are below:

  • 4vCPU / 16GB ASG: xlarge, m5d.xlarge, m5n.xlarge, m5dn.xlarge, m5a.xlarge, m4.xlarge
  • 8vCPU / 32GB ASG: 2xlarge, m5d.2xlarge, m5n.2xlarge, m5dn.2xlarge, m5a.2xlarge, m4.2xlarge

In this example I use a total of 12 different instance types and three different AZs, for a total of (12 * 3 = 36) 36 different Spot capacity pools. The Auto Scaling group chooses which instance types to deploy based on the Spot allocation strategy. To minimize the chance of Spot Instance interruptions, you use the capacity-optimized allocation strategy. The capacity-optimized strategy automatically launches Spot Instances into the most available pools by looking at real-time capacity data and predicting which are the most available.

capacity-optimized Spot allocation strategy

Example of capacity-optimized Spot allocation strategy.

For Cluster Autoscaler, other cluster administration/management pods, and stateful workloads that run on EKS worker nodes, you create a third Auto Scaling group using On-Demand Instances. This ensures that Cluster Autoscaler is not impacted by Spot Instance interruptions. In Kubernetes, labels and nodeSelectors can be used to control where pods are placed. You use the nodeSelector to place Cluster Autoscaler on an instance in the On-Demand Auto Scaling group.

Note: The Auto Scaling groups continually work to balance the number of instances in each AZ they are deployed over. This may cause worker nodes to be terminated while ASG is scaling in the number of instances within an AZ. You can disable this functionality by suspending the AZRebalance process, but this can result in capacity becoming unbalanced across AZs. Another option is running a tool to drain instances upon ASG scale-in such as the EKS Node Drainer. The code provides an AWS Lambda function that integrates as an Amazon EC2 Auto Scaling Lifecycle Hook. When called, the Lambda function calls the Kubernetes API to cordon and evicts all evictable pods from the node being terminated. It then waits until all pods are evicted before the Auto Scaling group continues to terminate the EC2 instance.

Spot Instance Interruption Handling

To mitigate the impact of potential Spot Instance interruptions, leverage the ‘node termination handler’. The DaemonSet deploys a pod on each Spot Instance to detect the Spot Instance interruption notification, so that it can both terminate gracefully any pod that was running on that node, drain from load balancers and allow the Kubernetes scheduler to reschedule the evicted pods elsewhere on the cluster.

The workflow can be summarized as:

  • Identify that a Spot Instance is about to be interrupted in two minutes.
  • Use the two-minute notification window to gracefully prepare the node for termination.
  • Taint the node and cordon it off to prevent new pods from being placed on it.
  • Drain connections on the running pods.

Consequently:

  • Controllers that manage K8s objects like Deployments and ReplicaSet will understand that one or more pods are not available and create a new replica.
  • Cluster Autoscaler and AWS Auto Scaling group will re-provision capacity as needed.

Walkthrough

Getting started (launch EKS)

First, you use eksctl to create an EKS cluster with the name spotcluster-eksctl in combination with a managed node group. The managed node group will have two On-Demand t3.medium nodes and it will bootstrap with the labels lifecycle=OnDemand and intent=control-apps. Be sure to replace <YOUR REGION> with the region you’ll be launching your cluster into.

eksctl create cluster --version=1.15 --name=spotcluster-eksctl --node-private-networking --managed --nodes=3 --alb-ingress-access --region=<YOUR REGION> --node-type t3.medium --node-labels="lifecycle=OnDemand" --asg-access

This takes approximately 15 minutes. Once cluster creation is complete, test the node connectivity:

kubectl get nodes

Provision the worker nodes

You use eksctl create nodegroup and eksctl configuration files to add the new nodes to the cluster. First, create the configuration file spot_nodegroups.yml. Then, paste the code and replace <YOUR REGION> with the region you launched your EKS cluster in.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
    name: spotcluster-eksctl
    region: <YOUR REGION>
nodeGroups:
    - name: ng-4vcpu-16gb-spot
      minSize: 0
      maxSize: 5
      desiredCapacity: 1
      instancesDistribution:
        instanceTypes: ["m5.xlarge", "m5n.xlarge", "m5d.xlarge", "m5dn.xlarge","m5a.xlarge", "m4.xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
      labels:
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
      tags:
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
      iam:
        withAddonPolicies:
          autoScaler: true
          albIngress: true
    - name: ng-8vcpu-32gb-spot
      minSize: 0
      maxSize: 5
      desiredCapacity: 1
      instancesDistribution:
        instanceTypes: ["m5.2xlarge", "m5n.2xlarge", "m5d.2xlarge", "m5dn.2xlarge","m5a.2xlarge", "m4.2xlarge"] 
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotAllocationStrategy: capacity-optimized
      labels:
        lifecycle: Ec2Spot
        intent: apps
        aws.amazon.com/spot: "true"
      tags:
        k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
        k8s.io/cluster-autoscaler/node-template/label/intent: apps
      iam:
        withAddonPolicies:
          autoScaler: true
          albIngress: true

This configuration file adds two diversified Spot Instance node groups with 4vCPU/16GB and 8vCPU/32GB instance types. These node groups use the capacity-optimized Spot allocation strategy as described above. Last, you label all nodes created with the instance lifecycle “Ec2Spot” and later use nodeSelectors, to guide your application front-end to your Spot Instance nodes. To create both node groups, run:

eksctl create nodegroup -f spot_nodegroups.yml

This takes approximately three minutes. Once done, confirm these nodes were added to the cluster:

kubectl get nodes --show-labels --selector=lifecycle=Ec2Spot

Install the Node Termination Handler

You can install the .yaml file from the official GitHub site.

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.3.1/all-resources.yaml

This installs the Node Termination Handler to both Spot Instance and On-Demand nodes, which is helpful because the handler responds to both EC2 maintenance events and Spot Instance interruptions. However, if you are interested in limiting deployment to just Spot Instance nodes, the site has additional instructions to accomplish this.

Verify the Node Termination Handler is running:

kubectl get daemonsets --all-namespaces

Deploy the Cluster Autoscaler

For additional detail, see the EKS page here. Export the Cluster Autoscaler into a configuration file:

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Open the file created and edit the cluster-autoscaler container command to replace <YOUR CLUSTER NAME> with your cluster’s name, and add the following options.

--balance-similar-node-groups
--skip-nodes-with-system-pods=false

You also need to change the expander configuration. Search for - --expander= and replace least-waste with random

Example:

    spec:
        containers:
        - command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=random
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false

Save the file and then deploy the Cluster Autoscaler:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Next, add the cluster-autoscaler.kubernetes.io/safe-to-evictannotation to the deployment with the following command:

kubectl -n kube-system annotate deployment.apps/cluster-autoscaler cluster-autoscaler.kubernetes.io/safe-to-evict="false"

Open the Cluster Autoscaler releases page in a web browser and find the latest Cluster Autoscaler version that matches your cluster’s Kubernetes major and minor version. For example, if your cluster’s Kubernetes version is 1.16 find the latest Cluster Autoscaler release that begins with 1.16.

Set the Cluster Autoscaler image tag to this version using the following command, replacing 1.15.n with your own value. You can replace us with asia or eu:

kubectl -n kube-system set image deployment.apps/cluster-autoscaler cluster-autoscaler=us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.15.n

To view the Cluster Autoscaler logs, use the following command:

kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler

Deploy the sample application

Create a new file web-app.yaml, paste the following specification into it and save the file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-stateless
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: web-stateless
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 1000m
            memory: 1024Mi
      nodeSelector:    
        lifecycle: Ec2Spot
--- 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-stateful
spec:
  replicas: 2
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        service: redis
        app: redis
    spec:
      containers:
      - image: redis:3.2-alpine
        name: web-stateful
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 1000m
            memory: 1024Mi
      nodeSelector:
        lifecycle: OnDemand

This deploys three replicas, which land on one of the Spot Instance node groups due to the nodeSelector choosing lifecycle: Ec2Spot. The “web-stateful” nodes are not fault-tolerant and not appropriate to be deployed on Spot Instances. So, you use nodeSelector again, and instead choose lifecycle: OnDemand. By guiding fault-tolerant pods to Spot Instance nodes, and stateful pods to On-Demand nodes, you can even use this to support multi-tenant clusters.

To deploy the application:

kubectl apply -f web-app.yaml

Confirm that both deployments are running:

kubectl get deployment/web-stateless
kubectl get deployment/web-stateful

Now, scale out the stateless application:

kubectl scale --replicas=30 deployment/web-stateless

Check to see that there are pending pods. Wait approximately 5 minutes, then check again to confirm the pending pods have been scheduled:

kubectl get pods

Clean-Up

Remove the AWS Node Termination Handler:

kubectl delete daemonset aws-node-termination-handler -n kube-system

Remove the two Spot node groups (EC2 Auto Scaling group) that you deployed in the tutorial.

eksctl delete nodegroup ng-4vcpu-16gb-spot --cluster spotcluster-eksctl
eksctl delete nodegroup ng-8vcpu-32gb-spot --cluster spotcluster-eksctl

If you used a new cluster and not your existing cluster, delete the EKS cluster.

eksctl confirms the deletion of the cluster’s CloudFormation stack immediately but the deletion could take up to 15 minutes. You can optionally track it in the CloudFormation Console.

eksctl delete cluster --name spotcluster-eksctl

Conclusion

By following best practices, Kubernetes workloads can be deployed onto Spot Instances, achieving both resilience and cost optimization. Instance and Availability Zone flexibility are the cornerstones of pulling from multiple capacity pools and obtaining the scale your application requires. In addition, there are pre-built tools to handle Spot Instance interruptions, if they do occur. EKS makes this even easier by reducing operational overhead through offering a highly-available managed control-plane and managed node groups. You’re now ready to begin integrating Spot Instances into your Kubernetes clusters to reduce workload cost, and if needed, achieve massive scale.

New – Amazon EC2 C5a Instances Powered By 2nd Gen AMD EPYC™ Processors

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-c5a-instances-powered-by-2nd-gen-amd-epyc-processors/

Over the last 18 months, we have launched AMD-powered M5a and R5a/M5ad and R5ad, and T3a instances to provide customers additional choice for running their general purpose and memory intensive workloads. Built on the AWS Nitro System, these instances are powered by custom 1st generation AMD EPYC™ processors. These instances are priced 10% lower than comparable EC2 M5, R5, and T3 instances, and provide you with options to balance your instance mix based on cost and performance.

Today, I am excited to announce the general availability of compute-optimized C5a instances featuring 2nd Gen AMD EPYC™ processors, running at frequencies up to 3.3 GHz, are generally available. C5a instances are variants of Amazon EC2’s compute-optimized (C5) instance family and provide high performance processing at 10% lower cost over comparable instances. C5a instances are ideal for a broad set of compute-intensive workloads including batch processing, distributed analytics, data transformations, log analysis, and web applications.

You can launch C5a instances today in eight sizes in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Sydney), and Asia Pacific (Singapore) Regions in On-Demand, Spot, and Reserved Instance or as part of a Savings Plan. Here are the specs:

Instance NamevCPUsRAMEBS-Optimized BandwidthNetwork Bandwidth
c5a.large
24 GiBUp to 3.170 GbpsUp to 10 Gbps
c5a.xlarge
48 GiBUp to 3.170 GbpsUp to 10 Gbps
c5a.2xlarge
816 GiBUp to 3.170 GbpsUp to 10 Gbps
c5a.4xlarge
1632 GiBUp to 3.170 GbpsUp to 10 Gbps
c5a.8xlarge
3264 GiB3.170 Gbps10 Gbps
c5a.12xlarge
4896 GiB4.750 Gbps12 Gbps
c5a.16xlarge
64128 GiB6.3 Gbps20 Gbps
c5a.24xlarge
96192 GiB9.5 Gbps20 Gbps

But wait, there’s more! Disk variants, C5ad, that come with fast, local NVMe instance storage and bare metal variants, C5an.metal and C5adn.metal, are coming soon.

The C5a instances will be supported across a broad set of AWS services that support C5 today, including AWS Batch, Amazon EMR, Elastic Container Service (ECS), and Elastic Kubernetes Service(EKS). C5a are fully-compatible 64-bit x86 and managed by the same Nitro platform used across Amazon EC2. Again, these instances are available in similar sizes as the C5 instances, and the AMIs work on either, so go ahead and try both!

To learn more, visit our AMD Instances page and please send feedback to [email protected], AWS forum for EC2 or through your usual AWS Support contacts.

Channy;

Migrating a SharePoint application using the AWS Server Migration Service

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/migrating-a-sharepoint-application-using-the-aws-server-migration-service/

This post contributed by Ashwini Rudra, Solutions Architect; Rajesh Rathod, Sr. Product Manager; Vivek Chawda, Senior Software Engineer, EC2 Enterprise

Many AWS customers are migrating on-premises SharePoint workloads to AWS for greater reliability, faster performance, and lower costs. While planning the migration, customers are looking for tools and methodologies that reduce the time to migrate, application downtime, and performance disruption. They use continuous replication to optimize cost and effort required to migrate applications reliably.

To accelerate these migrations, AWS provides a comprehensive set of tools. In this blog post, we explore how to use AWS Server Migration Service (AWS SMS) for migrating a SharePoint application from on-premises to AWS.

Overview of solution

AWS Server Migration Service is an agentless service. It makes it easier and faster to migrate thousands of workloads from on-premises or Microsoft Azure to AWS. In this article, we discuss one of the approaches and steps to migrate SharePoint farm using AWS SMS.

AWS SMS also supports migrating a group of servers organized as an application. This can simplify migration of applications with complex dependencies that must be migrated together. This service provides a customized replication schedule designed to simplify migration at scale. This also tracks the progress of each migration using AWS Migration Hub.

For SharePoint migrations, it facilitates migration failovers quickly. After the initial sync, the migration uses an incremental change capture approach to synchronize changes made to the on-premises SharePoint servers. This method also reduced required network bandwidth for the migration.

SharePoint Migration Architecture

SharePoint Migration Architecture

Here is how this service and solution works:

Walkthrough

A basic SharePoint deployment is a 3-tier architecture comprising of web frontend servers, application servers, and backend SQL database servers. It also includes authentication services servers with Active Directory domain controllers.

To migrate this application, you must deploy a Server Migration Connector, which is a preconfigured virtual machine. This connector creates a server catalog. Based on selected server and configuration, it takes snapshots of virtual machines and stores them in S3 buckets.

In the background, AWS VM import/export service converts these snapshots into Amazon Machine Images (AMIs). Using these AMIs, you can configure launch settings where you define an order of application launch. You can also select instance types and set user-defined PowerShell scripts.

At the end, you define the cloud network topology: a VPC, subnets, and security groups. With these launch settings, SMS creates an AWS CloudFormation template, which can launch the SharePoint application in the target AWS account.

To migrate a SharePoint using SMS:

  1. Install and register the SMS connector.
  2. Create an application from SMS server catalog.
  3. Configure replication settings for your SharePoint farm.
  4. Configure launch settings.
  5. Start the replication.
  6. Launch the SharePoint application in AWS.

Install and register the Server Migration Connector

The Server Migration Connector is a VM that you install in your on-premises virtualization environment. The supported platforms are VMware vSphere, Microsoft Hyper-V/SCVMM, and Microsoft Azure. Follow the links below to install in your environment:

For example, VMWare is a common scenario. First, you must set up Server Migration Connector and import server catalog:

  1. Login to AWS Server Migration Service Console, and click to Connectors.
  2. Download .ova file and deploy it to your VMWare environment using vSphere client from AWS Server Migration Setup page.
  3. Install OVA file on your VMWare environment. This is your AWS SMS Connector.
  4. Use Connector Host IP address access Connector page and start five step registration process. Refer AWS documentation, Install the Server Migration Connector on VMWare.
  5. Follow the five-step process of registration. Here, you set up the password and network configuration between the connector virtual machine and AWS accounts.
    Setup network info

    Setup network info

    vCenter credentials

    vCenter credentials

    At the end, provide your vCenter host name and credentials.

    In setup, provide all details of the VMWare and AWS environment to the connector. After establishing the connection, the connector setup looks like this in the AWS Management Console:

    SMS Connectors

    SMS Connectors

Create an application from SMS server catalog

After configuring the connector and selecting “Import Server Catalog”, you are able to view the server catalog in AWS Server Migration Service console. To migrate your SharePoint application, select the application server, SQL Server, and Active Directory server from the server catalog.

Depending on the application architecture, you can group these servers to apply server-specific configuration settings and select appropriate instance types. Here are the steps:

  1. Navigate to the application feature of SMS and create a “new application” for the SharePoint farm. Provide the application name, description, and IAM role.

    Create new application - Application settings

    Create new application – Application settings

  2. Select servers to migrate from the catalog.
  3. Create different groups for these servers in your console. For this, select servers and choose Add servers to group. This helps in defining different instance types and run user-defined PowerShell scripts for all servers in a group during application launch. You may create different groups for the application, web frontend, database, and Active Directory servers. In the below example, there are two groups – one for application servers, and the other for database servers. This process assumes that you have the authentication services servers already in place and operational in AWS. For more information on SharePoint authentication services, Active Directory and Domain Services, Refer Active Directory Domain Services on AWS Deployment Scenario and Architecture.

    Create new application - Add servers to groups

    Create new application – Add servers to groups

  4. Add tags, per your organization tagging strategy or policy.
  5. Review your application and click “Next” when it is ready for replication settings configurations.

Configure replication settings

  1. Define “replication job type”, “when to start replication job”, and “automatic AMI deletion” based on your requirements. Choose Next.
  2. On the Configure server-specific settings page, in the License type column, select the license type for AMIs created from the replication job. Windows Servers can use either an AWS-provided license or Bring Your Own License (BYOL). Check Microsoft Licensing to review the licensing options. You can also choose Auto to allow AWS SMS to detect the source-system operating system (OS) and apply the appropriate license to the migrated virtual machine. Choose Next.

    Server-specific settings

    Server-specific settings

  3. Review your application replication setting and choose Configure Launch Settings.

Configure launch settings

An important aspect of migration is how this application should launch on EC2. This is configured on this page of SMS:

  1. On the Configure launch settings page, for the IAM CloudFormation role, provide an IAM role for launch settings. Refer AWS Documentation on IAM Roles for AWS SMS.

    Configure launch settings

    Configure launch settings

  2. Under Specify launch order, configure a launch order for your groups. For this SharePoint application, you may prefer Active Directory first, followed by the SQL database, and then the application servers.
  3. Under Configure launch settings for the application, edit the server settings individually:
    Target instance details

    Target instance details

    • Logical ID: AWS CloudFormation resource ID. This is the logical ID of the CloudFormation template that AWS SMS generates for the application. A value is created automatically when you use the console, but you must supply it manually when using the API, CLI, or SDKs. For more information, see Resources in the AWS CloudFormation User Guide.
    • Instance type: specifies the EC2 instance type on which to launch the server.
    • Key pair: specifies the SSH key pair for access to the server.
    • Configuration script: a script to run configuration commands at the startup of EC2 instances launched as part of an application. This is important for your SharePoint migration, as you can provide registry settings and configuration settings using PowerShell script for your SharePoint servers and SQL database servers.

    For example, the PowerShell script below retrieves the IP address of the current SQL Server and replaces the old SQL Server IP in the SharePoint configuration database and connection strings. You can automate many SharePoint configuration tasks using PowerShell scripts.

      Start-Transcript -Path "C:\UserData.log" -Append
      $oldIP = <<Your old IP goes here>>
      $newIP = ([System.Net.Dns]::GetHostAddresses("sp-ip-sql-server.aws.local")).IPAddressToString
      $registryPath = "HKLM:\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\16.0\Secure\ConfigDB"
      $Name = "dsn"
      $regValue = (Get-ItemProperty -Path $registryPath -Name $Name).dsn
      $updatedRegValue = $regValue.Replace($oldIP, $newIP)
      New-ItemProperty -Path $registryPath -Name $Name -Value $updatedRegValue -PropertyType String -Force | Out-Null
  4. Configure the target network: VPC, subnets and security groups. Refer to the SharePoint on AWS documentation (Reference Deployment) for more guidance. The network topology varies based on platform requirements. Here is a reference architecture for SharePoint on AWS:

    Architecture topology

    Architecture topology

Start application replication

To start replication, select Start Replication under the Actions menu on the Applications page. The replication time depends on the amount of data replicated and available network bandwidth. On the application details page, you can observe the status of the replication in the Replication status field. If the replication fails, the status message field shows the reason.

Start replication

Start replication

Launch SharePoint in AWS

  1. On the Application page, choose Actions, Launch application. A replication job must complete before you perform this action.
  2. In the Launch application window, choose Launch. On the application details page, you can observe the status of the launch in the Launch status field. If the launch fails, you are able to find the reason in the status message field. You can also generate a CloudFormation template and download this template to use in different AWS accounts.

    Launch application

    Launch application

Test your migration

When the SharePoint application is launched, you can connect to Amazon EC2 instances based servers via Remote Desktop Protocol (RDP). You can access the application based on your Internet Information Services (IIS) Server settings runs on SharePoint Web Front End (WFE) application server (on Amazon EC2).  It is also recommended to investigate optimizations using the AWS Compute Optimizer. For this blog, we have not verified the migration steps with SharePoint 2007 and 2010.

Conclusion

AWS Server Migration Service simplifies SharePoint application migration. Using AWS SMS, you can easily migrate a SharePoint farm and reduce your migration timeline using the launch setting and launch order features.

To learn more, watch Application migration Using AWS Server Migration Service (SMS) or view a demo on the AWS Online Tech Talks Channel. If you have feedback, let us know in the comments section below.

Running Web Applications on Amazon EC2 Spot Instances

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/running-web-applications-on-amazon-ec2-spot-instances/

Author: Isaac Vallhonrat, Sr. EC2 Spot Specialist SA

Amazon EC2 Spot Instances allow customers to save up to 90% compared to On-Demand pricing by leveraging spare EC2 capacity. Spot Instances are a perfect fit for fault tolerant workloads that are flexible to run on multiple instance types such as batch jobs, code builds, load tests, containerized workloads, web applications, big data clusters, and High Performance Computing clusters. In this blog post, I walk you through how-to and best practices to run web applications on Spot Instances, so you can benefit from the scale and savings that they provide.

As Spot Instances are interruptible, your web application should be stateless, fault tolerant, loosely coupled, and rely on external data stores for persistent data, like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, etc.

Spot Instances recap

While Spot Instances have been around since 2009, recent updates and service integrations have made them easier than ever to integrate with your workloads. Before I get into the details of how to use them to host a web application, here is a brief recap of how Spot Instances work:

First, Spot Instances are a purchasing option within EC2. There is no difference in hardware when compared to On-Demand or Reserved Instances. The only difference is that these instances can be reclaimed with a two-minute notice if EC2 needs the capacity back.

A set of unused EC2 instances with the same instance type, operating system, and Availability Zone is a Spot capacity pool. Spot Instance pricing is set by Amazon EC2 and adjusts gradually based on long-term trends in supply and demand of EC2 instances in each pool. You can expect Spot pricing to be stable over time, meaning no sudden spikes or swings. You can view historical pricing data for the last three months in both the EC2 console and via the API. The following image is an example of the pricing history for m5.xlarge instances in the N. Virginia (us-east-1).

spot instance pricing history

Figure 1. Spot Instance pricing history for an m5.xlarge Linux/UNIX instance in us-east-1. Price changes independently for each Availability Zone and in small increments over time based on long term supply and demand.

As each capacity pool is independent from each other, I recommend you spread the fleet of instances that power your workload across multiple Availability Zones (AZs) and be flexible to use multiple instance types. This way, you effectively increase the amount of spare capacity you can use to launch Spot Instances and are able to replace interrupted instances with instances from other pools.

The best way to adhere to Spot Instance flexibility best practices and managing your fleet of instances is using an Amazon EC2 Auto Scaling group. This allows you to combine multiple instance types and purchase options. Once you identify a list of suitable instance types for your workload, you can use Auto Scaling features to launch a combination of On-Demand and Spot Instances from optimal Spot Instance pools in each Availability Zone.

Stateless web applications are a natural fit for Spot Instances as they are fault tolerant, used to handle interruptions through scale-in events in an Auto Scaling group, and commonly run across multiple Availability Zones. It’s also normally easy to implement instance type flexibility on web applications, so they tap on Spot Instance capacity from multiple pools.

Identifying suitable instance types for your web application

Being instance type flexible is key to being successful with Spot Instances. At the time of this writing, AWS provides more than 270 instance types, so it’s likely that your web application can run on multiple of them. For example, if your application runs on the m5.xlarge instance type, it can also run on the m5d.xlarge instance type as it is the same instance type, but with local SSDs. Also, running your application on the r5.xlarge or r5d.xlarge performs similarly since they use the same processor family. Instead, your application runs on an instance with more memory and you benefit from the savings Spot Instances provide.

You may be able to qualify additional similarly-sized instance types from other families or generations, like the AMD-based variants m5a.xlarge and r5a.xlarge, or the higher bandwidth variants m5n.xlarge and r5n.xlarge. These may present some performance differences compared to other types due to their hardware and/or the virtualization system. However, Application Load Balancer has mechanisms in place to spread the load across your instances according to the number of outstanding requests they are processing. This means, instances handling longer requests or that have lower processing power are not overloaded. This allows you to acquire capacity from even more Spot capacity pools regardless of hardware differences. Later in the blog post, I dive into the details on the load balancing mechanisms.

Configure your Auto Scaling group

Now that you have qualified multiple instance types to run your web application, you need to create an EC2 Auto Scaling group.

The first step is to create a Launch Template where you specify configuration for your fleet of instances including the AMI, Security Groups, Tags, user data scripts, etc. You configure a single instance type on the template, here you configure the instance type you commonly use to run your web application. Also, do not mark the Request Spot Instances checkbox on the template. You configure multiple instance types and the Spot Instance settings when configuring the Auto Scaling group. Here’s my Launch Template:

launch templates console

Figure 2. Launch templates console

Now, go to the EC2 Auto Scaling console to create an Auto Scaling group. In the wizard, select the launch template you created and then click next. Then, on the second step of the wizard, select Combine purchase options and instance types. You can see the configuration options in the following image.

combine EC2 purchase options

Figure 3. Combine purchase options and instance types configuration.

Here is a detailed look at the configuration details:

  • Optional On-Demand Base: This parameter specifies a baseline of On-Demand Instances within your Auto Scaling group. This is also handy if you have Reserved Instances or Savings Plans to cover your compute baseline, since On-Demand Instances apply towards your Savings Plans or the instances matching your reservation are billed as Reserved Instances.
  • On-Demand percentage above base: Once the size of the Auto Scaling group grows over the (optional) On-Demand base capacity, this parameter defines the percentage of instances that are On-Demand and Spot Instances. The default settings are a good starting point, set at 70% On-Demand and 30% Spot. You can adjust these percentages as suitable for your workload.
  • Spot Allocation Strategy per Availability Zone: This defines the logic that Auto Scaling uses to select which instance type to launch in each Availability Zone when launching Spot Instances. The possible Spot allocation strategies are:
    • Capacity Optimized: This strategy allocates your instances from the Spot Instance pool with optimal capacity for the number of instances that are launching. This is the default selection and the recommended one, as launching from capacity pools with optimal capacity can lower the chance of interruption. Every time EC2 Auto Scaling scales out, the strategy is evaluated, so as your Auto Scaling group scales out and scales in, your instances are recycled and you make use of Spot Instances in the optimal pools.
    • Lowest price: This strategy allocates your instances from the number (N) of Spot Instance pools that you specify and from the pools with the lowest price per unit at the time of fulfillment.

Next, you choose your instance types from the Instance Types section of the wizard. Here, there is a primary instance type inherited from your Launch Template. The console then automatically populates a list of same size instance types across other families and generations that could also fit your web application requirements. You can add or remove instance types as you see fit.

Selection of instance types

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 4. Selection of instance types

Note: There is an option to combine instance types of different sizes which is very helpful for queue worker nodes and similar workloads. We won’t cover this in this blog post as it is not relevant for web applications. You can read more about this feature here.

Once you select your instance types, your Auto Scaling group uses the configured Spot Instance allocation strategy to decide which Spot Instance types to launch on each Availability Zone. By specifying multiple instance types, EC2 Auto Scaling replenishes capacity from other Spot Instance pools applying your configured allocation strategy if some of your instances get interrupted. This maintains the size of your fleet and keeps your service available.

For the On-Demand part of your Auto Scaling group, the allocation strategy is prioritized. This means that EC2 Auto Scaling launches the primary instance type, and moves to the next types in the order you selected, in the unlikely event it cannot launch the capacity. If you have Reserved Instances to cover your compute baseline, set the instance type you reserved as the primary instance type in your Auto Scaling group, so the On-Demand Instances get the reservation discounts. You can learn more about how Reserved Instances are applied here.

After completing this step, complete the Auto Scaling group creation wizard, and you can move on to set up your Application Load Balancer.

Load balancing with Application Load Balancer

To balance the load across the fleet of instances running your web application you use an Application Load Balancer (ALB). EC2 Auto Scaling groups are fully integrated with ALB and Target Groups that manage the fleet of instances fronted by ALB.

Target groups are set up by default with a deregistration delay of 300 seconds. This means that when EC2 Auto Scaling needs to remove an instance out of the fleet, it first puts the instance in draining state, informs ALB to stop sending new requests to it, and allows the configured time for in-flight requests to complete before the instance is finally deregistered. This way, removing an instance from the fleet is not perceptible to your end users.

As Spot Instances are interrupted with a 2 minute warning, you need to adjust this setting to a slightly lower time. Recommended values are 90 seconds or less. This allows time for in-flight requests to complete and gracefully close existing connections before the instance is interrupted.

At the time of this writing, EC2 Auto Scaling is not aware of Spot Instance interruptions and does not trigger draining automatically when a Spot Instance is going to be interrupted. However, you can put the instance in draining state by catching the interruption notice and calling the EC2 Auto Scaling API. I cover this in more detail in the next section.

Target Groups allow you to configure a routing algorithm, which by default is Round Robin. As with Spot Instances you combine multiple instance types in your fleet, your web application may see some performance differences if they use different hardware and in some cases a different virtualization platform (Xen for the older generation instances and AWS Nitro for the newer ones). While the impact to response time may be negligible to the end user, you need to ensure your instances handle load fairly accounting for the processing power differences of each instance type.

With the Least Outstanding Requests load balancing algorithm, instead of distributing the requests across your fleet in a round-robin fashion, as a new request comes in, the load balancer sends it to the target with the least number of outstanding requests. This way backend instances with lower processing capabilities or processing long-standing requests are not burdened with more requests and the load is spread evenly across targets. This also helps the new targets effectively take load from overloaded targets.

These settings can be configured on the Target Groups section of the EC2 Console and editing the attributes of your target group.
application load balancer attributes

Figure 5. ALB Target group attribute configuration.

Your architecture will look like the following image.

stateless web application architecture

Figure 6. Stateless web application hosted on a combination of On-Demand and Spot Instances

Leveraging Instance Termination Notices for graceful termination

When a Spot Instance is going to be interrupted, EC2 triggers a Spot Instance interruption notice that is presented via the EC2 Metadata endpoint on the instance. It can also be caught by an Amazon EventBridge rule, for example, trigger an AWS Lambda function. You can find the specific documentation here.

By catching the interruption notice, you can take actions to gracefully take the instance out of the fleet without impacting your end users. For example, stop receiving new requests from the Load Balancer, allowing in-flight requests to finish and launch replacement capacity.

Below is a sample Spot Instance Interruption Notice caught by Amazon EventBridge:

{
  "version": "0",
  "id": "72d6566c-e6bd-117d-5bbc-2779c679abd5",
  "detail-type": "EC2 Spot Instance Interruption Warning",
  "source": "aws.ec2",
  "account": "12345678910",
  "time": "2020-03-06T08:25:40Z",
  "region": "eu-west-1",
  "resources": [
    "arn:aws:ec2:eu-west-1a:instance/i-0cefe48f36d7c281a"
  ],
  "detail": {
    "instance-id": "i-0cefe48f36d7c281a",
    "instance-action": "terminate"
  }
}

Upon intercepting the interruption event, you can leverage the DetachInstances API call of EC2 Auto Scaling. Invoking this API puts the instance in Draining state, which makes the configured Load Balancer stop sending new requests to the instance that is going to be interrupted. The instance remains in this state during the configured deregistration delay on the Target Group to allow time for in-flight requests to complete.

Target Group console

Figure 7. Target Group console showing instance draining.

Also, if you set the ShouldDecrementDesiredCapacity parameter of the API call to False, EC2 Auto Scaling attempts to launch replacement Spot Instance capacity according to your instance type selection and allocation strategy. Below is an example code snippet calling the Auto Scaling API using Python Boto3.

def detach_instance_from_asg(instance_id,as_group_name):
   try:
       # detach instance from ASG and launch replacement instance
       response = asgclient.detach_instances(
           InstanceIds=[instance_id],
           AutoScalingGroupName=as_group_name,
           ShouldDecrementDesiredCapacity=False)
       logger.info(response['Activities'][0]['Cause'])
   except ClientError as e:
       error_message = "Unable to detach instance {id} from AutoScaling Group {asg_name}. ".format(
           id=instance_id,asg_name=as_group_name)
       logger.error( error_message + e.response['Error']['Message'])
       raise e

In some cases, you may also want to take actions at the operating system level when a Spot Instance is going to be interrupted. For example, you may want to gracefully stop your application so it closes open connections to a database, deregister agents running on the instance or other clean-up activities. You can leverage AWS Systems Manager Run Command to invoke commands on the instance that is going to be interrupted to perform these actions.

You may also want to delay the execution of commands to capitalize on the two minute warning. As a first step, you can check the instance termination time on the instance metadata and wait, for example, until 30 seconds before the interruption. Then, gracefully stop your application and issue other termination commands. Note that there are no charges for using Run Command (limits apply) and the API call is asynchronous so it doesn’t hold your Lambda function running.

Note: To use Run Command you must have the AWS Systems Manager agent running on your instance and meet a set of pre-requisites, like configuring an IAM instance profile. You can find more information here.

We provide you with a sample solution that handles all the steps outlined above here that you can easily deploy on your AWS account.

The sample solution also allows you to optionally execute commands upon Spot Instance interruption by creating an AWS Systems Manager parameter in Parameter Store specific to each Auto Scaling group with the commands you want to run, so you can easily customize termination commands to the needs of each application. You can find detailed configuration instructions in the README.md file. The high-level architecture of this solution is detailed below:

serverless architecture spot interruptions

Figure 8. Serverless architecture to handle Spot Instance interruptions

Tying it all together

Now, that I covered all the relevant features to run web applications on Spot Instances, let’s put them all together and summarize:

  • Using mixed instance types and purchase options in EC2 Auto Scaling groups you can configure a combination of On-Demand and Spot Instances. Leverage this feature to configure multiple instance types to run your web application on Spot Instances, so you can leverage unused capacity from multiple Spot capacity pools and replace interrupted instances.
  • Using the capacity-optimized Spot Instance allocation strategy you can launch Spot Instances from your instance type selection based on the most available Spot Instance capacity pools. This reduces the chances of interruptions. Auto Scaling then replenishes Spot Instance capacity from the optimal pools if some of your instances are interrupted as Spot Instance capacity fluctuates.
  • With Application Load Balancer you can adjust the deregistration delay of your Target Group to a slightly lower time than the Spot Instance termination notice time (e.g. 90 seconds), so when an instance is going to be interrupted, you stop sending new requests to it and leverage the notice time to let in-flight requests complete and avoid impact to your end users. You can also configure the Least Outstanding Requests load balancing algorithm in your ALB Target Group so the load is distributed evenly across your fleet accounting for differences in processing power for different instance families.
  • You can leverage Amazon EventBridge, AWS Lambda and AWS Systems Manager Run Command to take interruption handling actions like put the instance in draining state, request EC2 Auto Scaling to launch a replacement instance and run commands to gracefully shutdown your application or trigger clean-up activities. You can also subscribe an SNS topic or other Lambda functions to the interruption event for alarming, logging or other purposes.

I hope you found this blog post useful, and if you are not using Spot Instances to run your stateless web applications today, I encourage you to try them to get the cost savings and scale they provide.


Isaac Vallhonrat is a Sr. Specialist Solutions Architect for EC2 Spot at AWS. He helps customers build resilient, fault tolerant and cost-effective architectures with EC2 Spot Instances across multiple workload types like web applications, containers, big data, CI/CD, etc.

 

 

 

New – EC2 M6g Instances, powered by AWS Graviton2

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-m6g-ec2-instances-powered-by-arm-based-aws-graviton2/

Starting today, you can use our first 6th generation Amazon Elastic Compute Cloud (EC2) General Purpose instance: the M6g. The “g” stands for “Graviton2“, our next generation Arm-based chip designed by AWS (and Annapurna Labs, an Amazon company), utilizing 64-bit Arm Neoverse N1 cores.

Graviton 2 chipset

These processors support 256-bit, always-on, DRAM encryption. They also include dual SIMD units to double the floating point performance versus the first generation Graviton, and they support int8/fp16 instructions to accelerate machine learning inference workloads. You can read this full review published by AnandTech for in-depth details.

The M6g instances are available in 8 sizes with 1, 2, 4, 8, 16, 32, 48, and 64 vCPUs, or as bare metal instances. They support configurations with up to 256 GiB of memory, 25 Gbps of network performance, and 19 Gbps of EBS bandwidth. These instances are powered by AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor.

For those of you running typical open-source application stacks, generally deployed on x86-64 architectures, migrating to Graviton will give you up to 40% improvement on cost-performance ratio, compared to similar-sized M5 instances. M6g instances are well-suited for workloads such as application servers, gaming servers, mid-size databases, caching fleets, web tier and the likes.

We ran an extensive preview program to collect customer feedback on this 6th generation instance type. For example, Honeycomb uses 30% fewer instances vs C5, KeyDB observes 65% better performance and 20% cost reduction vs M5, Inter Systems reported 28% performance improvement and 20% cost reduction compared to equivalent M5 instances, and Treasure Data benchmarked 30% increase of performance for 20% cost reduction compared to similarly sized M5 instances. You can read more customer stories including Hotelbeds, Redbox, Nielsen, Mobiuspace, RayGun on the M6g web page.

Several AWS services teams are evaluating these instances too. For example, during their testing, the [elasticcache] service team found that M6g instances deliver up to 50% throughput improvement over M5 instances on Redis.

Major Linux distributions are available on Arm architecture, just select the Amazon Machine Image (AMI) corresponding to the Arm version of your favorite distribution when launching an instance in the AWS Management Console. Be sure to select the 64-bit (Arm) button on the right part of the screen.

Launch ARM AMI in the console

If you choose the AWS Command Line Interface (CLI) instead, use the corresponding image-id for your region, architecture, and distribution. For example, to start an Amazon Linux 2 instance:

AMI_ID=$(aws ssm get-parameters-by-path --path /aws/service/ami-amazon-linux-latest --output text --query "Parameters[?contains(Name, 'ami-hvm-arm64')].Value")
aws ec2 run-instances --image-id $AMI_ID --instance-type m6g.large --key-name my-ssh-key-name --security-group-ids sg-1234567

(you need to adjust the ssh key name and the security group ID in the above command)

Once the instance is started, it behaves like any Amazon Elastic Compute Cloud (EC2) instance:

~ % ssh [email protected]
Warning: Permanently added 'ec2-01-01-01-01.compute-1.amazonaws.com,01.01.01.01' (ECDSA) to the list of known hosts.
Last login: Wed Apr 22 12:26:44 2020 from 01-01-01-01.amazon.com

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
[[email protected] ~]$ uname -a
Linux ip-172-31-16-155.ec2.internal 4.14.171-136.231.amzn2.aarch64 #1 SMP Thu Feb 27 20:25:45 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

The Arm software ecosystem is broad and deep, from Linux distributions (Amazon Linux 2, Ubuntu, Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Fedora, Debian, FreeBSD), to language runtimes (Java with Amazon Corretto , NodeJS, Python, Go,…), container services (Docker, Amazon ECS, Amazon Elastic Kubernetes Service, Amazon Elastic Container Registry), agents (Amazon CloudWatch, AWS Systems Manager, Amazon Inspector), developer tools (AWS Code Suite, Jenkins, GitLab, Chef, Drone.io, Travis CI), and security & monitoring solutions such as Datadog, Crowdstrike, Qualys, Rapid7, Tenable, or Honeycomb.io.

You will find Arm versions of commonly used software packages available for installation through the same mechanisms that you currently use (yum, apt-get, pip, npm …). While some applications may require re-compilation, the vast majority of applications that are based on interpreted languages (such as Java, NodeJS, Python, Go) should run unmodified on M6g instances. In the rare cases where you will need to recompile or debug code, we have assembled some resources to help you to get started.

We are not going to stop at general purposes M6g instances, compute optimized C6g instances and memory optimized R6g instances are coming soon, stay tuned.

Now it’s you turn to give it a try in one the following AWS Regions : US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Tokyo).

As usual, let us know your feedback.

— seb

EC2 Price Reduction – For EC2 Instance Saving Plans and Standard Reserved Instances

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/ec2-price-reduction-for-ec2-instance-saving-plans-and-standard-reserved-instances/

It is my great pleasure to tell you about a price reduction for Amazon Elastic Compute Cloud (EC2) customers who plan to use, Standard Reserved Instances or EC2 Instance Saving Plans. The price changes are already in effect, and so anyone buying news RIs or a new EC2 Instance Saving Plan will be able to take advantage of the lower prices.

Our engineering investments, coupled with our scale and our time-tested ability to manage our capacity, allow us to identify and pass on cost savings to you.

The price reduction you receive, depends on the region you choose, whether you take out a 1 or 3 year term and finally the instance family you commit to in your agreement. Price reductions vary from between 1% to a massive 18% on what you were previously paying. Below I’ve given a snapshot of some of the savings across the M5, C5, and R5 instance types, however there are also price reductions for the instance types C5n, C5d, M5a, M5n, M5ad, M5dn, R5a, R5n, R5d, R5ad, R5dn, T3, T3a, Z1d, and A1.

Price Reduction for 1 Year TermsPrice Reduction for 3 Year Terms
RegionM5C5R5M5C5R5
Europe (Stockholm)10%8%0%13%11%5%
Middle East (Bahrain)10%8%0%13%11%5%
Asia Pacific (Mumbai)2%7%0%2%5%4%
Europe (Paris)10%8%0%13%11%5%
US East (Ohio)2%1%0%2%0%5%
Europe (Ireland)10%8%0%13%11%5%
Europe (Frankfurt)12%9%0%18%13%5%
South America (São Paulo)0%12%0%4%7%4%
Asia Pacific (Hong Kong)3%7%0%2%4%5%
US East (N. Virginia)2%0%0%2%0%5%
Asia Pacific (Seoul)0%7%0%6%12%5%
Asia Pacific (Osaka)10%12%0%14%16%5%
Europe (London)10%8%0%14%11%5%
Asia Pacific (Tokyo)10%12%0%14%16%5%
AWS GovCloud (US-East)8%0%0%5%0%5%
AWS GovCloud (US-West)8%0%0%5%0%5%
US West (Oregon)2%0%0%2%0%5%
US West (N. California)16%8%0%15%7%5%
Asia Pacific (Singapore)2%6%0%2%4%5%
Asia Pacific (Sydney)9%1%0%13%3%5%
Canada (Central)1%6%0%2%0%5%

There are a few caveats that I’d like to make you aware of, firstly, this price reduction is not available for Convertible Reserved Instances, Compute Saving Plans or On-Demand instances. Secondly, Windows instances will see a different price reduction due to the licensing involved.

The price reduction is available in all regions effective immediately. Going forward, you will pay the new lower price for EC2.

— Martin

Serving Billions of Ads in Just 100 ms Using Amazon Elasticache for Redis

Post Syndicated from Rodrigo Asensio original https://aws.amazon.com/blogs/architecture/serving-billions-of-ads-with-amazon-elasticache-for-redis/

This post was co-written with Lucas Ceballos, CTO of Smadex

Introduction

Showing ads may seem to be a simple task, but it’s not. Showing the right ad to the right user is an incredibly complex challenge that involves multiple disciplines such as artificial intelligence, data science, and software engineering. Doing it one million times per second with a 100-ms constraint is even harder.

In the ad-tech business, speed and infrastructure costs are the keys to success. The less the final user waits for an ad, the higher the probability of that user clicking on the ad. Doing that while keeping infrastructure costs under control is crucial for business profitability.

About Smadex

Smadex is the leading mobile-first programmatic advertising platform specifically built to deliver best user acquisition performance and complete transparency.

Its state-of-the-art digital signal processing (DSP) technology provides advertisers with the tools they need to achieve their goals and ROI, with measurable results from web forms, post-app install events, store visits, and sales.

Smadex advertising architecture

What does showing ads look like under the hood? At Smadex, our technology works based on the OpenRTB (Real-Time Bidding) protocol.

RTB is a means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction, which is similar to financial markets.

To show ads, we participate in auctions deciding in real time which ad to show and how much to bid trying to optimize the cost of every impression.

High level diagram

  1. The final user browses the publisher’s website or app.
  2. Ad-exchange is called to start a new auction.
  3. Smadex receives the bid request and has to decide which ad to show and how much to offer in just 100 ms (and this is happening one million times per second).
  4. If Smadex won the auction, the ad must be sent and rendered on the publisher’s website or app.
  5. In the end, the user interacts with the ad sending new requests and events to Smadex platform.

Flow of data

As you can see in the previous diagram, showing ads is just one part of the challenge. After the ad is shown, the final user interacts with it in multiple ways, such as clicking it, installing an application, subscribing to a service, etc. This happens during a determined period that we call the “attribution window.” All of those interactions must be tracked and linked to the original bid transaction (using the request_id parameter).

Doing this is complicated: billions of bid transactions must be stored and available so that they can be quickly accessed every time the user interacts with the ad. The longer we store the transactions, the longer we can “wait” for an interaction to take place, and the better for our business and our clients, too.

Detailed diagram

Challenge #1: Cost

The challenge is: What kind of database can store billions of records per day, with at least a 30-day retention capacity (attribution window), be accessed by key-value, and all by spending as little as possible?

The answer is…none! Based on our research, all the available options that met the technical requirements were way out of our budget.

So…how to solve it? Here is when creativity and the combination of different AWS services comes into place.

We started to analyze the time dispersion of the events trying to find some clues. The interesting thing we spotted was that 90% of what we call “post-bid events” (impression, click, install, etc.) happened within one hour after the auction took place.

That means that we can process 90% of post-bid events by storing just one hour of bids.

Under our current workload, in one hour we participate in approximately 3.7 billion auctions generating 100 million bid records of an average 600 bytes each. This adds up to 55 gigabytes per hour, an easier amount of data to process.

Instead of thinking about one single database to store all the bid requests, we decided to split bids into two different categories:

  • Hot Bid: A request that took place within the last hour (small amount and frequently accessed)
  • Cold Bid: A request that took place more than our hour ago (huge amount and infrequently accessed)

Amazon ElastiCache for Redis is the best option to store 55 GB of data in memory, which gives us the ability to query in a key-value way with the lowest possible latency.

Hot Bids flow

Hot Bids flow diagram

  1. Every new bid is a hot bid by definition so it’s going to be stored in the hot bids Redis cluster.
  2. At the moment of the user interaction with the ad, the Smadex tracker component receives an HTTPS notification, including the bid request UUID that originated it.
  3. Based on the date of occurrence extracted from the received UUID, the tracker component can determine if it’s looking for a hot bid or not. If it’s a hot bid, the tracker reads it directly from Redis performing a key-value lookup query.

It’s been easy so far but what to do with the other 29 days and 23 hours we need to store?

Challenge #2: Performance

As we previously mentioned, cold bids are a huge infrequently accessed number of records with only 10% of post-bid events pointing to them. That sounds like a good use case for an inexpensive and slower data store like Amazon S3.

Thanks to the S3 low-cost storage prices combined with the ability to query S3 objects directly using Amazon Athena, we were able to optimize our costs by storing and querying cold bids by implementing a serverless architecture.

Cold Bids Flow

Cold Bids flow diagram

  1. Incoming bids are buffered by Fluentd and flushed to S3 every one minute in JSON format. Every single file flushed to S3 contains all the bids processed by a specific EC2 instance for one minute.
  2. An AWS Lambda function is automatically triggered on every new PutObject event from S3. This function transforms the JSON records to Parquet format and will save it back the S3 bucket, but this time into a specific partition folder based on file creation timestamp.
  3. As seen on the hot bids flow, the tracker component will determine if it’s looking for a hot or a cold bid based on the extracted timestamp of the request UUID. In this case, the cold bid will be retrieved by running an Amazon Athena look-up query leveraging the use of partitions and Parquet format to reduce as much as possible the latency and data that needs to be scanned.

Conclusion

Thanks to this combined approach using different technologies and a variety of AWS services we were able to extend our attribution window from 30 to 90 days while reducing the infrastructure costs by 45%.

 

 

Cost Optimize your Jenkins CI/CD pipelines using EC2 Spot Instances

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/cost-optimize-your-jenkins-ci-cd-pipelines-using-ec2-spot-instances/

Author: Rajesh Kesaraju, Sr. Specialist Solution Architect, EC2 Spot Instances

In this blog post, I go over using Amazon EC2 Spot Instances on continuous integration and continuous deployment (CI/CD) workloads, via the popular open-source automation server Jenkins. I also break down the steps required to adopt Spot Instances into your CI/CD pipelines for cost optimization purposes. In this blog, I explain how to configure your Jenkins environment to achieve significant cost savings by using Spot Instances with the EC2 Fleet Jenkins plugin.

Overview of EC2 Spot, CI/CD, and Jenkins

AWS offers multiple purchasing models for its EC2 instances. This particular blog post focuses on Amazon EC2 Spot Instances, which lets you take advantage of unused EC2 capacity in the AWS Cloud at a steep discount.

You can use Spot Instances for various stateless, fault-tolerant, or flexible applications such as big data, containerized workloads, CI/CD, web servers, high performance computing (HPC), and other test and development workloads.

CI/CD pipelines are familiar to many readers via a popular piece of open-source software called Jenkins. Jenkins’ automation of development, testing and deployment scenarios, courtesy of more than 2000 plugins, plays a key role in many organizations’ software development and delivery ecosystems. Jenkins accelerates software development through multiple stages, including building and documenting, packaging and analytics, staging and deploying, etc.

Lyft began using EC2 Spot Instances for their Jenkins CI pipelines, and discovered they could save up to 90 percent compared to their previous non-Spot EC2 implementations. They moved their entire CI/CD pipeline to EC2 Spot Instances by modifying just four lines of their deployment code.

In this blog, I walk through how to configure your Jenkins environment to achieve significant cost savings by using Spot Instances with the EC2 Fleet Jenkins plugin.

Solution Overview

For the following tutorial, you need both an AWS account and Jenkins downloaded and installed on your system.

This blog post uses Spot Instances. If your Jenkins server runs on On-Demand Instances, you can easily switch to Spot Instances with EC2-Fleet Plugin. Now, let’s look at how this plugin can be configured to make your Jenkins elastically scale up/down depending on pending jobs, and save significantly on compute costs.

Solution

Create a new EC2 key pair

To access the SSH interfaces of your Jenkins instances, you must have an EC2 key pair. Please follow below steps to create a new EC2 key pair.

1. Log in to your AWS Account;
2. Switch to your preferred Region;
3. Provision a new EC2 key pair:

    1. Go to the EC2 console and click on the key pairs option from the left frame.
    2. Click on the Create key pair button;
    3. Provide key pair name and click on the Create button;
    4. Your web browser should download a .pem file – keep this file as it will be required to access the EC2 instances that you create in this workshop. If you’re using a Windows system, convert the .pem file to a PuTTY .ppk file. If you’re not sure how to do this, instructions are available here.

Create an AWS IAM User for EC2-Fleet Plugin

To control Spot Instances from the EC2-Fleet plugin, you first create an IAM user (with programmatic access) in your AWS account. Then configure an IAM policy for AWS permissions. Use the following code to achieve this step.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:*",
                "autoscaling:*",
                "iam:ListInstanceProfiles",
                "iam:ListRoles",
                "iam:PassRole"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

These permissions allow you to configure the plugin, allow programmatic access to AWS resources to create and terminate Spot Instances, and control Auto Scaling Group (ASG) parameters.

In the IAM dashboard, click User and select the Jenkins User you created. Next, click Create access key, and save the Access key ID and Secret access key for use in next steps.

IAM > Users > Jenkins User > Security Credentials > Create Access Key

Create Access Key for Jenkins user fig. 1

Create Access Key for Jenkins user fig. 2

Create an Auto Scaling Group

Auto Scaling groups help you configure your Jenkins EC2-Fleet plugin to control Jenkins build agents and scale up or down depending on the job queue. It also replaces instances that were terminated due to demand spike in specific Spot Instance pools.

Documentation on how to create Auto Scaling Group is here.

Set your ASG to diversify your Spot Fleet across multiple Spot pools to increase your chances of getting a Spot Instance for your Jenkins jobs, and set an allocation strategy. I used “capacity-optimized” as the allocation strategy in the following example.

The “capacity-optimized” option allocates Spot Instances from the deepest pools of available spare capacity, which lowers the chance of interruptions. Alternatively, you may choose the “lowest-price” allocation strategy if you have builds that finish quicker, and the cost of re-processing of failed jobs due to interruption isn’t that significant. Learn more about allocation strategies in this blog.

The Jenkins EC2-Fleet plugin overrides and controls ASG’s capacity configuration. So, start with one instance for now. I also cover a scenario that starts with 0 instances to further minimize costs.

A sample ASG configuration looks like the following image. Notice there are 6 instance types across 3 Availability Zones, this means EC2 Spot capacity is provided from 18 (6×3) Spot Instance pools! This configuration increases the likelihood of getting Spot Instances from the deepest Spot pools at a steep discount.

An example ASG configuration Image

Install and Configure an EC2-Fleet plugin in Jenkins

Install the latest version of EC2-Fleet plugin in Jenkins. This plugin launches Spot Instances using ASG or EC2 Spot Fleet where you can run your build jobs. In this blog, I launch EC2 Spot Instances using ASG.

After installing you see it in the plugin manager. This blog uses current version 2.0.0.

Go to Manage Jenkins > Plugin Manager then install EC2 Fleet Jenkins Plugin

Jenkins Server EC2-Fleet Plugin

In the first part of this solution, you created an AWS user. Now, configure this user in your Jenkins Amazon EC2 Fleet configuration section.

Navigate to Manage Jenkins -> Configure Clouds -> Add a New Cloud -> Amazon EC2 Fleet.

Configure Amazon EC2 Fleet plugin as Cloud Setting

Create a name for your EC2-Fleet plugin configuration. I use Amazon EC2 Spot Fleet. Then configure your AWS Credentials.

Configure ASG in Jenkins EC2-fleet plugin

1. Change the Kind to AWS Credentials;
2. Change the Scope to System (Jenkins and nodes only) – you don’t want your builds to have access to these credentials!
3. At the ID (optional) field. Enter this if need to access this using scripts
4. Provide Access key ID and Secret access key fields, saved when you created Jenkins user before, then click Add.
5. Once you are done adding credentials, select the corresponding AWS Region to your ASG.
6. EC2 Fleet dropdown automatically populates the ASG that you created earlier.
7. Once the ASG is selected, check your configuration. The following image shows this test:

Testing Jenkins EC2 Fleet Plugin configuration

Configure Launcher

1. Change the Kind to SSH Username with private key;
2. Change the Scope to System (Jenkins and nodes only) – you also don’t want your builds to have access to these credentials;
3. Enter ec2-user as the Username.
4. Select the Enter directly button for the Private Key. Open the .pem file that you downloaded previously, and copy the contents of the file to the Key field including the BEGIN RSA PRIVATE KEY and END RSA PRIVATE KEY fields.
5. Verify your launcher looks like as below, and click on the Add button

Configuring Launcher and providing Jenkins credentials

Once your credentials are added, you move on to complete rest of the Launcher configuration.

1. Select the ec2-user option from the Credentials drop-down.
2. Select the “Non verifying Verification Strategy” option from the Host Key Verification Strategy drop-down. Select this option because Spot Instances have a random SSH host fingerprint.

Configuring Jenkins launch agents by ec2-user via SSH

3. Mark the Connect Private check box to ensure that your Jenkins Master always communicates with the Agents via their internal VPC IP addresses (in real-world scenarios, your build      agents would likely not be publicly addressable).
4. Change the Label field to spot-agents.
5. Set the Max Idle Minutes Before Scaledown. In this example, I used AWS launched per-second billing in 2017, so there’s no need to keep a build agent running for too much longer than it’s required.
6. Change the Maximum Cluster Size depending on your need. For example, I set the Maximum Cluster Size to 2.

After saving these configurations, your screen should look similar to the following image.

Configuring Launcher with cluster scaling parameters

Configure Number of Executors

Determine the number of executors based on your build requirements, such as how many builds on average can be executed concurrently on each machine based on machine’s vCPU and RAM allocations.

If you cram many executors into one machine, each build average execution may increase, which slows down the pipeline.

Once you determine optimum executors per machine, any additional pending jobs get executed on scaled out machines by auto scaling.

Some Important aspects about Cluster size settings

Jenkins EC2-Fleet agent settings override ASG settings. So, “Minimum Cluster Size” and “Maximum Cluster Size” values mentioned here override ASG’s settings dynamically.

If you set minimum cluster size as 0, then when there are no pending jobs there won’t be any idle servers after Max Idle Minutes before shut down minutes are met. In this scenario, when there is a new build request, it takes roughly two to five minutes for new EC2 Spot Instances to start processing after boot strapping and installing necessary Jenkins agents.

If your jobs are time-insensitive, this strategy maximizes savings, as you eliminate spending money on idle instances.

Alternatively, you may set “Minimum Cluster Size” as 1 or 2 so you have running instances all the time if you have a need to process several builds/tests a day and/or builds taking very long time and occur on a daily basis.

In this blog the idea is to cost optimize CI/CD pipelines, to avoid idle instances.

Configure Jenkins build Jobs to utilize EC2 Spot Instances

Finally, configure your build job to check “Restrict where this project can be run” and enter “spot-agents” as the label expression. When builds are initiated, they are executed against Spot Instance.

Configuring Jenkins builds to utilize EC2 Spot Instances with Label Expression

At this point, you are ready to run your builds on Spot Instances and save significantly!

Configuring Jenkins server to run on EC2 Spot Instances

Now you started saving on your build agents, is it possible to save on Jenkins server also using Spot Instances? Yes! Let’s see how this can be done with a few simple techniques.

By running your Jenkins server on Spot, you optimize the compute costs associated with the whole Jenkins CI/CD environment. With Spot Instance diversification strategy and making use of ASG features, you can move your Jenkins server also to EC2 Spot Instance.

There is one slight wrinkle in the above process. Jenkins requires persistent data on a local file system, whereas a Spot Instance cannot be guaranteed to be persistent due to the chance of interruptions. Therefore, to switch your Jenkins server over to Spot, you must first move your Jenkins data to an Amazon Elastic File System (EFS) volume — which your Spot Instance can then access.

Amazon Elastic File System provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. More here

Here are the steps to mount EFS volume to an existing Jenkins server, and move its content to Amazon EFS managed store. Then, you can point the Jenkins server to use EFS mount point for its operations and maintain server state. This way when a Spot Instance gets interrupted, server state is not lost, and another Spot Instance can pick up from where the previous server left off.

Here are the steps move data from JENKINS_HOME to Amazon EFS

1. Mount EFS volume:

sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)\
.%FILE-SYSTEM-ID%.efs.<AWS Region>.amazonaws.com:/ (http://efs.<AWS Region>.amazonaws.com/) /mnt

2. Copy existing JENKINS_HOME (/var/lib/Jenkins) content to EFS after shutting down Jenkins server
3. >> sudo chown jenkins:jenkins /mnt
>> sudo cp -rpv /var/lib/jenkins/* /mnt

Once you’ve moved the contents of JENKINS_HOME to Amazon EFS, now it’s safe to run Jenkins Server to EC2 Spot Instance.

Spot Instances can be interrupted by AWS, so you may lose your Jenkins server access momentarily when an interruption occurs.

Since you already externalized the state to Amazon EFS, you don’t lose any previous state. So, when the instance running your Jenkins server gets interrupted, in a matter of a few minutes, ASG replenishes new EC2 Spot Instance to run your Jenkins Server.

To get this into service automatically complete the following steps:

1. Ensure ASG launch instances into target group that is pointed by Application Load Balancer (ALB)
2. If Spot Instance is terminated with the two minute warning, ASG launches a replacement instance. The new Spot Instance will be bootstrapped just as your original Spot Instance was and then mount to EFS as configured in user data of “Launch Template”
3. Configure ASG from Launch Template (Sample UserData section as below)

#!/bin/bash
# Install all pending updates to the system
yum -y update
# Configure YUM to be able to access official Jenkins RPM packages
wget -O /etc/yum.repos.d/jenkins.repo https://pkg.jenkins.io/redhat-stable/jenkins.repo
# Import the Jenkins repository public key
rpm --import https://pkg.jenkins.io/redhat-stable/jenkins.io.key
# Configure YUM to be able to access contributed Maven RPM packages
wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
# Update the release version in the Maven repository configuration for this mainline release of Amazon Linux
sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo
# Install the Java 8 SDK, Git, Jenkins and Maven
yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel git jenkins apache-maven
# Set the default version of java to run out of the Java 8 SDK path (required by Jenkins)
update-alternatives --set java /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
update-alternatives --set javac /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/javac
# Mount the Jenkins EFS volume at JENKINS_HOME
mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 $(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone).${EFSJenkinsHomeVolume}.efs.${AWSRegion}.amazonaws.com:/ /var/lib/jenkins
# Start the Jenkins service
service jenkins start

That’s it, you are now ready to run your Jenkins server on EC2 Spot Instances and save up to 90% compared to On-Demand price.

Conclusion

You are now ready to leverage the power and flexibility of EC2 Spot Instances.

With a few modifications to your deployment you can significantly reduce your compute costs or accelerate throughput by accessing 10x compute for the same cost.

For users getting started on their Amazon EC2 Spot Instances, we are here to help. Please also share any questions in the comments section below.

Here is where to begin in the Amazon EC2 Spot Instance console — and start transforming your Jenkins workloads today.


About the author

Rajesh Kesaraju

Rajesh Kesaraju is a Sr. Specialist SA for EC2 Spot with Amazon AWS. He helps customers to cost optimize their workloads by utilizing EC2 Spot instances in various types of workloads such as Big Data, Containers, HPC, CI/CD, and Sateless Applications Etc.

Capacity-Optimized Spot Instance Allocation in Action at Mobileye and Skyscanner

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/capacity-optimized-spot-instance-allocation-in-action-at-mobileye-and-skyscanner/

Amazon EC2 Spot Instances were launched way back in 2009. The instances are spare EC2 compute capacity that is available at savings of up to 90% when compared to the On-Demand prices. Spot Instances can be interrupted by EC2 (with a two minute heads-up), but are otherwise the same as On-Demand instances. You can use Amazon EC2 Auto Scaling to seamlessly scale Spot Instances, On-Demand instances, and instances that are part of a Savings Plan, all within a single EC2 Auto Scaling Group.

Over the years we have added many powerful Spot features including a Simplified Pricing Model, the Capacity-Optimized scaling strategy (more on that in a sec), integration with the EC2 RunInstances API, and much more.

EC2 Auto Scaling lets you use two different allocation strategies for Spot Instances:

lowest-price – Allocates instances from the Spot Instance pools that have the lowest price at the time of fulfillment. Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. As the lowest-price strategy does not account for pool capacity depth as it deploys Spot Instances, this allocation strategy is a good fit for fault-tolerant workloads with a low cost of interruption.

capacity-optimized – Allocates instances from the Spot Instance pools with the optimal capacity for the number of instances that are launching, making use of real-time capacity data. This allocation strategy is appropriate for workloads that have a higher cost of interruption. It thrives on flexibility, empowered by the instance families, sizes, and generations that you choose.

Today I want to show you how you can use the capacity-optimized allocation strategy and to share a pair of customer stories with you.

Using Capacity-Optimized Allocation
First, I switch to the the new Auto Scaling console by clicking Go to the new console:

The new console includes a nice diagram to explain how Auto Scaling works. I click Create Auto Scaling group to proceed:

I name my Auto Scaling group and choose a launch template as usual, then click Next:

If you are not familiar with launch templates, read Recent EC2 Goodies – Launch Templates and Spread Placement, to learn all about them.

Because my launch template does not specify an instance type, Combine purchase options and instance types is pre-selected and cannot be changed. I ensure that the Capacity-optimized allocation strategy is also selected, and set the desired balance of On-Demand and Spot Instances:

Then I choose a primary instance type, and the console recommends others. I can choose Family and generation (m3, m4, m5 for example) flexibility or just size flexibility (large, xlarge, 12xlarge, and so forth) within the generation of the primary instance type. As I noted earlier, this strategy thrives on flexibility, so choosing as many relevant instances as possible is to my benefit.

I can also specify a weight for each of the instance types that I decide to use (this is especially useful when I am making use of size flexibility):

I also (not shown) select my VPC and the desired subnets within it, click Next, and proceed as usual. Flexibility with regard to subnets/Availability Zones is also to my benefit; for more information, read Auto Scaling Groups with Multiple Instance Types and Purchase Options.

And with that, let’s take a look at how AWS customers Skyscanner and Mobileye are making use of this feature!

Capacity-Optimized Allocation at Skyscanner
Skyscanner is an online travel booking site. They run the front-end processing for their site on Spot Instances, making good use of up to 40,000 cores per day. Skyscanner’s online platform runs on Kubernetes clusters powered entirely by Spot Instances (watch this video to learn more). Capacity-optimized allocation has delivered many benefits including:

Faster Time to Market – The ability to access more compute power at a lower cost has allowed them to reduce the time to launch a new service from 6-7 weeks using traditional infrastructure to just 50 minutes on the AWS Cloud.

Cost Savings – Diversifying Spot Instances across Availability Zones and instance types has resulted in an overall savings of 70% per core.

Reduced Interruptions – A test that Skyscanner ran in preparation for Black Friday showed that their old configuration (lowest-price) had between 200 and 300 Spot interruptions and the new one (capacity-optimized) had between 10 and 15.

Capacity-Optimized Allocation at Mobileye
Mobileye (an Intel company) develops vision-based technology for self-driving vehicles and advanced driver assistant systems. Spot Instances are used to run their analytics, machine learning, simulation, and AWS Batch workloads packaged in Docker containers. They typically use between 200K and 300K concurrent cores, with peak daily usage of around 500K, all on Spot. Here’s a instance count graph over the course of a day:

After switching to capacity-optimized allocation and making some changes in accord with our Spot Instance best practices, they reduced the overall interruption rate by about 75%. These changes allowed them to save money on their compute costs while increasing application uptime and reducing their time-to-insight.

To learn more about how Mobileye uses Spot Instances, watch their re:Invent presentation, Navigating the Winding Road Toward Driverless Mobility.

Jeff;

 

10 things you can do today to reduce AWS costs

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/10-things-you-can-do-today-to-reduce-aws-costs/

This post is contributed by Shankar Ramachandran, SA Specialist, Cost Optimization

Introduction

AWS’s breadth of services and pricing options offer the flexibility to effectively manage your costs, and still keep the performance and capacity per your business requirement. While the fundamental process of cost optimization on AWS remains the same – monitor your AWS costs and usage, analyze the data to find savings, take action to realize the savings; in this blog I will take a more tactical approach of reducing cost with changes in user demand.

Before you start

Before taking any actions to reduce cost, find out the costs of AWS services you are consuming. The AWS Free Tier provides customers the ability to explore and try out AWS services free of charge up to specified limits for each service. Use the steps in this video to check if you are exceeding the free tier limit.

Next, use AWS Cost Explorer to view and analyze your AWS costs and usage. This tool provides default reports that help you visualize cost and usage at a high level (e.g. AWS  accounts, AWS service),  or at the resource level (e.g. EC2 instance ID). Start by identifying the top accounts where your costs incur using the “Monthly costs by linked account report”. Next, identify the top services that contribute to the costs within those accounts. You can do this by using the “Monthly costs by service report”. Use the hourly and resource level granularity and tags to filter and identify the top resources incurring costs.

Cost Explorer resource level granularity

Now, you should have an understanding of your AWS costs and usage. Next, let’s walk through 10 tactical things you can do today using existing AWS tools and services to reduce your AWS costs.

#1 Identify Amazon EC2 instances with low-utilization and reduce cost by stopping or rightsizing

Use AWS Cost Explorer Resource Optimization to get a report of EC2 instances that are either idle or have low utilization. You can reduce costs by either stopping or downsizing these instances. Use AWS Instance Scheduler to automatically stop instances. Use AWS Operations Conductor to automatically resize the EC2 instances (based on the recommendations report from Cost Explorer).

Cost Explorer rightsizing recommendations

Use AWS Compute Optimizer to look at instance type recommendations beyond downsizing within an instance family. It gives downsizing recommendations within or across instance families, upsizing recommendations to remove performance bottlenecks, and recommendations for EC2 instances that are parts of an Auto Scaling group.

#2 Identify Amazon EBS volumes with low-utilization and reduce cost by snapshotting then deleting them

EBS volumes that have very low activity (less than 1 IOPS per day) over a period of 7 days indicate that they are probably not in use. Identify these volumes using the Trusted Advisor Underutilized Amazon EBS Volumes Check. To reduce costs, first snapshot the volume (in case you need it later), then delete these volumes. You can automate the creation of snapshots using the Amazon Data Lifecycle Manager. Follow the steps here to delete EBS volumes.

#3 Analyze Amazon S3 usage and reduce cost by leveraging lower cost storage tiers

Use S3 Analytics to analyze storage access patterns on the object data set for 30 days or longer. It makes recommendations on where you can leverage S3 Infrequently Accessed (S3 IA) to reduce costs. You can automate moving these objects into lower cost storage tier using Life Cycle Policies. Alternately, you can also use S3 Intelligent-Tiering, which automatically analyzes and moves your objects to the appropriate storage tier.

#4 Identify Amazon RDS, Amazon Redshift instances with low utilization and reduce cost by stopping (RDS) and pausing (Redshift)

Use the Trusted Advisor Amazon RDS Idle DB instances check, to identify DB instances which have not had any connection over the last 7 days. To reduce costs, stop these DB instances using the automation steps described in this blog post. For Redshift, use the Trusted Advisor Underutilized Redshift clusters check, to identify clusters which have had no connections for the last 7 days, and less than 5% cluster wide average CPU utilization for 99% of the last 7 days. To reduce costs, pause these clusters using the steps in this blog.

Pause underutilized Redshift clusters

#5 Analyze Amazon DynamoDB usage and reduce cost by leveraging Autoscaling or On-demand

Analyze your DynamoDB usage by monitoring 2 metrics, ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits, in CloudWatch. To automatically scale (in and out) your DynamoDB table, use the AutoScaling feature. Using the steps here, you can enable AutoScaling on your existing tables. Alternately, you can also use the on-demand option. This option allows you to pay-per-request for read and write requests so that you only pay for what you use, making it easy to balance costs and performance.

DynamoDB read/write capacity mode

#6 Review networking and reduce costs by deleting idle load balancers

Use the Trusted Advisor Idle Load Balancers check to get a report of load balancers that have RequestCount of less than 100 over the past 7 days. Then, use the steps here, to delete these load balancers to reduce costs. Additionally, use the steps provided in this blog, review your data transfer costs using Cost Explorer.

Cost Explorer filters for EC2 Data Transfer

If data transfer from EC2 to the public internet shows up as a significant cost, consider using Amazon CloudFront. Any image, video, or static web content can be cached at AWS edge locations worldwide, using the Amazon CloudFront Content Delivery Network (CDN). CloudFront eliminates the need to over-provision capacity in order to serve potential spikes in traffic.

#7 Use Amazon EC2 Spot Instances to reduce EC2 costs

If your workload is fault-tolerant, use Spot instances to reduce costs by up to 90%. Typical workload examples include big data, containerized workloads, CI/CD, web servers, high-performance computing (HPC), and other test & development workloads. Using EC2 Auto Scaling you can launch both On-Demand and Spot instances to meet a target capacity. Auto Scaling automatically takes care of requesting for Spot instances and attempts to maintain the target capacity even if your Spot instances are interrupted. You can learn more about Spot by watching this 2019 re:Invent session:

#8 Review and modify EC2 AutoScaling Groups configuration

An EC2 Autoscaling group allows your EC2 fleet to expand or shrink based on demand. Review your scaling activity using either the describe-scaling-activity CLI command or on the console using the steps described here. Analyze the result to see if the scaling policy can be tuned to add instances less aggressively. Also review your settings to see if the minimum can be reduced so as to serve end user requests but with a smaller fleet size.

#9 Use Reserved Instances (RI) to reduce RDS, Redshift, ElastiCache and Elasticsearch costs

Use one year, no upfront RIs to get a discount of up to 42% compared to On-Demand pricing. Use the recommendations provided in AWS Cost Explorer RI purchase recommendations, which is based on your RDS, Redshift, ElastiCache and Elasticsearch usage. Make sure you adjust the parameters to one year, no upfront. This requires a one year commitment but the break-even point is typically seven to nine months. I recommend doing #4 before #9

#10 Use Compute Savings Plans to reduce EC2, Fargate and Lambda costs

Compute Savings Plans automatically apply to EC2 instance usage regardless of instance family, size, AZ, region, OS or tenancy, and also apply to Fargate and Lambda usage. Use one year, no upfront Compute Savings Plans to get a discount of up to 54% compared to On-Demand pricing. Use the recommendations provided in AWS Cost Explorer, and ensure that you have chosen compute, one year, no upfront options. Once you sign up for Savings Plans, your compute usage is automatically charged at the discounted Savings Plans prices. Any usage beyond your commitment will be charged at regular On Demand rates. I recommend doing #1 before #10.

Savings Plans Recommendations

Whats Next

With these 10 steps, you can save costs on EC2, Fargate, Lambda, EBS, S3, ELB, RDS, Redshift, DynamoDB, ElastiCache and Elasticsearch. I recommend setting up a budget using AWS Budgets, so that you get alerted when your cost and usage changes.

Set a budget to track your AWS cost and usage

 

Setup an alert on forecasted costs

Using Budgets, you can set up an alert on forecasted costs too (apart from actual). This gives you the ability to get ahead of the problem and reduce costs proactively.

Conclusion

To learn more quick cost optimization techniques, watch the on-demand webinar – Nine Ways to Reduce Your AWS Bill. We are here to help you, please reach out to AWS Support and your AWS account team if you need further assistance in optimizing your AWS environment.

Halodoc: Building the Future of Tele-Health One Microservice at a Time

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/halodoc-building-the-future-of-tele-health-one-microservice-at-a-time/

Halodoc, a Jakarta-based healthtech platform, uses tele-health and artificial intelligence to connect patients, doctors, and pharmacies. Join builder Adrian De Luca for this special edition of This is My Architecture as he dives deep into the solutions architecture of this Indonesian healthtech platform that provides healthcare services in one of the most challenging traffic environments in the world.

Explore how the company evolved its monolithic backend into decoupled microservices with Amazon EC2 and Amazon Simple Queue Service (SQS), adopted serverless to cost effectively support new user functionality with AWS Lambda, and manages the high volume and velocity of data with Amazon DynamoDB, Amazon Relational Database Service (RDS), and Amazon Redshift.

For more content like this, subscribe to our YouTube channels This is My Architecture, This is My Code, and This is My Model, or visit the This is My Architecture AWS website, which has search functionality and the ability to filter by industry, language, and service.

Running Simcenter STAR-CCM+ on AWS with AWS ParallelCluster, Elastic Fabric Adapter and Amazon FSx for Lustre

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/running-simcenter-star-ccm-on-aws/

This post is contributed by Anh Tran, Senior HPC Specialist Solution Architect, AWS

Introduction

AWS recently introduced many HPC services that boost the performance and scalability of Computational Fluid Dynamics (CFD) workloads on AWS. These services include: Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and AWS ParallelCluster 2.5.1. In this technical post, I walk through these three services. Additionally, I outline an example of using AWS ParallelCluster to set up an HPC system with EFA and Amazon FSx Lustre to run a CFD workload.  The CFD application that you will set up during this blog post is Simcenter STAR-CCM+  – the predominant CFD application from Siemens.

Service and solution overview

Services

This blog primarily uses three services – Amazon FSx for Lustre, EFA, and AWS ParallelCluster. Let’s dig into each of these services before reviewing the solution.

Amazon FSx for Lustre

In December 2018, AWS released Amazon FSx for Lustre. This is a fully managed, high-performance file system, optimized for fast processing workloads, like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises seamlessly and exceptionally fast. For example, you can launch and run a file system that provides low latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low-latency unleashes innovation at an unparalleled pace.  This blog post uses the latest version of Amazon FSx for Lustre which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user id. Additionally, the latest version also includes a new backup feature that allows you to back up your files to an S3 bucket. I go into more detail of how to take advantage of this at the end of the blog.

Elastic Fabric Adapter

In April of 2019, AWS released EFA. This enables you to run applications requiring high levels of inter-node communications at scale on AWS.

AWS ParallelCluster 2.5.1.

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy-to-use even for if you’re new to the cloud.  AWS recently released AWS ParallelCluster 2.5.1 – which is the version we will use for this blog.

These three AWS HPC components are optimal for CFD applications. Together, they provide simple deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel filesystem. Now, let’s take a look at how these services come together and seamlessly run a real CFD application: Simcenter STAR-CCM+.

Solution

AWS has a long-standing collaboration with Siemens. AWS and Siemens are dedicated to enhancing Siemens’ customer experiences when they run Simcenter STAR-CCM+ apps on AWS. I am excited to walk you through the steps and the best practices for running Simcenter STAR-CCM+, the predominant CFD application from Siemens.

The Simcenter STAR-CCM+ application runs on an HPC system. This system is optimized with EFA and Amazon FSx for Lustre — all of which is managed by AWS ParallelCluster. AWS ParallelCluster simplifies the deployment process to such an extent that you can set up your HPC cluster with a high throughput parallel file system (Amazon FSx for Lustre), a high-throughput and low-latency network interface (EFA), and high-bandwidth network interconnects (100 Gbps using C5n instances) in less than 15 minutes.

Now that you have the services and solution overviews, we can get started. This blog post includes the following steps:

  1. Creating an HPC infrastructure stack on AWS, which will include:
    • How to set up AWS ParallelCluster for best performance
    • How to enable EFA
    • How to enable the Amazon FSx Lustre file system and how to use some basic Amazon FSx for Lustre for STAR-CCM+
    • How to connect to a remote desktop session using NICE DCV
  2. Installing the Simcenter STAR-CCM+ application and how to submit a Simcenter STAR-CCM+ job to HPC cluster.

The following diagram outlines the steps

Steps

Setting up the HPC infrastructure stack

Before you can run Simcenter STAR-CCM+, you need to build an HPC cluster first.  Some best practices that you should consider when setting up a cluster on AWS include:

  • Turn off Hyper-Threading (HT): AWS instances have HT turned on by default.
  • Use EFA and a cluster placement group for your compute fleet to minimize the latency between nodes.
  • Select the right instance type for your compute fleet. Here, I use C5n.18xlarge because of its high-performance CPU, high-bandwidth bandwidth networking, and the EFA network interface capabilities.

With HPC best practices in mind, you can set up your AWS ParallelCluster

Set up AWS ParallelCluster:

$aws s3 mb s3://benchmark-starccm
make_bucket: benchmark-starccm

*Note: As an example, I create an S3 bucket benchmark-starccm, however you should create an S3 bucket with a different name of your choice because the S3 bucket name must be globally unique.

Let’s download the STAR-CCM+ installation file and a case file, then upload them to the S3 bucket that we just created.

  • Download latest Simcenter STAR-CCM+ package from the Siemens portal. It will look something like this: STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
  • Download the Le Mans case file. The Le Mans case is 104 Million cells, which even today is considered large for a CFD case.  The file will look something like this:  LeMans_104M.sim

After downloading the STAR-CCM+ software and LeMans case file, upload them to the S3 bucket created above

aws s3 cp STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip s3://benchmark-starccm
aws s3 cp LeMans_104M.sim s3://benchmark-starccm/

We will use this same S3 bucket to install the Simcenter STAR-CCM+ application later in this tutorial

With all the ground work done, we can now build our HPC cluster. For more detailed instructions, you can consult Getting Started with AWS ParallelCluster.

  • Install AWS ParallelCluster
pip install aws-parallelcluster
  • Configure AWS ParallelCluster with some basic network information such as AWS Region ID, VPC ID, Subnet ID

pcluster configure

Modify your ~/.parallelcluster/config file to include a cluster section that minimally includes the following:

[aws]
aws_region_name = us-east-2

[cluster default]
vpc_settings = public
key_name = <Key-Name>
initial_queue_size = 2
max_queue_size = 100
maintain_initial_size = true
placement_group = DYNAMIC
placement = cluster
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge
cluster_type = ondemand

base_os = centos7
tags = {“Name” : “STARCCM”}

enable_efa = compute
fsx_settings = fsxshared
disable_hyperthreading = true

dcv_settings = hpc-dcv

[vpc public]
master_subnet_id = subnet-<Subnet-ID>
vpc_id = vpc-<VPC-ID>

[global]
update_check = true
sanity_check = true
cluster_template = default

[dcv hpc-dcv]

enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
imported_file_chunk_size = 1024
import_path = s3://benchmark-starccm

export_path = s3://benchmark-starccm

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

  • Now, create your first HPC cluster with the name starccmby running

pcluster create starccm

Your HPC cluster should be ready in about 15 minutes.

While you’re waiting for your cluster to be ready, let’s take a deeper look at what some of the different parameters we used mean for our HPC cluster:

initial_queue_size: We will start with two compute instances after the HPC cluster is up.

max_queue_size: We will limit the maximum compute fleet to 100 instances. This allows us room to scale our jobs up to a large number of cores while putting a limit on the number of compute nodes to help control costs.

base_os: For this blog, we will select centos 7 as a base os. Currently we support Amazon Linux (alinux), Centos 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) with EFA.

master_instance_type: This can be any instance type. Here we choose c5.xlarge because it is inexpensive and relatively fast for the head node.

compute_instance_type: We select C5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC. Note that EFA is currently only available on c5n.18xlarge, c5n.metal, i3en.24xlarge, p3dn.24xlarge, inf1.24xlarge, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, r5n.24xlarge, and p3dn.24xlarge. See the docs for Currently supported instances.

placement_group: We use placement_group to ensure our instances are located as physically close to one another as possible to minimize the latency between compute nodes and take advantage of EFA’s low latency networking.

enable_efa: with just one configuration line, we can easily turn on EFA support for our HPC cluster.

dcv_settings = hpc-dcv: With AWS ParallelCluster 2.5.1 you can use NICE DCV to support your remote visualization needs.

disable_hyperthreading: This setting turns off hyper-threading on the cluster

[fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory will be mounted, the storage capacity for the filesystem, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.

[dcv hpc-dcv]: This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.

  •  After you set up your config file for AWS ParallelCluster, log in and verify that you can access the cluster’s head node
pcluster ssh starccm -i ~/path/to/ssh_key
  • Verify the compute nodes are up. We should see two c5n.18xlarge nodes.
$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
global - - - - - - - - - -
ip-172-31-14-220 lx-amd64 36 2 36 36 0.49 184.6G 11.7G 0.0 0.0
ip-172-31-2-137 lx-amd64 36 2 36 36 0.45 184.6G 11.7G 0.0 0.0
  • Verify the EFA driver has been loaded successfully.

In order to verify if EFA is installed correctly you will need to ssh into one of compute nodes and run :

[[email protected] ~]$ which mpirun
/opt/amazon/openmpi/bin/mpirun

[[email protected] ~]$ fi_info -p efa

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD

At this point, EFA is verified.

Install Simcenter STAR-CCM+ application

Now that the HPC cluster using AWS ParallelCluster is set up, it’s time to install the Simcenter STAR-CCM+ application.  In the prior steps, you uploaded a Simcenter STAR-CCM+ application and a case file to S3 bucket and used that S3 bucket as a source for the Amazon FSx for Lustre /fsx storage. As soon as the cluster created, the installation file and the case file will be available in /fsx

[[email protected] fsx]$ cd /fsx
[[email protected] fsx]$ ls
LeMans_104M.sim  STAR-CCM-14.06.013_01_linux-x86_64.tar.gz  STAR-CCM+14.06.013_01_linux-x86_64.tar.gz

As you can see, Amazon FSx for Lustre has already downloaded the case file from the S3 bucket to the /fsx partition, so now you can start install Simcenter STAR-CCM+ software using the following steps.

  • Install Simcenter STAR-CCM+: install Simcenter STAR-CCM+ on /fsx  – a 1.2TB Lustre filesystem that you configured in a previous step.
cd /fsx
sudo unzip STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
cd STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1
./STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.sh
Select Installation Location : /fsx/Siemens

After following all the standard installation steps from Simcenter STAR-CCM+, the application should be installed at the following location:

/fsx/Siemens/15.02.003/STAR-CCM+15.02.003/

  • Test the installation
[[email protected] fsx]$ /fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+ -version
Simcenter STAR-CCM+ 2020.1 Build 15.02.003 (linux-x86_64-2.12/gnu7.1)

When the above code shows up, you correctly installed Simcenter STAR-CCM+, so now you can run the application.

Running Simcenter STAR-CCM+ on AWS ParallelCluster

Before I move on, let’s recap what you’ve have done so far.

  1. You set up an HPC cluster using AWS ParallelCluster with compute-optimized C5n instances, an Amazon FSx for Lustre filesystem, and EFA-enabled networking.
  2. You installed Simcenter STAR-CCM+ application

Now, let us create an SGE submission script and submit a job to the HPC cluster.

  • Create an SGE job submission script :
cd /fsx/

vi star-ccm.qsub

#!/bin/bash
#$ -N check
#$ -cwd
#$ -j Y
#$ -pe mpi 252

date

your_pod_key="your license key"
ccmp=" /fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+”
case="/fsx/LeMans_104M.sim"

fabric="OFI"
tag="fsx-${fabric}"

${ccmp} \
-power \
-podkey $your_pod_key \
-licpath [email protected] \
-mpi openmpi \
-bs sge \
-benchmark:"-preits 40 -nits 20 -nps ${NSLOTS} -tag ${tag}" \
${case}

date

mkdir ${NSLOTS}new
mv *xml ${NSLOTS}new
  • Submit your Simcenter STAR-CCM+ job to your HPC cluster
qsub -V star-ccm.qsub

Now you have submitted an HPC job that requests 252 cores of c5n.18xlarge.

  • Check the status of your jobs by running
qstat -f

Simcenter STAR-CCM+ result analysis

Here are sample scaling results for the LeMans 104M cell benchmark case. As you can see, the Simcenter STAR-CCM+ result shows exciting performance and scaling results running on AWS. The performance is fast and the simulation scales very well with EFA-based networking and great CPU performance.

Connect to a NICE DCV session

AWS ParallelCluster is now natively integrated with NICE DCV. You can configure a NICE DCV session to visualize your STAR-CCM+ result or connect to the application remotely.

As you will recall from when we configured our AWS ParallelCluster in the previous section, I named the DCV server as hpc-dcv.  To create and connect to a NICE DCV session, just run:

pcluster dcv connect hpc-dcv -k <Key-Name>

After you connect to NICE DCV session, you will be able to access to a Linux desktop to work with STAR-CCM+ application.

Backup SIMCENTER STAR-CCM+ result to S3 bucket

After you finish your STAR-CCM+ simulation, you can backup data in /fsx to your S3 bucket that you created from running your application. You can now use Data Repository Tasks. Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and your S3 bucket. One of the jobs is to export your changed file system contents back to its linked S3 bucket.

*Note: in order to use new Amazon FSx for Lustre feature, you will need to have AWS CLI version 1.16.309 or above.

In this case, I select to export the  STAR-CCM+ application directory Siemens as an example.

  • Exit the HPC head node, and go back to your laptop or Cloud9 environment where you have configured your AWS CLI. Find out your Amazon FSx Lustre ID by running:
aws fsx describe-file-systems
  • After you find the Amazon FSx for Lustre ID, which looks simiar to fs-0d72d520f620d765a, create a backup of the data by running:
create-data-repository-task

aws fsx create-data-repository-task --file-system-id fs-0d72d520f620d765a --type EXPORT_TO_REPOSITORY --paths Siemens,testfsx --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://benchmark-starccm/

Explanation:

–file-system-id: your file system ID

–type EXPORT_TO_REPOSITORY: we will export the data back to the S3 bucket

–paths Siemens,testfsx: the directories you want to export to S3 bucket

Format=REPORT_CSV_20191124:  note this is only name the Amazon FSx Lustre supports. Please keep it the same.

  • Check the status of the backup by running describe-create-data-repository-task
aws fsx describe-data-repository-tasks

As you can see, I can now use FSx for Lustre to install the Simcenter STAR-CCM+ application, and seamlessly move case data between on-premise to Amazon S3 and AWS HPC system. I can create a file system linked to a S3 bucket, create a FSx for Lustre filesystem, and export data back to their S3 bucket after running the CFD app.

Conclusion

Give it a try, set up an HPC environment, and let us know how it goes! If you need more information about running CFD and HPC cases on AWS you can find it on our HPC home page. Please feel free to contact us with questions that you might have.

 

Estimating Amazon EC2 instance needed when migrating ERP from IBM Power

Post Syndicated from Martin Yip original https://aws.amazon.com/blogs/compute/estimating-amazon-ec2-instance-needed-when-migrating-erp-from-ibm-power/

This post courtesy of CK Tan, AWS, Enterprise Migration Architect – APAC

Today there are many enterprise customers who are keen on migrating their mission critical Enterprise Resource Planning (ERP) applications such as Oracle E-Business Suite (Oracle EBS) or SAP running on IBM Power System from on-premises to the AWS cloud platform. They realize running critical ERP applications on AWS will enable their business to be more agile, cost-effective, and secure than on premises. AWS provides cloud native services that streamline your ability to adopt emerging technologies, enabling greater innovation, and faster time-to-value.

However, it is essential to note that both IBM Power System and Amazon Elastic Cloud Compute (EC2) are using different CPU processor architectures. Amazon EC2 instances are running on either x86 or ARM-based processors while IBM Power System is running on IBM Power processors. As a result, there are no direct performance benchmarks mapping the two platforms, or sizing for enterprise customers who intend to migrate their application running on IBM Power System to Amazon EC2.

The purpose of this blog is to share a methodology along with example of how to estimate the sizing of mission critical applications running on IBM Power System to Amazon EC2 running on x86 platform.

Methodology

In order to normalize the CPU performance between IBM Power and Intel x86 processors, we may refer to SAP Application Performance Standard (SAPS) for performance benchmark purposes. SAPS is a hardware-independent unit of measurement that describes the performance of a system configuration in the SAP environment. It is derived from the Sales and Distribution (SD) two-tier benchmark, where 100 SAPS is defined as 2,000 fully business processed order line items per hour.

In technical terms, this throughput is achieved by processing 6,000 dialog steps (screen changes), 2,000 postings per hour in the SD Benchmark, or 2,400 SAP transactions.

In the SD benchmark, fully business processed means the full business process of an order line item: creating the order, creating a delivery note for the order, displaying the order, changing the delivery, posting a goods issue, listing orders, and creating an invoice.

In this methodology, we compare the performance sizing by using SAPS which is not released by other ERP software such as Oracle EBS. However, we believe that this methodology will be most pertinent of indirect performance benchmark from an ERP software perspective as most of the enterprise businesses have similar process of ordering.

Discussion by Example

We will walk you through a real example to demonstrate how you can translate and migrate an SAP ERP 6.0 or Oracle EBS application database that is running on IBM Power 795 (IBM P795) with IBM AIX from on-premises to AWS cloud environment by choosing the right size of Amazon EC2 instances.

Step 1: IBM P795 System Performance Analysis

Users can leverage IBM nmon Analyzer which is a free tool to produce AIX performance analysis reports running on IBM Power System. The tool is designed to work with the latest version of nmon, but it is also tested with older versions for backwards compatibility. The tool is updated whenever nmon is updated, and at irregular intervals for new function. It allows users to:

  • View the data in spreadsheet form
  • Eliminate “bad” data
  • Automatically produce graphs for presentation to clients

Below is the performance analysis being captured for CPU, memory, disk throughput, disk Input / Output (I/O), and network I/O during the system’s database peak demand at month end closing – using IBM nmon Analyzer.

Figure 1: Actual Number of Pyhsical Core of CPU Utilization Analysis on IBM P795

 

Figure 2: Memory Usage Analysis on IBM P795

 

Figure 3: Total Disk Throughput Analysis on IBM P795

 

Figure 4: Total Disk I/O on IBM P795

 

Figure 5: Network I/O Analysis on IBM P795

Step 2: CPU Normalization

Users can find the SAPS results for different hardware manufacturers from this link. Table 1 below shows how to calculate the processing power for single CPU core with respect to different hardware manufacturers. We have chosen Amazon EC2 instances r5.metal and r4.16xlarge as potential candidates because they provide memory sizing which best fits the current demand of 408 GB – refer to Figure 2.

Instance TypeSAPS (2- Tier)Memory (GB)Total CPU CoresSAPS/Core
IBM P795688,6304,0962562,690
EC2 r5.metal140,000768482,917
EC2 r4.16xlarge76,400512362,122

Table 1: Calculate the SAPS Processing Power for Single Core of Physical CPU

In order to calculate the Normalization Factor for any targeted Amazon EC2 instance with respect to source system of IBM Power system, we will apply Equation 1 as shown below:

Normalization Factor = [SAPS Per Core] Amazon EC2 / [SAPS Per Core] IBM POWER System (1)

Therefore, by inserting the SAPS/Core being calculated from Table 1 into Equation (1):

  • The normalization factor for 1 core of Amazon EC2 of r5.metal to IBM P795 is equivalent to 1.08
  • The normalization factor for 1 core of Amazon EC2 of r4.16xlarge to IBM P795 is equivalent to 0.79

Table 2 below shows how to calculate normalized total number of CPU cores being allocated to each system. We need to make sure that the normalized total number of CPU cores from selected Amazon EC2 instance must be greater than or equal to the assigned entitlement capacity of IBM P795 in this case.

 

Instance TypeNormalization Factor (A)Total # CPU Cores (B)Normalized Total # CPU Core (A X B)
IBM P7951.003232
EC2 r5.metal1.084852
EC2 x1.32xlarge0.793628

Table 2: Normalized Total Number of CPU Core

Figure 1 indicates that the peak to meet the total processing power of database ERP running on IBM P795 is 29 cores of physical CPU as compared to assigned entitlement capacity of 32. As a result, the Amazon EC2 instance, r5.metal, should be the right candidate which will deliver sufficient processing power with enough buffer for future business growth in next 12 to 24 months.

Step 3: Amazon EC2 Right Sizing

Next, we need to check and ensure that other specifications such as network I/O, disk throughput and disk I/O will meet the current peak demand of the application. Table 3 shows the different specifications of r5.metal . It is clear from Table 3 that r5.metal meets all the demand requirements of the database ERP which is running on IBM P795.

SpecificationsAmazon EC2 Instance Type of r5.metal
Physical Core of CPU48 > 29 (current CPU utilization)
RAM (GiB)768 > 408 (current memory usage)
Network Perf (GBps)3 > 0.3 (current network I/O throughput)
Dedicated EBS BW (GBps)~ 2.4 > 2.0 (current disk throughput)
Maximum IOPS (16 KiB I/O)80,000 > 25,000 (current disk IO consumption)

Table 3: EC2 r5.metal Specifications

Conclusion

The methodology and example discussed in this blog gives a pragmatic estimation for optimal performance by selecting the right type and size of Amazon EC2 instances when you plan to migrate from IBM Power System, which is running mission critical ERP application, from on-premises datacenter to AWS cloud. This methodology has been applied to many enterprise cloud transformation projects and delivered more predictable performance with significant TCO savings. Additionally, this methodology can be adopted for capacity planning and helps enterprises establish strong business justifications for heterogeneous platform migration.