All posts by Emma White

Supporting AWS Graviton2 and x86 instance types in the same Auto Scaling group

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/supporting-aws-graviton2-and-x86-instance-types-in-the-same-auto-scaling-group/

This post is written by Tyler Lynch, Sr. Solutions Architect – EdTech, and Praneeth Tekula, Technical Account Manager.

As customers seek performance improvements and ways to cost optimize their workloads, they are evaluating and adopting AWS Graviton2 based instances. This post provides instructions on how to configure your Amazon EC2 Auto Scaling group (ASG) to use both Graviton2 and x86 based Amazon EC2 instances in the same Auto Scaling group with different AMIs. This allows you to introduce Graviton2 based instances as part of a multiple instance type strategy.

For example, a customer may want to use the same Auto Scaling group definition across multiple Regions, but an instance type might not be available in a given Region yet. Implementing instance and architecture diversity allows those Auto Scaling group definitions to be portable.

Solution Overview

The Amazon EC2 Auto Scaling console currently doesn’t support the selection of multiple launch templates, so I use the AWS Command Line Interface (AWS CLI) throughout this post. First, you create your launch templates that specify AMIs for use on x86 and arm64 based instances. Then you create your Auto Scaling group using a mixed instance policy with instance level overrides to specify the launch template to use for that instance.

Finally, you extend the launch templates to use architecture-specific EC2 user data to download architecture-specific binaries. Putting it all together, here are the high-level steps to follow:

  1. Create the launch templates:
    1. Launch template for x86 – Creates a launch template for x86 instances, specifying the AMI but not the instance sizes.
    2. Launch template for arm64 – Creates a launch template for arm64 instances, specifying the AMI but not the instance sizes.
  2. Create the Auto Scaling group that references the launch templates in a mixed instance policy override.
  3. Create a sample Node.js application.
  4. Create the architecture-specific user data scripts.
  5. Modify the launch templates to use architecture-specific user data scripts.

Prerequisites

The prerequisites for this solution are as follows:

  • The AWS CLI installed locally. I use AWS CLI version 2 for this post.
    • For AWS CLI v2, you must use 2.1.3+
    • For AWS CLI v1, you must use 1.18.182+
  • The correct AWS Identity and Access Management (IAM) role permissions for your account, allowing for the creation and use of launch templates and Auto Scaling groups, and for launching EC2 instances.
  • A source control service such as AWS CodeCommit or GitHub that your user data script can interact with to git clone the Hello World Node.js application.
  • The source code repository initialized and cloned locally.

Create the Launch Templates

You start by creating the launch template for x86 instances, and then the launch template for arm64 instances. These are simple launch templates where you only specify the AMI for Amazon Linux 2 in US-EAST-1 (architecture dependent). You use the AWS CLI cli-input-json feature to make things more readable and repeatable.
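
If you want to review every field that the create-launch-template input supports before writing your own JSON, you can optionally generate an input skeleton first. This step is not required for the walkthrough, and the output file name is just an example.

aws ec2 create-launch-template \
            --generate-cli-skeleton > lt-skeleton.json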

You first must add the lt-x86-cli-input.json file to your local working directory for reference by the AWS CLI.

  1. In your preferred text editor, add a new file, and copy and paste the following JSON into the file.

{
    "LaunchTemplateName": "lt-x86",
    "VersionDescription": "LaunchTemplate for x86 instance types using Amazon Linux 2 x86 AMI in US-EAST-1",
    "LaunchTemplateData": {
        "ImageId": "ami-04bf6dcdc9ab498ca"
    }
}
  2. Save the file in your local working directory and name it lt-x86-cli-input.json.

Now, add the lt-arm64-cli-input.json file into your local working directory.

  1. In a text editor, add a new file, and copy and paste the following JSON into the file.

{
    "LaunchTemplateName": "lt-arm64",
    "VersionDescription": "LaunchTemplate for Graviton2 instance types using Amazon Linux 2 Arm64 AMI in US-EAST-1",
    "LaunchTemplateData": {
        "ImageId": "ami-09e7aedfda734b173"
    }
}
  2. Save the file in your local working directory and name it lt-arm64-cli-input.json.

Now that your CLI input files are ready, create your launch templates using the CLI.

From your terminal, run the following commands:


aws ec2 create-launch-template \
            --cli-input-json file://./lt-x86-cli-input.json \
            --region us-east-1

aws ec2 create-launch-template \
            --cli-input-json file://./lt-arm64-cli-input.json \
            --region us-east-1

After you run each command, you should see the command output similar to this:


{
	"LaunchTemplate": {
		"LaunchTemplateId": "lt-07ab8c76f8e021b0c",
		"LaunchTemplateName": "lt-x86",
		"CreateTime": "2020-11-20T16:08:08+00:00",
		"CreatedBy": "arn:aws:sts::111111111111:assumed-role/Admin/myusername",
		"DefaultVersionNumber": 1,
		"LatestVersionNumber": 1
	}
}

{
	"LaunchTemplate": {
		"LaunchTemplateId": "lt-0c65656a2c75c0f76",
		"LaunchTemplateName": "lt-arm64",
		"CreateTime": "2020-11-20T16:08:37+00:00",
		"CreatedBy": "arn:aws:sts::111111111111:assumed-role/Admin/myusername",
		"DefaultVersionNumber": 1,
		"LatestVersionNumber": 1
	}
}

Create the Auto Scaling Group

Moving on to your Auto Scaling group, start by creating another JSON file to use with the cli-input-json feature. Then, create the Auto Scaling group via the CLI.

I want to call special attention to the LaunchTemplateSpecification under the MixedInstancesPolicy Overrides property. This Auto Scaling group is created with a default launch template, the one you created for arm64 based instances. You override that at the instance level for x86 instances.

Now, add the asg-mixed-arch-cli-input.json file into your local working directory.

  1. In a text editor, add a new file, and copy and paste the following JSON into the file.
  2. You need to change the subnet IDs specified in the VPCZoneIdentifier to your own subnet IDs.

{
    "AutoScalingGroupName": "asg-mixed-arch",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "lt-arm64",
                "Version": "$Default"
            },
            "Overrides": [
                {
                    "InstanceType": "t4g.micro"
                },
                {
                    "InstanceType": "t3.micro",
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateName": "lt-x86",
                        "Version": "$Default"
                    }
                },
                {
                    "InstanceType": "t3a.micro",
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateName": "lt-x86",
                        "Version": "$Default"
                    }
                }
            ]
        }
    },    
    "MinSize": 1,
    "MaxSize": 5,
    "DesiredCapacity": 3,
    "VPCZoneIdentifier": "subnet-e92485b6, subnet-07fe637b44fd23c31, subnet-828622e4, subnet-9bd6a2d6"
}
  3. Save the file in your local working directory and name it asg-mixed-arch-cli-input.json.

Now that your CLI input file is ready, create your Auto Scaling group using the CLI.

  1. From your terminal, run the following command:

aws autoscaling create-auto-scaling-group \
            --cli-input-json file://./asg-mixed-arch-cli-input.json \
            --region us-east-1

After you run the command, there isn’t any immediate output. Describe the Auto Scaling group to review the configuration.

  1. From your terminal, run the following command:

aws autoscaling describe-auto-scaling-groups \
            --auto-scaling-group-names asg-mixed-arch \
            --region us-east-1

Let’s evaluate the output. I removed some of the output for brevity. It shows that you have an Auto Scaling group with a mixed instances policy, which specifies a default launch template named lt-arm64. In the Overrides property, you can see the instance types that you specified and the values that define the lt-x86 launch template to be used for specific instance types (t3.micro, t3a.micro).


{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "asg-mixed-arch",
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1:111111111111:autoScalingGroup:a1a1a1a1-a1a1-a1a1-a1a1-a1a1a1a1a1a1:autoScalingGroupName/asg-mixed-arch",
            "MixedInstancesPolicy": {
                "LaunchTemplate": {
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateId": "lt-0cc7dae79a397d663",
                        "LaunchTemplateName": "lt-arm64",
                        "Version": "$Default"
                    },
                    "Overrides": [
                        {
                            "InstanceType": "t4g.micro"
                        },
                        {
                            "InstanceType": "t3.micro",
                            "LaunchTemplateSpecification": {
                                "LaunchTemplateId": "lt-04b525bfbde0dcebb",
                                "LaunchTemplateName": "lt-x86",
                                "Version": "$Default"
                            }
                        },
                        {
                            "InstanceType": "t3a.micro",
                            "LaunchTemplateSpecification": {
                                "LaunchTemplateId": "lt-04b525bfbde0dcebb",
                                "LaunchTemplateName": "lt-x86",
                                "Version": "$Default"
                            }
                        }
                    ]
                },
                ...
            },
            ...
            "Instances": [
                {
                    "InstanceId": "i-00377a23630a5e107",
                    "InstanceType": "t4g.micro",
                    "AvailabilityZone": "us-east-1b",
                    "LifecycleState": "InService",
                    "HealthStatus": "Healthy",
                    "LaunchTemplate": {
                        "LaunchTemplateId": "lt-0cc7dae79a397d663",
                        "LaunchTemplateName": "lt-arm64",
                        "Version": "1"
                    },
                    "ProtectedFromScaleIn": false
                },
                {
                    "InstanceId": "i-07c2d4f875f1f457e",
                    "InstanceType": "t4g.micro",
                    "AvailabilityZone": "us-east-1a",
                    "LifecycleState": "InService",
                    "HealthStatus": "Healthy",
                    "LaunchTemplate": {
                        "LaunchTemplateId": "lt-0cc7dae79a397d663",
                        "LaunchTemplateName": "lt-arm64",
                        "Version": "1"
                    },
                    "ProtectedFromScaleIn": false
                },
                {
                    "InstanceId": "i-09e61e95cdf705ade",
                    "InstanceType": "t4g.micro",
                    "AvailabilityZone": "us-east-1c",
                    "LifecycleState": "InService",
                    "HealthStatus": "Healthy",
                    "LaunchTemplate": {
                        "LaunchTemplateId": "lt-0cc7dae79a397d663",
                        "LaunchTemplateName": "lt-arm64",
                        "Version": "1"
                    },
                    "ProtectedFromScaleIn": false
                }
            ],
            ...
        }
    ]
}

Create Hello World Node.js App

Now that you have created the launch templates and the Auto Scaling group, you are ready to create the “hello world” application that self-reports the processor architecture. You work in the local directory that is cloned from your source repository as specified in the prerequisites. This doesn’t have to be the local working directory where you are creating architecture-specific files.

  1. In a text editor, add a new file with the following Node.js code:

// Hello World sample app.
const http = require('http');

const port = 3000;

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end(`Hello World. This processor architecture is ${process.arch}`);
});

server.listen(port, () => {
  console.log(`Server running on processor architecture ${process.arch}`);
});
  2. Save the file in the root of your source repository and name it app.js.
  3. Commit the changes to Git and push the changes to your source repository. See the following commands:

git add .
git commit -m "Adding Node.js sample application."
git push

Create user data scripts

Moving on, you create the architecture-specific user data scripts. Each script defines the Node.js version and the distribution that matches the processor architecture, downloads and extracts the binary, and adds the binary path to the environment PATH. It then clones the Hello World app and runs it with the Node.js binary that was installed.

Now, you must add the ud-x86-cli-input.txt file to your local working directory.

  1. In your text editor, add a new file, and copy and paste the following text into the file.
  2. Update the git clone command to use the repo URL where you created the Hello World app previously.
  3. Update the cd command to use the repo name.

#!/bin/bash
sudo yum update -y
sudo yum install git -y
VERSION=v14.15.3
DISTRO=linux-x64
wget https://nodejs.org/dist/$VERSION/node-$VERSION-$DISTRO.tar.xz
sudo mkdir -p /usr/local/lib/nodejs
sudo tar -xJvf node-$VERSION-$DISTRO.tar.xz -C /usr/local/lib/nodejs 
export PATH=/usr/local/lib/nodejs/node-$VERSION-$DISTRO/bin:$PATH
git clone https://github.com/<<githubuser>>/<<repo>>.git
cd <<repo>>
node app.js
  4. Save the file in your local working directory and name it ud-x86-cli-input.txt.

Now, add the ud-arm64-cli-input.txt file into your local working directory.

  1. In a text editor, add a new file, and copy and paste the following text into the file.
  2. Update the git clone command to use the repo URL where you created the Hello World app previously.
  3. Update the cd command to use the repo name.

#!/bin/bash
sudo yum update -y
sudo yum install git -y
VERSION=v14.15.3
DISTRO=linux-arm64
wget https://nodejs.org/dist/$VERSION/node-$VERSION-$DISTRO.tar.xz
sudo mkdir -p /usr/local/lib/nodejs
sudo tar -xJvf node-$VERSION-$DISTRO.tar.xz -C /usr/local/lib/nodejs 
export PATH=/usr/local/lib/nodejs/node-$VERSION-$DISTRO/bin:$PATH
git clone https://github.com/<<githubuser>>/<<repo>>.git
cd <<repo>>
node app.js
  4. Save the file in your local working directory and name it ud-arm64-cli-input.txt.

Now that your user data scripts are ready, you need to base64 encode them as the AWS CLI does not perform base64-encoding of the user data for you.

  • On a Linux computer, from your terminal use the base64 command to encode the user data scripts.

base64 ud-x86-cli-input.txt > ud-x86-cli-input-base64.txt
base64 ud-arm64-cli-input.txt > ud-arm64-cli-input-base64.txt
  • On a Windows computer, from your command line use the certutil command to encode the user data. Before you can use these files with the AWS CLI, you must remove the first (BEGIN CERTIFICATE) and last (END CERTIFICATE) lines from each.

certutil -encode ud-x86-cli-input.txt ud-x86-cli-input-base64.txt
certutil -encode ud-arm64-cli-input.txt ud-arm64-cli-input-base64.txt
notepad ud-x86-cli-input-base64.txt
notepad ud-arm64-cli-input-base64.txt

Modify the Launch Templates

Now, you modify the launch templates to use architecture-specific user data scripts.

Please note that the contents of your ud-x86-cli-input-base64.txt and ud-arm64-cli-input-base64.txt files are different from the samples here because you referenced your own GitHub repository. The base64 encoded user data scripts below will not work as is because they contain placeholder references for the git clone and cd commands.

Next, update the lt-x86-cli-input.json file to include your base64 encoded user data script for x86 based instances.

  1. In your preferred text editor, open the ud-x86-cli-input-base64.txt file.
  2. Open the lt-x86-cli-input.json file, and add in the text from the ud-x86-cli-input-base64.txt file into the UserData property of the LaunchTemplateData object. It should look similar to this:

{
    "LaunchTemplateName": "lt-x86",
    "VersionDescription": "LaunchTemplate for x86 instance types using Amazon Linux 2 x86 AMI in US-EAST-1",
    "LaunchTemplateData": {
        "ImageId": "ami-04bf6dcdc9ab498ca",
        "UserData": "IyEvYmluL2Jhc2gKeXVtIHVwZGF0ZSAteQoKVkVSU0lPTj12MTQuMTUuMwpESVNUUk89bGludXgteDY0CndnZXQgaHR0cHM6Ly9ub2RlanMub3JnL2Rpc3QvJFZFUlNJT04vbm9kZS0kVkVSU0lPTi0kRElTVFJPLnRhci54egpzdWRvIG1rZGlyIC1wIC91c3IvbG9jYWwvbGliL25vZGVqcwpzdWRvIHRhciAteEp2ZiBub2RlLSRWRVJTSU9OLSRESVNUUk8udGFyLnh6IC1DIC91c3IvbG9jYWwvbGliL25vZGVqcyAKZXhwb3J0IFBBVEg9L3Vzci9sb2NhbC9saWIvbm9kZWpzL25vZGUtJFZFUlNJT04tJERJU1RSTy9iaW46JFBBVEgKZ2l0IGNsb25lIGh0dHBzOi8vZ2l0aHViLmNvbS88PGdpdGh1YnVzZXI+Pi88PHJlcG8+Pi5naXQKY2QgPDxyZXBvPj4Kbm9kZSBhcHAuanMK"
    }
}
  3. Save the file.

Next, update the lt-arm64-cli-input.json file to include your base64 encoded user data script for arm64 based instances.

  1. In your text editor, open the ud-arm64-cli-input-base64.txt file.
  2. Open the lt-arm64-cli-input.json file, and add in the text from the ud-arm64-cli-input-base64.txt file into the UserData property of the LaunchTemplateData object. It should look similar to this:

{
    "LaunchTemplateName": "lt-arm64",
    "VersionDescription": "LaunchTemplate for Graviton2 instance types using Amazon Linux 2 Arm64 AMI in US-EAST-1",
    "LaunchTemplateData": {
        "ImageId": "ami-09e7aedfda734b173",
        "UserData": "IyEvYmluL2Jhc2gKeXVtIHVwZGF0ZSAteQoKVkVSU0lPTj12MTQuMTUuMwpESVNUUk89bGludXgtYXJtNjQKd2dldCBodHRwczovL25vZGVqcy5vcmcvZGlzdC8kVkVSU0lPTi9ub2RlLSRWRVJTSU9OLSRESVNUUk8udGFyLnh6CnN1ZG8gbWtkaXIgLXAgL3Vzci9sb2NhbC9saWIvbm9kZWpzCnN1ZG8gdGFyIC14SnZmIG5vZGUtJFZFUlNJT04tJERJU1RSTy50YXIueHogLUMgL3Vzci9sb2NhbC9saWIvbm9kZWpzIApleHBvcnQgUEFUSD0vdXNyL2xvY2FsL2xpYi9ub2RlanMvbm9kZS0kVkVSU0lPTi0kRElTVFJPL2JpbjokUEFUSApnaXQgY2xvbmUgaHR0cHM6Ly9naXRodWIuY29tLzw8Z2l0aHVidXNlcj4+Lzw8cmVwbz4+LmdpdApjZCA8PHJlcG8+Pgpub2RlIGFwcC5qcwoKCg=="
    }
}
  3. Save the file.

Now, your CLI input files are ready. Next, create a new version of your launch templates and then set the newest version as the default.

From your terminal, run the following commands:


aws ec2 create-launch-template-version \
            --cli-input-json file://./lt-x86-cli-input.json \
            --region us-east-1

aws ec2 create-launch-template-version \
            --cli-input-json file://./lt-arm64-cli-input.json \
            --region us-east-1

aws ec2 modify-launch-template \
            --launch-template-name lt-x86 \
            --default-version 2
			
aws ec2 modify-launch-template \
            --launch-template-name lt-arm64 \
            --default-version 2

After you run each command, you should see the command output similar to this:


{
    "LaunchTemplate": {
        "LaunchTemplateId": "lt-08ff3d03d4cf0038d",
        "LaunchTemplateName": "lt-x86",
        "CreateTime": "1970-01-01T00:00:00+00:00",
        "CreatedBy": "arn:aws:sts::111111111111:assumed-role/Admin/myusername",
        "DefaultVersionNumber": 2,
        "LatestVersionNumber": 2
    }
}

{
    "LaunchTemplate": {
        "LaunchTemplateId": "lt-0c5e1eb862a02f8e0",
        "LaunchTemplateName": "lt-arm64",
        "CreateTime": "1970-01-01T00:00:00+00:00",
        "CreatedBy": "arn:aws:sts::111111111111:assumed-role/Admin/myusername",
        "DefaultVersionNumber": 2,
        "LatestVersionNumber": 2
    }
}

Now, refresh the instances in the Auto Scaling group so that the newest version of the launch template is used.

From your terminal, run the following command:


aws autoscaling start-instance-refresh \
            --auto-scaling-group-name asg-mixed-arch
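
The refresh runs in the background and can take several minutes to cycle through the instances. Optionally, you can track its progress with the describe-instance-refreshes command:

aws autoscaling describe-instance-refreshes \
            --auto-scaling-group-name asg-mixed-arch \
            --region us-east-1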

Verify Instances

The sample Node.js application self-reports the processor architecture in two ways: when the application is started, and when the application receives an HTTP request on port 3000. Retrieve the last five lines of the instance console output via the AWS CLI.

First, you need to get an instance ID from the Auto Scaling group.

  1. From your terminal, run the following commands:

aws autoscaling describe-auto-scaling-groups \
            --auto-scaling-group-names asg-mixed-arch \
            --region us-east-1
  2. Evaluate the output. I removed some of the output for brevity. You need the InstanceId from the output.

{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "asg-mixed-arch",
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1:111111111111:autoScalingGroup:a1a1a1a1-a1a1-a1a1-a1a1-a1a1a1a1a1a1:autoScalingGroupName/asg-mixed-arch",
            "MixedInstancesPolicy": {
                ...
            },
            ...
            "Instances": [
                {
                    "InstanceId": "i-0eeadb140405cc09b",
                    "InstanceType": "t4g.micro",
                    "AvailabilityZone": "us-east-1a",
                    "LifecycleState": "InService",
                    "HealthStatus": "Healthy",
                    "LaunchTemplate": {
                        "LaunchTemplateId": "lt-0c5e1eb862a02f8e0",
                        "LaunchTemplateName": "lt-arm64",
                        "Version": "2"
                    },
                    "ProtectedFromScaleIn": false
                }
            ],
          ....
        }
    ]
}

Now, retrieve the last five lines of console output from the instance.

From your terminal, run the following command:


aws ec2 get-console-output --instance-id i-0eeadb140405cc09b \
            --output text | tail -n 5

Evaluate the output; you should see Server running on processor architecture arm64. This confirms that you have successfully used an architecture-specific user data script.


[  58.798184] cloud-init[1257]: node-v14.15.3-linux-arm64/share/systemtap/tapset/node.stp
[  58.798293] cloud-init[1257]: node-v14.15.3-linux-arm64/LICENSE
[  58.798402] cloud-init[1257]: Cloning into 'node-helloworld'...
[  58.798510] cloud-init[1257]: Server running on processor architecture arm64
2021-01-14T21:14:32+00:00
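
You can also verify the application over HTTP. If the instance's security group allows inbound traffic on port 3000 from your location, replace the placeholder below with the instance's public or private IP address and run:

curl http://<<instance-ip>>:3000/

The response should read "Hello World. This processor architecture is arm64" (or x64 on an x86 based instance).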

Cleaning Up

Delete the Auto Scaling group and use the force-delete option. The force-delete option specifies that the group is to be deleted along with all instances associated with the group, without waiting for all instances to be terminated.


aws autoscaling delete-auto-scaling-group \
            --auto-scaling-group-name asg-mixed-arch --force-delete \
            --region us-east-1

Now, delete your launch templates.


aws ec2 delete-launch-template --launch-template-name lt-x86
aws ec2 delete-launch-template --launch-template-name lt-arm64

Conclusion

You walked through creating launch templates and an Auto Scaling group that combine x86 and Graviton2 instance types, and using architecture-specific user data scripts. This same method could be applied to fleets where you have different configurations needed for different instance types. Variability such as disk sizes, networking configurations, placement groups, and tagging can now be accomplished in the same Auto Scaling group.

Running cost optimized Spark workloads on Kubernetes using EC2 Spot Instances

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/

This post is written by Kinnar Sen, Senior Solutions Architect, EC2 Spot 

Apache Spark is an open-source, distributed processing system used for big data workloads. It provides API operations to perform multiple tasks such as streaming, extract transform load (ETL), query, machine learning (ML), and graph processing. Spark supports four different types of cluster managers (Spark standalone, Apache Mesos, Hadoop YARN, and Kubernetes), which are responsible for scheduling and allocation of resources in the cluster. Spark has offered native Kubernetes support since 2018 (Spark 2.3). AWS customers that have already chosen Kubernetes as their container orchestration tool can also choose to run Spark applications in Kubernetes, increasing the effectiveness of their operations and compute resources.

In this post, I illustrate the deployment of a scalable, resilient, and cost optimized Spark application on Kubernetes via Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EC2 Spot Instances. Learn how to save money on big data workloads by implementing this solution.

Overview

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. A capacity pool is a group of EC2 instances that belong to a particular instance family, size, and Availability Zone (AZ). If EC2 needs capacity back for On-Demand Instance usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance. This can be automated via the application and/or infrastructure deployments. Spot Instances are ideal for stateless, fault tolerant, loosely coupled and flexible workloads that can handle interruptions.

Amazon Elastic Kubernetes Service

Amazon EKS is a fully managed Kubernetes service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane. It provides a highly available and scalable managed control plane. It also provides managed worker nodes, which let you create, update, or terminate worker nodes for your cluster with a single command. It is a great choice for deploying flexible and fault tolerant containerized applications. Amazon EKS supports creating and managing Amazon EC2 Spot Instances using Amazon EKS-managed node groups following Spot best practices. This enables you to take advantage of the steep savings and scale that Spot Instances provide for interruptible workloads running in your Kubernetes cluster. Using EKS-managed node groups with Spot Instances requires less operational effort compared to using self-managed nodes. In addition to launching Spot Instances in managed node groups, it is possible to specify multiple instance types in EKS managed node groups. You can find more details in this blog.

Apache Spark and Kubernetes

When a Spark application is submitted to the Kubernetes cluster, the following happens:

  • A Spark driver is created.
  • The driver runs within a pod.
  • The Spark driver then requests executors, which are scheduled to run within pods. The executors are managed by the driver.
  • The application is launched and once it completes, the executor pods are cleaned up. The driver pod persists the logs and remains in a completed state until the pod is cleared by garbage collection or manually removed. The driver in a completed state does not consume any memory or compute resources.

Spark Deployment on Kubernetes Cluster

When a Spark application runs on clusters managed by Kubernetes, the native Kubernetes scheduler is used. It is possible to schedule the driver/executor pods on a subset of available nodes. The applications can be launched either by a vanilla ‘spark-submit’, a workflow orchestrator like Apache Airflow, or the Spark operator. I use vanilla ‘spark-submit’ in this blog. Amazon EMR on EKS is also able to schedule Spark applications on EKS clusters as described in this launch blog, but Amazon EMR on EKS is out of scope for this post.

Cost optimization

For any organization running big data workloads, there are three key requirements: scalability, performance, and low cost. As the size of data increases, there is demand for more compute capacity and the total cost of ownership increases. It is critical to optimize the cost of big data applications. Big data frameworks (in this case, Spark) are distributed to manage and process high volumes of data. These frameworks are designed for failure, can run on machines with different configurations, and are inherently resilient and flexible.

When Spark is deployed on Kubernetes, the executor pods can be scheduled on EC2 Spot Instances and driver pods on On-Demand Instances. This reduces the overall cost of deployment, because Spot Instances can save up to 90% over On-Demand Instance prices. This also enables faster results by scaling out executors running on Spot Instances. Spot Instances, by design, can be interrupted when EC2 needs the capacity back. If a driver pod is running on a Spot Instance that is interrupted, the application fails and must be re-submitted. To avoid this situation, the driver pod can be scheduled on On-Demand Instances only. This adds a layer of resiliency to the Spark application running on Kubernetes. To cost optimize the deployment, all the executor pods are scheduled on Spot Instances, as that’s where the bulk of compute happens. Spark’s inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions.

There are a couple of key points to note here.

  • The idea is to start with the minimum number of nodes for both On-Demand and Spot Instances (one each) and then auto-scale using Cluster Autoscaler and EC2 Auto Scaling. Cluster Autoscaler for AWS provides integration with Auto Scaling groups. If there are not sufficient resources, the driver and executor pods go into pending state. The Cluster Autoscaler detects pods in pending state and scales worker nodes within the identified Auto Scaling group in the cluster using EC2 Auto Scaling.
  • The scaling for On-Demand and Spot nodes is exclusive of one another. So, if multiple applications are launched the driver and executor pods can be scheduled in different node groups independently per the resource requirements. This helps reduce job failures due to lack of resources for the driver, thus adding to the overall resiliency of the system.
  • Using EKS Managed node groups
    • This requires significantly less operational effort compared to using self-managed node groups and enables:
      • Automatic enforcement of Spot best practices like the capacity-optimized allocation strategy, Capacity Rebalancing, and the use of multiple instance types.
      • Proactive replacement of Spot nodes using rebalance notifications.
      • Managed draining of Spot nodes via rebalance recommendations.
    • The nodes are auto-labeled so that the pods can be scheduled with NodeAffinity.
      • eks.amazonaws.com/capacityType: SPOT
      • eks.amazonaws.com/capacityType: ON_DEMAND

Now that you understand the products and best practices used in this tutorial, let’s get started.

Tutorial: running Spark in EKS managed node groups with Spot Instances

In this tutorial, I review the steps that help you launch cost optimized and resilient Spark jobs inside Kubernetes clusters running on Amazon EKS. I launch a word-count application that counts the words from an Amazon Customer Reviews dataset and writes the output to an Amazon S3 folder. To run the Spark workload on Kubernetes, make sure you have eksctl and kubectl installed on your computer or on an AWS Cloud9 environment. You can run this by using an AWS IAM user or role that has the AdministratorAccess policy attached to it, or check the minimum required permissions for using eksctl. The Spot node groups in the Amazon EKS cluster can be launched either in a managed or a self-managed way; in this post, I use the former. The config files for this tutorial can be found here. The job is finally launched in cluster mode.

Create Amazon S3 Access Policy

First, I must create an Amazon S3 access policy to allow the Spark application to read/write from Amazon S3. Amazon S3 Access is provisioned by attaching the policy by ARN to the node groups. This associates Amazon S3 access to the NodeInstanceRole and, hence, the node groups then have access to Amazon S3. Download the Amazon S3 policy file from here and modify the <<output folder>> to an Amazon S3 bucket you created. Run the following to create the policy. Note the ARN.

aws iam create-policy --policy-name spark-s3-policy --policy-document file://spark-s3.json
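
The linked policy file is the source of truth. As a rough sketch, a minimal policy of this kind typically grants read and write access to your output bucket, similar to the following (the bucket name is a placeholder):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<<output folder>>",
                "arn:aws:s3:::<<output folder>>/*"
            ]
        }
    ]
}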

Cluster and node groups deployment

Create an EKS cluster using the following command:

eksctl create cluster --name=sparkonk8 --node-private-networking --without-nodegroup --asg-access --region=<<AWS Region>>

The cluster takes approximately 15 minutes to launch.

Create the nodegroup using the nodeGroup config file. Replace the <<Policy ARN>> string using the ARN string from the previous step.

eksctl create nodegroup -f managedNodeGroups.yml

Scheduling driver/executor pods

The driver and executor pods can be assigned to nodes using affinity. Pod templates can be used to configure details that are not supported by the Spark launch configuration by default. This feature is available from Spark 3.0.0. Here, requiredDuringScheduling node affinity is used to schedule the driver and executor pods. Sample podTemplates have been uploaded here.
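
The linked samples are the reference. As an illustration only, an executor pod template that pins executors to Spot nodes might look roughly like the following, using the capacityType label that EKS managed node groups apply automatically (a driver template would use ON_DEMAND instead):

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: eks.amazonaws.com/capacityType
            operator: In
            values:
            - SPOT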

Launching a Spark application

Create a service account. The Spark driver pod uses the service account to create and watch executor pods using the Kubernetes API server.

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole='edit'  --serviceaccount=default:spark --namespace=default

Download the Cluster Autoscaler manifest and edit it to add your cluster name.

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Install the Cluster AutoScaler using the following command:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Get the details of the Kubernetes control plane to retrieve the master URL.

kubectl cluster-info 

command output

Use the following instructions to build the docker image.

Download the application file (script.py) from here and upload it into the Amazon S3 bucket you created.

Download the pod template files from here. Submit the application.

bin/spark-submit \
--master k8s://<<MASTER URL>> \
--deploy-mode cluster \
--name 'Job Name' \
--conf spark.eventLog.dir=s3a://<<S3 BUCKET>>/logs \
--conf spark.eventLog.enabled=true \
--conf spark.history.fs.inProgressOptimization.enabled=true \
--conf spark.history.fs.update.interval=5s \
--conf spark.kubernetes.container.image=<<ECR Spark Docker Image>> \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.podTemplateFile='../driver_pod_template.yml' \
--conf spark.kubernetes.executor.podTemplateFile='../executor_pod_template.yml' \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.33 \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=30 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.driver.memory=8g \
--conf spark.kubernetes.driver.request.cores=2 \
--conf spark.kubernetes.driver.limit.cores=4 \
--conf spark.executor.memory=8g \
--conf spark.kubernetes.executor.request.cores=2 \
--conf spark.kubernetes.executor.limit.cores=4 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.fast.upload=true \
s3a://<<S3 BUCKET>>/script.py \
s3a://<<S3 BUCKET>>/output 

A couple of key points to note here:

  • podTemplateFile is used here, which enables scheduling of the driver pods to On-Demand Instances and executor pods to Spot Instances.
  • Spark provides a mechanism to allocate resources dynamically based on workloads. In the latest release of Spark (3.0.0), dynamicAllocation can be used with the Kubernetes cluster manager. Executors that no longer store active shuffle files can be removed to free up resources. Dynamic allocation works well in tandem with Cluster Autoscaler for resource allocation and optimizes resources for jobs. We are using dynamicAllocation here to enable optimized resource sharing.
  • The application file and output are both in Amazon S3.

Output Files in S3

  • Spark event logs are redirected to Amazon S3. Spark on Kubernetes creates local temporary files for logs and removes them once the application completes. The logs are redirected to Amazon S3, and Spark History Server can be used to analyze them. Note that you can add more instrumentation using tools like Prometheus and Grafana to monitor and manage the cluster.

Spark History Server + Dynamic Allocation

Observations

EC2 Spot Interruptions

The following diagram and log details from the Spark History Server showcase the behavior of a Spark application in the case of an EC2 Spot interruption.

Four Spark applications were launched in parallel in a cluster, and one of the Spot nodes was interrupted. A couple of executor pods were shut down in three of the four applications, but due to the resilient nature of Spark, new executors were launched and the applications finished at almost the same time.
Spark jobs

The Spark driver identified the executors that were shut down, handled their shuffle files, and relaunched the tasks that were running on those executors.

Dynamic Allocation

Dynamic Allocation works with the caveat that it is an experimental feature.

dynamic allocation

Cost Optimization

Cost optimization is achieved in several different ways in this tutorial:

  • Use of 100% Spot Instances for the Spark executors
  • Use of dynamicAllocation along with Cluster Autoscaler makes optimized use of resources and hence saves cost
  • Starting with one driver node and one executor node, and then scaling up on demand, reduces the waste of a continuously running cluster

Cluster Autoscaling

Cluster autoscaling is triggered, as designed, when there are pending (Spark executor) pods.

The Cluster Autoscaler logs can be fetched by:

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10

Cluster Autoscaler Logs 

Cleanup

If you are trying out the tutorial, run the following steps to make sure that you don’t encounter unwanted costs.

Delete the EKS cluster and the nodegroups with the following command:

eksctl delete cluster --name sparkonk8

Delete the Amazon S3 Access Policy with the following command:

aws iam delete-policy --policy-arn <<POLICY ARN>>

Delete the Amazon S3 Output Bucket with the following command:

aws s3 rb --force s3://<<S3_BUCKET>>

Conclusion

In this blog, I demonstrated how you can run Spark workloads on a Kubernetes cluster using Spot Instances, achieving scalability, resilience, and cost optimization. To cost optimize your Spark based big data workloads, consider running Spark applications on Kubernetes with EC2 Spot Instances.

How to monitor Windows and Linux servers and get internal performance metrics

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-monitor-windows-and-linux-servers-and-get-internal-performance-metrics/

This post was written by Dean Suzuki, Senior Solutions Architect.

Customers who run Windows or Linux instances on AWS frequently ask, “How do I know if my disks are almost full?” or “How do I know if my application is using all the available memory and is paging to disk?” This blog helps answer these questions by walking you through how to set up monitoring to capture these internal performance metrics.

Solution overview

If you open the Amazon EC2 console, select a running Amazon EC2 instance, and select the Monitoring tab, you can see Amazon CloudWatch metrics for that instance. Amazon CloudWatch is an AWS monitoring service. The Monitoring tab (shown in the following image) shows the metrics that can be measured external to the instance (for example, CPU utilization, network bytes in/out). However, to understand what percentage of the disk is being used or what percentage of the memory is being used, these metrics require an internal operating system view of the instance. AWS places an extra safeguard on gathering data inside a customer’s instance, so this capability is not enabled by default.

EC2 console showing Monitoring tab

To capture the server’s internal performance metrics, a CloudWatch agent must be installed on the instance. For Windows, the CloudWatch agent can capture any of the Windows performance monitor counters. For Linux, the CloudWatch agent can capture system-level metrics. For more details, please see Metrics Collected by the CloudWatch Agent. The agent can also capture logs from the server. The agent then sends this information to Amazon CloudWatch, where rules can be created to alert on certain conditions (for example, low free disk space) and automated responses can be set up (for example, perform backup to clear transaction logs). Also, dashboards can be created to view the health of your Windows servers.

There are four steps to implement internal monitoring:

  1. Install the CloudWatch agent onto your servers. AWS provides a service called AWS Systems Manager Run Command, which enables you to do this agent installation across all your servers.
  2. Run the CloudWatch agent configuration wizard, which captures what you want to monitor. These items could be performance counters and logs on the server. This configuration is then stored in AWS Systems Manager Parameter Store.
  3. Configure the CloudWatch agents to use the agent configuration stored in Parameter Store, using Run Command.
  4. Validate that the CloudWatch agents are sending their monitoring data to CloudWatch.

The following image shows the flow of these four steps.

Process to install and configure the CloudWatch agent

In this blog, I walk through these steps so that you can follow along. Note that you are responsible for the cost of running the environment outlined in this blog. So, once you are finished with the steps in the blog, I recommend deleting the resources if you no longer need them. For the cost of running these servers, see Amazon EC2 On-Demand Pricing. For CloudWatch pricing, see Amazon CloudWatch pricing.

If you want a video overview of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.

Deploy the CloudWatch agent

The first step is to deploy the Amazon CloudWatch agent. There are multiple ways to deploy the CloudWatch agent (see this documentation on Installing the CloudWatch Agent). In this blog, I walk through how to use the AWS Systems Manager Run Command to deploy the agent. AWS Systems Manager uses the Systems Manager agent, which is installed by default on each AWS instance. This AWS Systems Manager agent must be given the appropriate permissions to connect to AWS Systems Manager, and to write the configuration data to the AWS Systems Manager Parameter Store. These access rights are controlled through the use of IAM roles.

Create two IAM roles

IAM roles are identity objects to which you attach IAM policies. IAM policies define what access is allowed to AWS services. You can have users, services, or applications assume the IAM roles and get the assigned rights defined in the permissions policies.

To use Systems Manager, you typically create two IAM roles. The first role has permissions to write the CloudWatch agent configuration information to Systems Manager Parameter Store. This role is called CloudWatchAgentAdminRole.

The second role only has permissions to read the CloudWatch agent configuration from the Systems Manager Parameter Store. This role is called CloudWatchAgentServerRole.

For more details on creating these roles, please see the documentation on Create IAM Roles and Users for Use with the CloudWatch Agent.

Attach the IAM roles to the EC2 instances

Once you create the roles, you attach them to your Amazon EC2 instances. By attaching the IAM roles to the EC2 instances, you provide the processes running on the EC2 instance the permissions defined in the IAM role. In this blog, you create two Amazon EC2 instances. Attach the CloudWatchAgentAdminRole to the first instance that is used to create the CloudWatch agent configuration. Attach CloudWatchAgentServerRole to the second instance and any other instances that you want to monitor. For details on how to attach or assign roles to EC2 instances, please see the documentation on How do I assign an existing IAM role to an EC2 instance?.
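
If you prefer the CLI for this step, you can attach a role to a running instance through its instance profile. The instance ID below is a placeholder, and the example assumes the instance profile shares the role's name, which is the default when the role is created in the console:

aws ec2 associate-iam-instance-profile \
    --instance-id i-0123456789abcdef0 \
    --iam-instance-profile Name=CloudWatchAgentServerRole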

Install the CloudWatch agent

Now that you have set up the permissions, you can install the CloudWatch agent onto the servers that you want to monitor. For details on installing the CloudWatch agent using Systems Manager, please see the documentation on Download and Configure the CloudWatch Agent.
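
As an illustration of what the Run Command approach looks like from the CLI, the agent can typically be installed with the AWS-ConfigureAWSPackage document. The instance ID and Region below are placeholders:

aws ssm send-command \
    --document-name "AWS-ConfigureAWSPackage" \
    --targets "Key=InstanceIds,Values=i-0123456789abcdef0" \
    --parameters '{"action":["Install"],"name":["AmazonCloudWatchAgent"]}' \
    --region us-east-1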

Create the CloudWatch agent configuration

Now that you have installed the CloudWatch agent on your server, run the CloudWatch agent configuration wizard to create the agent configuration. For instructions on how to run the CloudWatch agent configuration wizard, please see this documentation on Create the CloudWatch Agent Configuration File with the Wizard. To establish a command shell on the server, you can use AWS Systems Manager Session Manager to establish a session to the server and then run the CloudWatch agent configuration wizard. If you want to monitor both Linux and Windows servers, you must run the CloudWatch agent configuration on a Linux instance and on a Windows instance to create a configuration file per OS type. The configuration is unique to the OS type.

To run the Agent configuration wizard on Linux instances, run the following command:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

To run the Agent configuration wizard on Windows instances, run the following commands:

cd "C:\Program Files\Amazon\AmazonCloudWatchAgent"

amazon-cloudwatch-agent-config-wizard.exe

Note for Linux instances: do not choose to collect collectd metrics in the agent configuration wizard unless you have collectd installed on your Linux servers. Otherwise, you may encounter an error.

Review the Agent configuration

The CloudWatch agent configuration generated from the wizard is stored in Systems Manager Parameter Store. You can review and modify this configuration if you need to capture extra metrics. To review the agent configuration, perform the following steps:

  1. Go to the console for the Systems Manager service.
  2. Choose Parameter Store in the left-hand navigation.
  3. You should see the parameters that were created by the CloudWatch agent configuration program. For Linux servers, the configuration is stored in AmazonCloudWatch-linux, and for Windows servers, the configuration is stored in AmazonCloudWatch-windows.

System Manager Parameter Store: Parameters created by CloudWatch agent configuration wizard

  4. Click on the parameter’s hyperlink (for example, AmazonCloudWatch-linux) to see all the configuration parameters that you specified in the configuration program.

In the following steps, I walk through an example of modifying the Windows configuration parameter (AmazonCloudWatch-windows) to add an additional metric (“Available Mbytes”) to monitor.

  1. Click the AmazonCloudWatch-windows parameter.
  2. In the parameter overview, scroll down to the “metrics” section and under “metrics_collected”, you can see the Windows performance monitor counters that will be gathered by the CloudWatch agent. If you want to add an additional perfmon counter, then you can edit and add the counter here.
  3. Press Edit at the top right of the AmazonCloudWatch-windows Parameter Store page.
  4. Scroll down in the Value section and look for “Memory.”
  5. After the "% Committed Bytes In Use" counter, put a comma "," and then press Enter to add a blank line. Then, put "Available Mbytes" on that line. The following screenshot demonstrates what this configuration should look like.

AmazonCloudWatch-windows parameter contents and how to add a new metric to monitor

  6. Press Save Changes.

To modify the Linux configuration parameter (AmazonCloudWatch-linux), you perform similar steps except you click on the AmazonCloudWatch-linux parameter. Here is additional documentation on creating the CloudWatch agent configuration and modifying the configuration file.

Start the CloudWatch agent and use the configuration

In this step, start the CloudWatch agent and instruct it to use your agent configuration stored in Systems Manager Parameter Store.

  1. Open another tab in your web browser and go to the Systems Manager console.
  2. Choose Run Command in the left-hand navigation of the Systems Manager console.
  3. Press Run command.
  4. In the search bar,
    • Select Document name prefix
    • Select Equal
    • Specify AmazonCloudWatch (Note the field is case sensitive)
    • Press enter

System Manager Run Command's command document entry field

  5. Select AmazonCloudWatch-ManageAgent. This is the command that configures the CloudWatch agent.
  6. In the command parameters section,
    • For Action, select Configure
    • For Mode, select ec2
    • For Optional Configuration Source, select ssm
    • For optional configuration location, specify the Parameter Store name: AmazonCloudWatch-windows for Windows instances or AmazonCloudWatch-linux for Linux instances. Note that the field is case sensitive. This tells the command to read the Parameter Store entry specified here.
    • For optional restart, leave yes
  7. For Targets, choose the target servers that you wish to monitor.
  8. Scroll down and press Run. The Run Command may take a couple of minutes to complete. Press the refresh button. The Run Command configures the CloudWatch agent by reading the configuration from Parameter Store and configuring the agent with those settings.

For more details on installing the CloudWatch agent using your agent configuration, please see this documentation on Installing the CloudWatch Agent on EC2 Instances Using Your Agent Configuration.
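
If you would rather script this step than use the console, the same configuration can typically be applied from the CLI with the AmazonCloudWatch-ManageAgent document. The instance ID is a placeholder, and the parameter name should match your Parameter Store entry (AmazonCloudWatch-linux or AmazonCloudWatch-windows):

aws ssm send-command \
    --document-name "AmazonCloudWatch-ManageAgent" \
    --targets "Key=InstanceIds,Values=i-0123456789abcdef0" \
    --parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["AmazonCloudWatch-linux"],"optionalRestart":["yes"]}' \
    --region us-east-1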

Review the data collected by the CloudWatch agents

In this step, I walk through how to review the data collected by the CloudWatch agents.

  1. In the AWS Management console, go to CloudWatch.
  2. Click Metrics on the left-hand navigation.
  3. You should see a custom namespace for CWAgent. Click on the CWAgent namespace. Please note that this might take a couple of minutes to appear. Refresh the page periodically until it appears.
  4. Then click the ImageId, InstanceId hyperlinks to see the counters under that section.

CloudWatch Metrics: Showing counters under CWAgent

  5. Review the metrics captured by the CloudWatch agent. Notice the metrics that are only observable from inside the instance (for example, LogicalDisk % Free Space). These types of metrics would not be observable without installing the CloudWatch agent on the instance. From these metrics, you could create a CloudWatch Alarm to alert you if they go beyond a certain threshold (a sketched CLI example follows this list). You can also add them to a CloudWatch Dashboard to review. To learn more about the metrics collected by the CloudWatch agent, see the documentation Metrics Collected by the CloudWatch Agent.
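
As a rough sketch of what such an alarm could look like from the CLI, the following alarms on the Windows LogicalDisk % Free Space counter. The instance ID, dimensions, and SNS topic ARN are placeholders, and the exact metric names and dimensions depend on your agent configuration (check the CWAgent namespace in the console for the exact set):

aws cloudwatch put-metric-alarm \
    --alarm-name low-free-disk-space \
    --namespace CWAgent \
    --metric-name "LogicalDisk % Free Space" \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 Name=objectname,Value=LogicalDisk Name=instance,Value=C: \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 10 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:111111111111:my-alert-topic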

Conclusion

In this blog, you learned how to deploy and configure the CloudWatch agent to capture the metrics on either Linux or Windows instances. If you are done with this blog, we recommend deleting the Systems Manager Parameter Store entries, the CloudWatch data, and then the EC2 instances to avoid further charges. If you would like a video tutorial of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.

Powering .NET 5 with AWS Graviton2: Benchmarks

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/powering-net-5-with-aws-graviton2-benchmark-results/

This post was authored by Kirk Davis, Developer Advocate for App Modernization 

In 2019, AWS announced new Amazon EC2 instance types powered by the AWS Graviton2 processor. The AWS Graviton2 processor is based on the ARM64 architecture leveraging 64-bit ARM Neoverse N1 cores. Since 2019, AWS has launched many new EC2 instances built on Graviton2, including general-purpose (M6g), compute-optimized (C6g), memory-optimized (R6g), and general-purpose burstable (T4g) types. These Graviton2 based instances provide up to 40% better price performance over their comparable generation x86-64 instances. These instance types use the same naming convention as other types, but with a “g” appended to the family. For example, a t4g.large, or a c6g.2xlarge. Many customers are already running workloads on these Graviton2 instances, including .NET Core applications. Note that I refer to the 64-bit x86-64 processors simply as "x86" throughout this blog post.

Organizations like AnandTech have done in-depth benchmarking of Graviton2 against x86-architecture EC2 instances and found that Graviton2 has a significant performance and cost advantage. Comparing similar instance families, the Graviton2 instances are about 20% less expensive per hour than Intel x86 instances with up to 40% better performance. With .NET 5 officially released in November, I thought it would be interesting to see what advantages Graviton2 has for .NET 5 web applications as a follow-up to the .NET 5 on AWS blog that AWS published earlier. Follow along with this blog to learn how I ran the benchmarking tests, the applications I chose to benchmark, and to see the results.

Overview

I decided to run some straight-forward .NET 5 benchmarks that tested ASP.NET Core under load for both x86-based and Graviton2 instances. ASP.NET Core runs application code in thread-pool threads, so it takes advantage of multiple cores to handle multiple requests concurrently. One thing to keep in mind is that x86-based EC2 instance types use simultaneous multi-threading, and a vCPU maps to a logical core. However, for Graviton2 instances a vCPU maps to a physical core. So, for these benchmarks, I used x86 and ARM64 instance types with 4 x vCPUs: m5.xlarge instance types, which have four logical (two physical) x86 cores, and m6g.xlarge instances, which have four physical ARM cores. I wanted to compare the latency and requests/second performance for different scenarios, and then compare the performance adjusted for the instances’ cost per hour. I used the per-hour pricing from the us-east-2 (Ohio) Region:

                 m5.xlarge    m6g.xlarge
Cost per hour    $0.192       $0.154
vCPU             4            4
RAM (GiB)        16           16

Benchmarks and testing framework

I used the open-source Crank software to run the benchmarks and gather results. Crank abstracts away many of the messy details in running benchmarks and delivers consistent results. From the GitHub page:

“Crank is the benchmarking infrastructure used by the .NET team to run benchmarks including (but not limited to) scenarios from the TechEmpower Web Framework Benchmarks.

Crank uses a controller (crank-controller), which communicates to one or more agents (crank-agent). The agents download, compile, and run the code, then report the results back to the controller." In this case, I used three agents: one each on the instances to be tested, and one on a test-runner instance (an m5.xlarge) that ran bombardier, a common load-testing tool that is already integrated into Crank. You can also choose wrk2, or other tools if you prefer (Crank’s readme files provide examples for both). I ran all the instances in the same Availability Zone (AZ) to minimize any other sources of latency. The setup looked like this:

benchmark environment setup

Note:    In order to use Crank’s agent with the .NET 5 release version, I made minor changes to its Startup.cs class. These changes forced Crank to pull down the correct .NET 5 SDK version, and fixed an issue where it wasn’t appending the correct build parameters for arm64 when compiling code on the m6g.xlarge instance. It’s possible the Microsoft.Crank.Agent project has been updated since I used it. I also updated all projects to .NET 5.
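
For reference, a controller invocation generally looks something like the following. The configuration file path, scenario name, and profile name here are placeholders that would map to your own Crank configuration, not the exact commands used for these runs:

crank --config ./benchmarks.yml \
      --scenario MvcJsonNet2k \
      --profile graviton2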

Benchmark tests

Since many of the .NET Core workloads customers are running in AWS are ASP.NET Core websites or APIs, I focused only on these types of applications. I selected the Mvc project from the ASP.NET Benchmarks GitHub repository. The controller in this project defines an “Entry” class, and then creates and returns instances of it as a List<Entry> (which gets serialized to JSON by ASP.NET Core). For the source code for these methods, please refer to the preceding GitHub links. In the project, the Crank configuration YAML file defines three scenarios (note that I used these scenarios but swapped out wrk for bombardier).

  • MvcJsonNet2k: calls JsonController’s Json2k() method (returns eight Entries)
  • MvcJsonOutput60k: calls JsonController’s JsonNk() method for 60,000 bytes
  • MvcJsonOutput2M: calls JsonController’s JsonNk() method for 2^21 (2,097,152) bytes
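
For reference, here is a minimal sketch of how a scenario is kicked off from the controller side. The configuration file and profile names are placeholders of my own, not values from the original post:

# run one scenario against the agents defined in your Crank configuration
crank --config benchmarks.yml --scenario MvcJsonNet2k --profile <your-profile>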

Additionally, I created another ASP.NET Core Web API application based on the boilerplate ASP.NET Web API project and added EF Core. I did this because many ASP.NET Core applications use Entity Framework Core (EF Core), and do more computationally expensive work than only serializing JSON. To isolate the performance of the two instances, I used the in-memory provider for EF Core, and populated a DbSet with weather summaries at startup. I modified the WeatherForecastController to encrypt each WeatherForecast’s Summary property using .NET’s RSACryptoServiceProvider class, and then added another controller that queries forecasts from the DbSet, and serializes them to strings. For that method, I added an asynchronous delay (using Task.Delay) to simulate querying a relational database. To run the tests, I created a Crank configuration YAML file that defines three scenarios:

  • AsyncParallelJson100: returns 100 forecasts from EF Core serialized to string using Text.Json
  • AsyncParallelJson500: returns 500 forecasts from EF Core serialized to string using Text.Json
  • ParallelEncryptWeather100: encrypts summaries for 100 forecasts and returns the forecasts as IEnumerable<WeatherForecast>

This application uses the 5.0.0 version of the Microsoft.EntityFrameworkCore and Microsoft.EntityFrameworkCore.InMemory NuGet packages. The following is the source code for the two methods I used in the tests:

JsonSerializeController’s Get method:

[HttpGet]
public async Task<IEnumerable<string>> Get(int count = 100)
{
    List<WeatherForecast> forecasts;
    List<string> jsons = new List<string>();

    using (var context = new WeatherContext())
    {
        forecasts = context.WeatherForecasts.Take(count).ToList();
    }
    await Task.Delay(5);
    Parallel.ForEach(forecasts, x => jsons.Add(JsonSerializer.Serialize(x)));

    return jsons;
}

WeatherForecastController’s Get method:

[HttpGet]
public IEnumerable<WeatherForecast> Get(int count = 100)
{
    List<WeatherForecast> forecasts;

    using (var context = new WeatherContext())
    {
        forecasts = context.WeatherForecasts.Take(count).ToList();
    }
    UnicodeEncoding ByteConverter = new UnicodeEncoding();

    using (RSACryptoServiceProvider RSA = new RSACryptoServiceProvider())
    {
        Parallel.ForEach(forecasts, x => x.EncryptedSummary = RSAEncrypt(ByteConverter.GetBytes(x.Summary), RSA.ExportParameters(false), false));
    }
    return forecasts;
}

Note:    The RSAEncrypt method was copied from the sample code in the RSACryptoServiceProvider’s docs.

Setting up the instances

For running the benchmarks, I selected the Amazon Machine Image (AMI) for Ubuntu Server 20.04 LTS, and chose “64-bit (x86)” for the m5.xlarge and “64-bit (Arm)” for the m6g.xlarge. I gave them both 20GB of Amazon Elastic Block Store (EBS) storage, and chose a security group with port 22 open to my home IP address, so that I could SSH into them. While it’s possible to install and use .NET 5 on Amazon Linux 2 (AL2), that’s not currently a supported Linux distribution for .NET 5 on ARM, and I wanted the same distribution for both x86 and ARM64. For details on launching Graviton2 instances from the AWS Management Console, please refer to the .NET 5 on AWS blog post from November 10, 2020.

Ubuntu 20.04 is a supported release for installing .NET 5 using apt-get, but ARM architectures are not yet supported. So instead – and to use the same method on both instances – I manually installed the .NET 5 SDK using the following commands, specifying the architecture-appropriate download link for the binaries*. Instructions for manually installing are also available at the prior “installing .NET 5” link.

# download the architecture-appropriate .NET 5 SDK binaries
curl -SL -o dotnet.tar.gz <link to architecture-specific binary file*>
# extract the SDK and put the dotnet launcher on the PATH
sudo mkdir -p /usr/share/dotnet
sudo tar -zxf dotnet.tar.gz -C /usr/share/dotnet
sudo ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet
# run in globalization-invariant mode so the ICU libraries aren't required
echo "export DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=true" >> ~/.bash_profile

Then, I used SCP to upload the source code for my benchmarking solution to the instances, and SSH’d onto both, using two tabs in the new Windows Terminal.

*At the time this blog was written, the binaries used were:
dotnet-sdk-5.0.100-linux-arm64.tar.gz
dotnet-sdk-5.0.100-linux-x64.tar.gz

Benchmark results

Benchmark runs and units

I used Crank to perform two runs of each of the six benchmarks on each of the two instances and took the average of the two runs for each. There was minimal variation between runs. For each test, I charted the latency in microseconds (μs), with the bars for MvcJsonOutput2M and ParallelEncryptWeather100 scaled by plotting μs/100, and bars for AsyncParallelJson100 and AsyncParallelJson500 scaled with μs/10. For latency, shorter bars are better.

I also charted the performance in requests/second, and the overall value as performance/dollar, where the performance is the requests/second, and dollars is the cost/hour of the given instance type. In order to have the bars legible on the same chart, some values were scaled as shown below the chart (the same scaling was applied to all values for a given benchmark). For both raw performance and performance/price, longer bars are better.

Note that I didn’t do any specific optimization for ARM64 or x86.

Summary of results

The Graviton2 instance had lower latency across the board for the tests I ran, with the m6g.xlarge (Graviton2) instance having up to 24.7% lower latency (for MvcJsonOutput2M) than the m5.xlarge (x86-64). It’s notable that in general, the more work the test method was doing, the bigger the advantage of Graviton2.

The results were broadly similar for requests/second, with Graviton2 delivering up to 31.6% better performance (for MvcJsonOutput2M). For the most computationally-expensive test – ParallelEncryptWeather100 – the Graviton2 instance churned out 16.6% more requests per second. And all of this is without considering the price difference. Also, not reflected in the charts is that the x86 instance had twice as many bad requests (average of 16) as the Graviton2 instance (average of 8) for the ParallelEncryptWeather100 test. ParallelEncryptWeather100 was the only test where there were any bad responses across all the tests.

When scaling the performance for the hourly price of each instance type, the differences are starker. The Graviton2 offers up to 64% more requests/second per hourly cost of the instance (for MvcJsonOutput2M). Even on the test with the least advantage (MvcJsonNet2k), the Graviton2 provided 30.8% better performance/cost, where performance is requests/second. These types of results can translate into significant savings for even modestly sized workloads.
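
As a quick sanity check on how those two figures relate: for MvcJsonOutput2M, roughly 1.316 times the requests/second multiplied by the m5/m6g price ratio of 0.192/0.154 (about 1.25) gives approximately 1.64, which is where the 64% performance/cost advantage comes from.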

Charts

chart showing mean latency for the benchmark

In the preceding chart, the mean latency is shown in micro-seconds (μs), with the values for some tests divided by either 10 or 100 in order to make all the bars visible in the chart. The Graviton2 instance had 24.7% lower latency for the MvcJsonOutput2M test, and had lower latency across all the tests.

chart showing raw performance for the benchmark

This second chart shows how the m6g.xlarge Graviton2 instance handled more requests for every test. The bars represent the raw requests/second for each test. For the MvcJsonOutput2M test, which serializes two megabytes to JSON, it handled 31.6% more requests per second, and was faster for every test I ran.

chart showing price/performance for benchmark test

This third chart uses the same performance values as the preceding one, but the m5.xlarge values are divided by its hourly cost ($0.192 in the Ohio Region), and the m6g.xlarge bars are divided by $0.154 (also for the Ohio Region). The Graviton2 instance handled 64% more requests per dollar for the MvcJsonOutput2M test, and provides much better performance per dollar across all the tests.

Conclusion

If you’re adopting .NET 5 for your applications, you have a variety of choices for deploying them in AWS. You can run them in containers in Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS) with or without AWS Fargate, you can deploy them as serverless functions in AWS Lambda, or deploy them onto EC2 using either x86-based or Graviton2-based instances.

For running scalable web applications built on ASP.NET Core 5.0, the new Graviton2 instance families offer significant performance advantages, and even more compelling performance/price advantages of up to 64% over the equivalent Intel x86 instance families without making any code changes. Coupled with the ARM64 performance improvements in .NET 5, moving from .NET Core 3.1 on x86 to .NET 5 on Graviton2 promises significant cost savings. It also allows developers to code and locally test on their x86-based development machines (or even new ARM-based macOS laptops), and to use their existing deployment mechanisms. If your application is still based on .NET Framework, consider using the AWS Porting Assistant for .NET to begin porting to .NET Core.

Learn more about AWS Graviton2 based instances.

 

Introducing retry strategies for AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/introducing-retry-strategies-for-aws-batch/

This post is contributed by Christian Kniep, Sr. Developer Advocate, HPC and AWS Batch.

Scientists, researchers, and engineers are using AWS Batch to run workloads reliably at scale, and to offload the undifferentiated heavy lifting in their day-to-day work. But even a small chance of failure somewhere in the stack means those failures must be mitigated, a reminder that infrastructure, middleware, and software are not error proof.

Many customers use Amazon EC2 Spot Instances to save up to 90% on their computing cost by leveraging unused EC2 capacity. When EC2 needs that capacity back, a Spot Instance can be reclaimed. While AWS Batch takes care of rescheduling the job on a different instance, this rescheduling should be handled differently depending on whether it is an application failure or an infrastructure event interrupting the job.

Starting today, customers can define how many retries are performed in cases where a task does not finish correctly. AWS Batch now allows customers to define custom retry conditions, so that failures like the interruption of an instance or an infrastructure agent are handled differently, and do not just exhaust the number of attempted retries.

In this blog, I show the benefits of custom retry with AWS Batch by using different error codes from a job to control whether it should be retried. I will also demonstrate how to handle infrastructure events like a failing container image download, or an EC2 Spot interruption.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (AWS CLI) to set up the following:

  1. IAM roles, policies, and profiles to grant access and permissions
  2. A compute environment (CE) to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on the CE
  4. Job definitions with different retry strategies, which use a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit jobs to show how you can handle different scenarios, such as infrastructure failure, application handling via error code or a middleware event.

Prerequisite

To make things easier, I first set up a couple of environment variables to have the information available for later use. I use the following code to set up the environment variables:

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=http://169.254.169.254/latest/meta-data
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')
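
Before moving on, it can help to confirm that the variables resolved to real values. This quick check is optional and not part of the original steps:

echo "Subnet: ${SUBNET_ID}  VPC: ${VPC_ID}  Region: ${AWS_REGION}"
echo "Account: ${AWS_ACCT_ID}  Default SG: ${AWS_SG_DEFAULT}"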

IAM

Because I am using the AWS CLI rather than the AWS Management Console, I must create the IAM roles manually.

Trust policies

IAM roles are defined to be used by an individual service. In the simplest case, I want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. The definition of which entity is able to use an IAM role is called a Trust Policy. To set up a Trust Policy for an IAM role, I use the following code snippet:

cat > ec2-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Instance role

With the IAM trust policy, I can now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role. I can set up this role with the following logic:

cat > svc-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

At this point, I have created the IAM roles and policies so that the instances and services are able to interact with the AWS API operations, including trust policies that define which services are meant to use them: Amazon EC2 for the ecsInstanceRole, and the AWS Batch service for the AWSBatchServiceRole.

Compute environment

Now, I am going to create a CE, which will launch instances to run the example jobs.

cat > compute-environment.json << EOF
{
  "computeEnvironmentName": "compute-0",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 2,
    "maxvCpus": 32,
    "desiredvCpus": 4,
    "instanceTypes": [ "m5.xlarge","m5.2xlarge","m4.xlarge","m4.2xlarge","m5a.xlarge","m5a.2xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-instances"},
    "ec2KeyPair": "batch-ssh-key",
    "bidPercentage": 0
  },
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
}
EOF
aws batch create-compute-environment --cli-input-json file://compute-environment.json

Once this is complete, my compute environment begins to launch instances. This takes a few minutes. I can use the following command to check on the status of the compute environment whenever I want:

aws batch describe-compute-environments |jq '.computeEnvironments[] |select(.computeEnvironmentName=="compute-0")'

The command uses jq to filter the output to only show the compute environment I just created.
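
If you only care about the status field, which should eventually read VALID once the compute environment is usable, you can narrow the same filter a little further (a small variation, not shown in the original post):

aws batch describe-compute-environments \
|jq -r '.computeEnvironments[] |select(.computeEnvironmentName=="compute-0") |.status'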

Job queue

Now that I have my compute environment up and running, I can create a job queue, which accepts job submissions and schedules the jobs to the compute environment.

cat > job-queue.json << EOF
{
  "jobQueueName": "queue-0",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "compute-0"
  }]
}
EOF
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. It is referenced in a job submission to specify the defaults of a job configuration, while some of the parameters can be overwritten when you submit.

Within the job definition, different retry strategies can be configured along with a maximum number of attempts for the job.
Three possible conditions can be used:

  • onExitCode evaluates non-zero exit codes
  • onReason is matched against middleware errors
  • onStatusReason can be used to react to infrastructure events such as an instance termination

Different conditions are assigned an action to either EXIT or RETRY the job. It is important to note that a job finishing with an exit code of zero will EXIT and not evaluate the retry conditions. The default behavior for all non-zero exit codes is the following:

{
  "onExitCode": "",
  "onStatusReason": "",
  "onReason": "*",
  "action": "RETRY"
}

This condition retries every job that does not succeed (exit code 0) until the attempts are exhausted.

Spot Instance interruptions

AWS Batch works great with Spot Instances, and customers are using this to reduce their compute cost. If Spot Instances become unavailable, instances are reclaimed by EC2, which can lead to one or more of my hosts being shut down. When this happens, the jobs running on those hosts are shut down due to an infrastructure event, not an application failure. Previously, separating these kinds of events from one another was only possible by catching the notification on the instance itself or through CloudWatch Events. Now with custom retry conditions, you don’t have to rely on instance notifications or CloudWatch Events.

Using the job definition below, the job is restarted if the instance running the job gets shut down, which includes the termination due to a Spot Instance reclaim. The additional condition makes sure that the job exits whenever the exit code is not zero, otherwise the job would be rescheduled until the attempts are exhausted (see default behavior above).

cat > jdef-spot.json << EOF
{
    "jobDefinitionName": "spot",
    "type": "container",
    "containerProperties": {
        "image": "alpine:latest",
        "vcpus": 2,
        "memory": 256,
        "command":  ["sleep","600"],
        "readonlyRootFilesystem": false
    },
    "retryStrategy": { 
        "attempts": 5,
        "evaluateOnExit": 
        [{
            "onStatusReason" :"Host EC2*",
            "action": "RETRY"
        },{
  		  "onReason" : "*"
            "action": "EXIT"
        }]
    }
}
EOF
aws batch register-job-definition --cli-input-json file://jdef-spot.json

To simulate a Spot Instance reclaim, I submit a job, and then manually shut down the host the job is running on. This triggers my condition, which asks AWS Batch to make 5 attempts to finish the job before marking it as failed.

When I use the AWS CLI to describe my job, it displays the number of attempts to retry.

By shutting down my instance, the job returns to the status RUNNABLE and will be scheduled again until it succeeds or reaches the maximum attempts defined.
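
Here is a sketch of how to inspect this from the CLI, assuming you captured the job ID returned by submit-job (the placeholder is yours to fill in):

aws batch describe-jobs --jobs <job-id> \
|jq '.jobs[0] | {status: .status, attemptsSoFar: (.attempts | length)}'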

Exit code mitigation

I can also use the exit code to decide which mitigation I want to use based on the exit code of the job script or application itself.

To illustrate this, I can create a new job definition that uses a container image that exits on a random exit code between 0 and 3. Traditionally, an exit code of 0 means success, and won’t trigger this retry strategy. For all other (nonzero) exit codes the retry strategy is evaluated. In my example, 1 or 2 reflect situations where a retry is needed, but an exit code of 3 means that AWS Batch should let the job fail.

cat > jdef-randomEC.json << EOF
{
    "jobDefinitionName": "randomEC",
    "type": "container",
    "containerProperties": {
        "image": "qnib/random-ec:2020-10-13.3",
        "vcpus": 2,
        "memory": 256,
        "readonlyRootFilesystem": false
    },
    "retryStrategy": { 
        "attempts": 10,
        "evaluateOnExit": 
        [{
            "onExitCode": "1",
            "action": "RETRY"
        },{
            "onExitCode": "2",
            "action": "RETRY"
        },{
            "onExitCode": "3",
            "action": "EXIT"
        }]
    }
}
EOF
aws batch register-job-definition --cli-input-json file://jdef-randomEC.json

A submitted job retries until it exits with code 0 (success), exits with code 3 (failure), or the attempts are exhausted (in this case, 10 of them).

aws batch submit-job  --job-name randomEC-$(date +"%F_%H-%M-%S") --job-queue queue-0   --job-definition randomEC:1

The output of a job submission shows the job name and the job id.

In this case, the exit code is 1, so the job is requeued.
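
To see which exit codes the individual attempts produced, the same describe-jobs call can be filtered on the attempts array (again a sketch with a placeholder job ID):

aws batch describe-jobs --jobs <job-id> \
|jq '.jobs[0].attempts[].container.exitCode'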

Container image pull failure

The first example showed an error on the infrastructure layer and the second showed how to handle errors on the application layer. In this last example, I show how to handle errors that are introduced in the middleware layer, in this case: the container daemon.

This might happen if your Docker registry is down or having issues. To demonstrate this, I used an image name that is not present in the registry. In that case, the job should not get rescheduled only to fail again immediately.

The following job definition again allows up to 10 attempts for a job, except when the container cannot be pulled, in which case the job fails directly.

cat > jdef-noContainer.json << EOF
{
    "jobDefinitionName": "noContainer",
    "type": "container",
    "containerProperties": {
        "image": "no-container-image",
        "vcpus": 2,
        "memory": 256,
        "readonlyRootFilesystem": false
    },
    "retryStrategy": { 
        "attempts": 10,
        "evaluateOnExit": 
        [{
            "onReason": "CannotPullContainerError:*",
            "action": "EXIT"
        }]
    }
}
EOF
aws batch register-job-definition --cli-input-json file://jdef-noContainer.json

Note that the job defines an image name (“no-container-image”) that is not present in the registry. The job is set up to fail when trying to download the image, and would do so repeatedly if AWS Batch kept trying.
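
To reproduce this yourself, submit a job from the new definition using the same pattern as the earlier submission:

aws batch submit-job --job-name noContainer-$(date +"%F_%H-%M-%S") --job-queue queue-0 --job-definition noContainer:1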

Even though the job definition has 10 attempts configured for this job, it fell straight through to FAILED because the retry strategy sets the action to EXIT when a CannotPullContainerError occurs. Many of the error codes I can create conditions for are documented in the Amazon ECS user guide (e.g. task error codes / container pull errors).

Conclusion

In this blog post, I showed three different scenarios that leverage the new custom retry features in AWS Batch to control when a job should exit or get rescheduled.

By defining retry strategies you can react to an infrastructure event (like an EC2 Spot interruption), an application signal (via the exit code), or an event within the middleware (like a container image not being available).

This new feature allows you to have fine grained control over how your jobs react to different error scenarios.

Fire Dynamics Simulation CFD workflow using AWS ParallelCluster, Elastic Fabric Adapter, Amazon FSx for Lustre and NICE DCV

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/

This post was written by Kevin Tuil, AWS HPC consultant.

Modeling fires is key for many industries, from the design of new buildings and evacuation procedures for trains, planes, and ships, to predicting the spread of wildfires. Modeling these fires is complex. It involves both the need to model the three-dimensional unsteady turbulent flow of the fire and the many potential chemical reactions. To achieve this, the fire modeling community has moved to higher-fidelity turbulence modeling approaches such as Large Eddy Simulation, which requires both significant temporal and spatial resolution. As a result, the computational cost for these simulations is typically on the order of days to weeks on a single workstation.
While there are a number of software packages available, one of the most popular is the open-source Fire Dynamics Simulator (FDS), developed by the National Institute of Standards and Technology (NIST).

In this blog, I focus on how AWS High Performance Computing (HPC) resources (e.g., AWS ParallelCluster, Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and Amazon S3) allow FDS users to scale up beyond a single workstation to hundreds of cores, achieving simulation times of hours rather than days or weeks. I outline the architecture needed, and provide scripts and templates to compile FDS and run your simulation.

Service and solution overview

AWS ParallelCluster

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy to use, even if you are new to the cloud. AWS recently released AWS ParallelCluster 2.9.1 and its user guide, which is the version I use in this blog.

These three AWS HPC resources are optimal for Fire Dynamics Simulation. Together, they provide easy deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel file system.

Elastic Fabric Adapter

EFA is a critical service that provides low latency and high-bandwidth 100 Gbps network communication. EFA allows applications to scale at the level of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud. Computational Fluid Dynamics (CFD), among other tightly coupled applications, is an excellent candidate for the use of EFA.

Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed, high-performance file system, optimized for fast processing workloads, like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises seamlessly and exceptionally fast. For example, you can launch and run a file system that provides sub-millisecond latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low-latency unleash innovation at an unparalleled pace. This blog post uses the latest version of Amazon FSx for Lustre, which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user id. Additionally, the latest version also includes a new backup feature that allows you to back up your files to an S3 bucket.

Solution and steps

The overall solution that I deploy in this blog is represented in the following diagram:

solution overview diagram

Step 1: Access to AWS Cloud9 terminal and upload data

There are two ways to start using AWS ParallelCluster. You can either install AWS CLI or turn on AWS Cloud9, which is a cloud-based integrated development environment (IDE) that includes a terminal. For simplicity, I use AWS Cloud9 to create the HPC cluster. Please refer to this link to proceed to AWS Cloud9 set up and to this link for AWS CLI setup.

Once logged into your AWS Cloud9 instance, the first thing you want to create is the S3 bucket. This bucket is key to exchange user data in and out from the corporate data center and the AWS HPC cluster. Please make sure that your bucket name is unique globally, meaning there is only one worldwide across all AWS Regions.

aws s3 mb s3://fds-smv-bucket-unique
make_bucket: fds-smv-bucket-unique

Download the latest FDS-SMV Linux version package from the official NIST website. It looks something like: FDS6.7.4_SMV6.7.14_lnx.sh

For the geometry, it should be renamed to “geometry.fds”, and must be uploaded to your AWS Cloud9 or directly to your S3 bucket.

Please note that once the FDS-SMV package has been downloaded locally to the instance, you must upload it to the S3 bucket using the following command.

aws s3 cp FDS6.7.4_SMV6.7.14_lnx.sh s3://fds-smv-bucket-unique
aws s3 cp geometry.fds s3://fds-smv-bucket-unique

You use the same S3 bucket to install FDS-SMV later on with the Amazon FSx for Lustre File System.

Step 2: Set up AWS ParallelCluster

You can install AWS ParallelCluster running the following command from your AWS Cloud9 instance:

sudo pip install aws-parallelcluster

Once it is installed, you can run the following command to check the version:

pcluster version 

At the time of writing this blog, 2.9.1 is the most up-to-date version.
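
If you want to make sure you install exactly the release used in this blog, pip lets you pin the version (optional):

sudo pip install "aws-parallelcluster==2.9.1"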

Then use the text editor of your choice and open the configuration file as follows:

vim ~/.parallelcluster/config

Replace the placeholder values in angle brackets, if not yet filled in, with your own information and save the configuration file.

[aws]
aws_region_name = <AWS-REGION>

[global]
sanity_check = true
cluster_template = fds-smv-cluster
update_check = true

[vpc public]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[cluster fds-smv-cluster]
key_name = <Key-Name>
vpc_settings = public
compute_instance_type=c5n.18xlarge
master_instance_type=c5.xlarge
initial_queue_size = 0
max_queue_size = 100
scheduler=slurm
cluster_type = ondemand
s3_read_write_resource=arn:aws:s3:::fds-smv-bucket-unique*
placement_group = DYNAMIC
placement = compute
base_os = alinux2
tags = {"Name" : "fds-smv"}
disable_hyperthreading = true
fsx_settings = fsxshared
enable_efa = compute
dcv_settings = hpc-dcv

[dcv hpc-dcv]
enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://fds-smv-bucket-unique
imported_file_chunk_size = 1024
export_path = s3://fds-smv-bucket-unique

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Let’s review the different sections of the configuration file and explain their role:

  • scheduler: Supported job schedulers are SGE, TORQUE, SLURM and AWS Batch. I have selected SLURM for this example.
  • cluster_type: You have the choice between On-Demand (ondemand) or Spot Instances (spot) for your compute instances. For On-Demand, instances are available for use without condition (if available in the Region selected) at a certain price per hour with the pay-as-you-go model, meaning that as soon as they are started, they are reserved for your utilization. For Spot Instances, you can take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. You can use Spot Instances for various stateless, fault-tolerant, or flexible applications such as HPC. For more information about Spot Instances, feel free to visit this webpage.
  • s3_read_write_resource: This parameter allows you to read and write objects directly on your S3 bucket from the cluster you created without additional permissions. It acts as a role for your cluster, allowing you access to your specified S3 bucket.
  • placement_group: Use DYNAMIC to ensure that your instances are located as physically close to one another as possible. Close placement minimizes the latency between compute nodes and takes advantage of EFA’s low latency networking.
  • placement: By selecting compute you only enforce compute instances to be placed within the same placement group, leaving the head node placement free.
  • compute_instance_type: Select c5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC applications. Note that EFA is supported only for specific instance types. Please visit currently supported instances for more information.
  • master_instance_type: This can be any instance type. As traffic between head and compute nodes is relatively small, and the head node runs during the entire lifetime of the cluster, I use c5.xlarge because it is inexpensive and is a good fit for this use case.
  • initial_queue_size: You start with no compute instances after the HPC cluster is up. This means that any new job submitted has some delay (time for the nodes to be powered on) before it is seen as available by the job scheduler. This helps you pay for what you use and keeps costs as low as possible.
  • max_queue_size: Limit the maximum compute fleet to 100 instances. This allows you room to scale your jobs up to a large number of cores, while putting a limit on the number of compute nodes to help control costs.
  • base_os: For this blog, select Amazon Linux 2 (alinux2) as a base OS. Currently we also support Amazon Linux (alinux), CentOS 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) with EFA.
  • disable_hyperthreading: This setting turns off hyperthreading (true) on your cluster, which is the right configuration in this use case.
  • [fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory is mounted, the storage capacity for the file system, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.
  • enable_efa: Set to compute to enable EFA on the compute instances, since this is a tightly coupled CFD simulation use case.
  • dcv_settings: With AWS ParallelCluster, you can use NICE DCV to support your remote visualization needs.
  • [dcv hpc-dcv]: This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.
  • import_path: This parameter makes all the objects that are in the S3 bucket when the cluster is created visible directly from the FSx for Lustre file system. In this case, you are able to access the FDS-SMV package and the geometry under the /fsx mounted folder.
  • export_path: This parameter is useful for backup purposes using Data Repository Tasks. I share more details about this in step 7 (optional).

Step 3: Create the HPC cluster and log in

Now, you can create the HPC cluster, named fds-smv. It takes around 10 minutes to complete, and you can see the status changing as it goes through the different AWS CloudFormation template steps. At the end of creation, two IP addresses are displayed: a public IP and/or a private IP, depending on your network choice.

pcluster create fds-smv
Creating stack named: parallelcluster-fds-smv
Status: parallelcluster-fds-smv - CREATE_COMPLETE                               
MasterPublicIP: X.X.X.X
ClusterUser: ec2-user
MasterPrivateIP: X.X.X.X

In order to log in, you must use the key you specified in the AWS ParallelCluster configuration file before creating the cluster:

pcluster ssh fds-smv -i <Key-Name>

You should now be logged in as an ec2-user (since we are using Amazon Linux 2 base OS).

Step 4: Install FDS-SMV package

Now that the HPC cluster using AWS ParallelCluster is set up, it is time to install the FDS-SMV package. In the prior steps, you uploaded both the FDS-SMV package and the geometry to your S3 bucket. Since you set “import_path” to that bucket, they are already available on the Amazon FSx for Lustre storage under /fsx.

Run the script as follows and select /fsx/fds-smv as final target for installation:

cd /fsx
./FDS6.7.4_SMV6.7.14_lnx.sh
[ec2-user@head-node fsx]$ ./FDS6.7.4_SMV6.7.14_lnx.sh

Installing FDS and Smokeview  for Linux

Options:
  1) Press <Enter> to begin installation [default]
  2) Type "extract" to copy the installation files to:
     FDS6.7.4_SMV6.7.14_lnx.tar.gz
 

FDS install options:
  Press 1 to install in /home/ec2-user/FDS/FDS6 [default]
  Press 2 to install in /opt/FDS/FDS6
  Press 3 to install in /usr/local/bin/FDS/FDS6
  Enter a directory path to install elsewhere
/fsx/fds-smv

It is important to source the following scripts, which are part of the installed package, and check that the installation succeeded with the correct versions. Here is the output you should get:

[ec2-user@head-node ~]$ source /fsx/fds-smv/bin/SMV6VARS.sh
[ec2-user@head-node ~]$ source /fsx/fds-smv/bin/FDS6VARS.sh
[ec2-user@head-node ~]$ fds -version
FDS revision       : FDS6.7.4-0-gbfaa110-release
MPI library version: Intel(R) MPI Library 2019 Update 4 for Linux* OS

[ec2-user@head-node ~]$ smokeview -version

Smokeview  SMV6.7.14-0-g568693b-release - Mar  9 2020

Revision         : SMV6.7.14-0-g568693b-release
Revision Date    : Wed Mar 4 23:13:42 2020 -0500
Compilation Date : Mar  9 2020 16:31:22
Compiler         : Intel C/C++ 19.0.4.243
Checksum(SHA1)   : e801eace7c6597dc187739e51ba6f546bfde4e48
Platform         : LINUX64

Important notes:

The way the FDS-SMV package has been installed is the default installation. Binaries are already compiled and Intel MPI libraries are embedded as part of the installation package. It is what one would call a self-contained application. For further builds and source code, please visit this webpage.

Step 5: Running the fire dynamics simulation using FDS

Now that everything is installed, it is time to create the SLURM submission script. In this step, you take advantage of the FSx for Lustre File System, the compute-optimized instance, and the EFA network to maximize simulation performance.

cd /fsx/
vi fds-smv.sbatch

Here is the information you should specify in your submission script:

#!/bin/bash
#SBATCH --job-name=fds-smv-job
#SBATCH --ntasks=<Total number of MPI processes>
#SBATCH --ntasks-per-node=36
#SBATCH --output=%x_%j.out

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

module load intelmpi 

export OMP_NUM_THREADS=1
export I_MPI_PIN_DOMAIN=omp

cd /fsx/<results>

time mpirun -ppn 36 -np <Total number of MPI processes>  fds geometry.fds

Replace <results> with the directory name of your choice, and don’t forget to copy the geometry.fds file into it before submitting your job. Once ready, save the file and submit the job using the following command:

sbatch fds-smv.sbatch 

If you decided to build your HPC cluster with c5n.18xlarge instances, the number of MPI processes per node is 36, since you turned off hyperthreading and the instance has 36 physical cores. That is the meaning of the “#SBATCH --ntasks-per-node=36” line.

For any run exceeding 36 MPI processes, the job is split among multiple instances and takes advantage of EFA for internode communication.

It is important to note that FDS requires the number of MPI processes to be equal to the number of meshes in the input geometry (geometry.fds in this scenario). If the number of meshes in the input geometry cannot be modified, OpenMP threads can be enabled to increase performance efficiently, using up to four OpenMP threads across four CPU cores attached to one MPI process.
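
As an illustration only (not taken from the original post), if you enabled four OpenMP threads per MPI rank on a c5n.18xlarge with 36 physical cores, the relevant lines of the submission script might change along these lines:

#SBATCH --ntasks-per-node=9     # 36 physical cores / 4 OpenMP threads per rank
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp

time mpirun -ppn 9 -np <Total number of MPI processes> fds geometry.fds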

Please read best practices provided by NIST for that topic on their user guide.

In order to take advantage of the distributed computing capability of FDS, it is mandatory to work first on the input geometry, and divide it into the appropriate number of meshes. It is also highly advised to evenly distribute the number of cells/elements per mesh across all meshes. This best practice optimizes the load balancing for each CPU core.

Step 6: Visualizing the results using NICE DCV and SMV

In order to visualize results, you must connect to the head node using NICE DCV streaming protocol.

As a reminder, the current instance type for the head node is a c5.xlarge, which is not a graphics-accelerated instance. For heavy and GPU intensive visualization, it is important to set up a more appropriate instance such as the G4 instance group.

Go back to your AWS Cloud9 instance, open a new terminal side by side to your session connected to your AWS HPC cluster, and enter the following command in the terminal:

pcluster dcv connect fds-smv -k <Key-Name>

You are provided a one-time HTTPS URL available for a short period of time in order to connect to your head node using the NICE DCV protocol.

Once connected, open the terminal inside your session and source the FDS-SMV scripts as before:

source /fsx/fds-smv/bin/FDS6VARS.sh
source /fsx/fds-smv/bin/SMV6VARS.sh

Navigate to your <results> folder and start SMV with your result.

I have selected one of the geometries named fire_whirl_pool.fds in the Examples folder, part of the default FDS-SMV installation package located here:

/fsx/fds-smv/Examples/Fires/fire_whirl_pool.fds

You can find other scenarios under the Examples folder to run some more use cases if you did not already choose your geometry.fds file.

Now you can run SMV and visualize your results:

smokeview fire_whirl_pool.smv

SMV (smokeview) takes .smv files as input; please replace the file name with your own. If you have already chosen your geometry.fds, then run the following command:

smokeview geometry.smv

The application then opens as follows, and you can visualize the results. The following image is an output of the SOOT DENSITY of the 3D smoke.

fire simulation picture

Step 7 (optional): Back up your FDS-SMV results to an S3 bucket

First, update the AWS CLI to its most recent version; this feature requires version 1.16.309 or above.

After running your FDS-SMV simulation, you can back up your data in /fsx to the S3 bucket you used earlier to upload the installation package, and input files using Data Repository Tasks.

Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and your S3 bucket. One of the jobs is to export your changed file system contents back to its linked S3 bucket.

Open your AWS Cloud9 terminal and exit the HPC head node cluster. Retrieve your Amazon FSx for Lustre ID using:

aws fsx describe-file-systems

It looks something like fs-0533eebf1148fc8dd. Then create a backup of the data as follows:

aws fsx create-data-repository-task --file-system-id fs-0533eebf1148fc8dd --type EXPORT_TO_REPOSITORY --paths results --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://fds-smv-bucket-unique/

The following are definitions about the command parameters:

  • file-system-id: Your file system ID.
  • type EXPORT_TO_REPOSITORY: Exports the data back to the S3 bucket.
  • paths results: The directory you want to export to your S3 bucket. If you have more than one folder to back up, use a comma-separated notation such as: results1,results2,…
  • Format=REPORT_CSV_20191124: This is currently the only report format that Amazon FSx for Lustre supports, so keep it as shown.

You can check the backup status by running:

aws fsx describe-data-repository-tasks

Please wait for the copy to complete; once finished, you should see "Lifecycle": "SUCCEEDED" on the Lifecycle line.
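
If you would rather watch just that field, a jq filter over the same call does the trick (a sketch; it assumes the task you just created is the first one returned):

aws fsx describe-data-repository-tasks \
|jq -r '.DataRepositoryTasks[0].Lifecycle'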

Also go back to your S3 bucket, and your folder(s) should appear with all the files correctly uploaded from your /fsx folder you specified.

In terms of data management, Amazon S3 is an important service. You started by uploading the installation package and geometry files from an external source, such as your laptop or an on-premises system. You then made these files available to the AWS HPC cluster under the Amazon FSx for Lustre file system and ran the simulation. Finally, you backed up the results from Amazon FSx for Lustre to Amazon S3. You can also decide to download the results from Amazon S3 back to your local system if needed.

Step 8: Delete your AWS resources created during the deployment of this blog

After your run is completed and your data backed up successfully (Step 7 is optional) on your S3 bucket, you can then delete your cluster by using the following command in your Cloud9 terminal:

pcluster delete fds-smv

Warning:

If you run the preceding command, all resources you created during this blog are automatically deleted, except for your Cloud9 session and the data in the S3 bucket you created earlier.

Your S3 bucket still contains your input “geometry.fds” and your installation package “FDS6.7.4_SMV6.7.14_lnx.sh” files.

If you selected to back up your data during Step 7 (optional), then your S3 bucket also contains that data on top of the two previous files mentioned above.

If you want to delete your S3 bucket and all the data mentioned above, go to the AWS Management Console, select the S3 service, select your S3 bucket, and choose Delete at the top of the page.

If you want to terminate your Cloud9 session, go to the AWS Management Console, select the Cloud9 service, select your environment, and choose Delete in the top-right section.

After performing these operations, there will be no more resources running on AWS related to this blog.

Conclusion

I showed that AWS ParallelCluster, Amazon FSx for Lustre, EFA, and Amazon S3 are key AWS services and features for HPC workloads such as CFD and in particular for FDS.

You can achieve simulation times of hours on AWS rather than days or weeks on a single workstation.

Please visit this workshop for a more in-depth tutorial on running Fire Dynamics Simulation on AWS, and check out our dedicated HPC homepage.

 

Custom logging with AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/custom-logging-with-aws-batch/

This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch. 

For HPC workloads, visibility into job logs is important not only to debug a job that failed, but also to gain insight into a running job, track its trajectory, and use that information to influence the configuration of the next job or terminate a job that has gone off track.

With AWS Batch, customers are able to run batch workloads at scale, reliably and with ease, as this managed service takes care of the undifferentiated heavy lifting. Customers can then focus on submitting jobs and getting work done. Customers told us that at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs, as they all ended up in the same log group in Amazon CloudWatch.

With the new release of custom logging driver support, customers are now able to adjust how the job output is logged. Not only can they customize the Amazon CloudWatch settings, they can also use external logging frameworks such as Splunk, fluentd, json-file, syslog, gelf, and journald.

This allows AWS Batch jobs to use the existing logging systems they are accustomed to, with fine-grained control of the log data for debugging and access control purposes.

In this blog, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example will customize the Amazon CloudWatch log group, the second will log to Splunk, an external logging service.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (CLI) to set up the following:

  1. IAM roles, policies, and profiles to grant access and permissions
  2. A compute environment to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on a compute environment
  4. A job definition, which uses a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit a job and send logs to a customized CloudWatch log-group and Splunk.

Prerequisite

To make things easier, I first set a couple of environment variables to have the information handy for later use. I use the following code to set up the environment variables.

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=http://169.254.169.254/latest/meta-data
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')

IAM

Because this walkthrough uses the AWS CLI rather than the AWS Management Console, you must create the IAM roles manually.

Trust Policies

IAM roles are defined to be used by a certain service. In the simplest case, you want a role to be used by Amazon EC2 – the service that provides the compute capacity in the cloud. The definition of which entity is able to use an IAM role is called a trust policy. To set up a trust policy for an IAM role, use the following code snippet.

cat > ec2-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Instance role

With the IAM trust policy, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service Role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role.  You can set up this role with the following logic.

cat > svc-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

In addition to dealing with Amazon ECS, the instance role must be able to create and write to Amazon CloudWatch log groups. To control which log group names are used, a condition is attached.

Create and attach a policy that allows the instance role to create a new log group.

cat > policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "logs:CreateLogGroup"
    ],
    "Resource": "*",
    "Condition": {
      "StringEqualsIfExists": {
        "batch:LogDriver": ["awslogs"],
        "batch:AWSLogsGroup": ["/aws/batch/custom/*"]
      }
    }
  }]
}
EOF
aws iam create-policy --policy-name batch-awslog-policy \
    --policy-document file://policy.json
aws iam attach-role-policy --policy-arn arn:aws:iam::${AWS_ACCT_ID}:policy/batch-awslog-policy --role-name ecsInstanceRole

At this point, I created the IAM roles and policies so that the instance and service are able to interact with the AWS APIs, including trust policies that define which services are meant to use them: Amazon EC2 for the ecsInstanceRole, and the AWS Batch service for the AWSBatchServiceRole.

Compute environment

Now, I am going to create a compute environment, which is going to spin up an instance (one vCPU target) to run the example job in.

cat > compute-environment.json << EOF
{
  "computeEnvironmentName": "od-ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 1,
    "maxvCpus": 8,
    "desiredvCpus": 1,
    "instanceTypes": ["m5.xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-compute"},
    "bidPercentage": 0
  },
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
}
EOF
aws batch create-compute-environment --cli-input-json file://compute-environment.json  

Once this section is complete, a compute environment is spun up in the background. This takes a moment. You can use the following command to check on the status of the compute environment.

aws batch  describe-compute-environments

Once it is ENABLED and VALID, we can continue by setting up the job queue.

Job Queue

Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.

cat > job-queue.json << EOF
{
  "jobQueueName": "jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "od-ce"
  }]
}
EOF
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.

cat > job-definition.json << EOF
{
  "jobDefinitionName": "alpine-env",
  "type": "container",
  "containerProperties": {
  "image": "alpine",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "readonlyRootFilesystem": true,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": { 
      "awslogs-region": "${AWS_REGION}", 
      "awslogs-group": "/aws/batch/custom/env-queue",
      "awslogs-create-group": "true"}
    }
  }
}
EOF
aws batch register-job-definition --cli-input-json file://job-definition.json

Job Submission

Using the above job definition, you can now submit a job.

aws batch submit-job \
  --job-name test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-env:1

Now, you can check the ‘Log Group’ in CloudWatch. Go to the CloudWatch console and find the ‘Log Group’ section on the left.

log groups in cloudwatch

Now, click on the log group defined above, and you should see the output of the job. This allows you to debug anything that went wrong within the container, or to process the logs and create alarms and reports.

cloudwatch log events
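
If you prefer the AWS CLI over the console, the same log events can be read with the aws logs commands (a sketch; the log stream name differs per job):

# list the log streams written to the custom log group
aws logs describe-log-streams --log-group-name /aws/batch/custom/env-queue
# fetch the events from one of the returned streams
aws logs get-log-events --log-group-name /aws/batch/custom/env-queue --log-stream-name <log-stream-name>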

Splunk

Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.

To send logs to Splunk, create an additional job-definition with the Splunk token and URL. Please adjust the splunk-url and splunk-token to match your Splunk setup.

{
  "jobDefinitionName": "alpine-splunk",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": false,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-url>",
        "splunk-token": "XXX-YYY-ZZZ"
      }
    }
  }
}
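
The original post does not show the registration and submission steps for this definition, but following the same pattern as before they would look something like this (the file name jdef-splunk.json is my own choice):

# save the JSON above as jdef-splunk.json, then register and submit it
aws batch register-job-definition --cli-input-json file://jdef-splunk.json
aws batch submit-job \
  --job-name splunk-test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-splunk:1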

This forwards the logs to Splunk, as you can see in the following image.

forward to splunk

Conclusion

This blog post showed you how to apply custom logging to AWS Batch using the awslogs and Splunk logging drivers. While these are two important logging drivers, please head over to the documentation to learn about fluentd, syslog, json-file, and other drivers, and to find the best match for your current logging infrastructure.

 

Deploying your first 5G enabled application with AWS Wavelength

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/deploying-your-first-5g-enabled-application-with-aws-wavelength/

This post was written by Mike Coleman, Senior Developer Advocate, Twitter handle: @mikegcoleman

Today, AWS released AWS Wavelength. Wavelength allows you to deploy applications and services at the edge of a mobile carrier’s 5G network. By combining the benefits of 5G, such as high bandwidth and low latency, with the ability to use AWS tools and services you’re already familiar with, you’re able to build next generation edge applications quickly and easily.

Rather than go into more depth about Wavelength in this blog, I’d recommend reading Jeff Barr’s blog post. His post goes into detail about why we built Wavelength, and how you can get started with deploying AWS resources in a Wavelength Zone.

In this blog, I walk you through deploying one of the most common Wavelength use cases: machine learning inference.

Why inference at the edge?

One of the tradeoffs with machine learning applications is system responsiveness. If your application must be highly responsive, you may need to deploy your inference processing application close to the end user. In the case of mobile devices, this could mean that the inference processing takes place on the device itself. This type of additional processing demand on the device often results in reduced device battery life among other tradeoffs. Additionally, if you need to update your machine learning model, you must push out an update to all the devices running your application.

As I mentioned earlier, one of the key benefits of 5G and Wavelength is significantly lower latencies compared to previous generation mobile networks. For edge applications, this implies you can actually perform inference processing in a Wavelength zone with near real-time responsiveness to the mobile device. By moving the inference processing to the Wavelength zone, you reduce power consumption and battery drain on the mobile device. Additionally, you can simplify application updates.  If you need to make a change to your training model, you simply update your servers in the Wavelength Zone instead of having to ship a new version to all the devices running your code.

Solution Overview

architecture of the wavelength zone

The following tutorial guides you through deploying an object detection application that consists of the following components:

  • A Wavelength-hosted API endpoint (using Flask)
  • A Wavelength-hosted Inference server (running Torchserve)
  • A React web app accessed via the browser on a mobile device running on the carrier’s 5G network.
  • A server that acts as a bastion host allowing you to SSH into your other instances and as a web server for the React web application.

The API server is built using Python and Flask, and runs on a t3.medium instance based upon a standard Ubuntu 18.04 image. It accepts an image from the client application running on a device connected to the carrier’s 5G mobile network, which it then forwards to the inference server. The inference server returns the detected object along with coordinates for that object (or an error if it can’t detect any objects). The API server adds a text label and bounding boxes to the image and returns it to the mobile client.

The inference server runs Torchserve, an open source project that provides a flexible and easy way to serve up PyTorch models. Object detection is done using a Faster R-CNN model. It is deployed on a g4dn.2xlarge instance running the AWS Deep Learning Amazon Machine Image (AMI).

You use the web browser on your mobile device to access the web server, which hosts the client application written in React.

Wavelength is designed to provide access to services and applications that require low latency. It’s important to note that you don’t need to deploy your entire application in a Wavelength Zone. You only need to deploy parts of your application that benefit from being deployed in the Wavelength Zone – such as application components requiring low latency.

In the case of the demo application, the API and inference servers are located in the Wavelength Zone because one of the design goals of the application is low-latency processing of the inference requests.

On the other hand, because the web server is only serving a small single page React web app, it does not have the same latency requirements as the inference processing. For that reason, it’s hosted in the Region instead of the Wavelength Zone.

Prerequisites

To complete the walkthrough below, you need:

  • To be familiar working from the command line, including editing text files.
  • The AWS CLI installed on your local machine. Ensure it’s the latest version so it supports Wavelength.
  • An administrative account with sufficient permissions to create VPC resources (instances, subnets, etc).
  • In order to access resources in a Wavelength Zone, you need a mobile device on a carrier’s 5G mobile network in a city that has access to the Zone. The following tutorial is written to be deployed in the Boston Wavelength Zone, but you can adjust the environment variables for the Zone and Region to deploy it in another area.
  • An SSH key pair in the us-east-1 Region.
  • The commands below work on Mac and Linux machines. If you are on a Windows machine the easiest way to run through the tutorial is to spin up a Linux-based EC2 instance, install and configure the AWS CLI, and run the commands from the EC2 instance’s command line.

Create the VPC and associated resources

The first step in this tutorial is deploying the VPC, internet gateway, and carrier gateway.

Start by configuring some environment variables, and then deploying the resources.

  1. In order to get started, you need to first set some environment variables.
    Note: replace the value for KEY_NAME with the name of the key pair you wish to use.
    Note: these values are specific to the us-east-1 Region. If you wish to deploy into another region, you’ll need to modify them as appropriate. Check the documentation for more info.

    export REGION="us-east-1"
    export WL_ZONE="us-east-1-wl1-bos-wlz-1"
    export NBG="us-east-1-wl1-bos-wlz-1"
    export INFERENCE_IMAGE_ID="ami-029510cec6d69f121"
    export API_IMAGE_ID="ami-0ac80df6eff0e70b5"
    export BASTION_IMAGE_ID="ami-027b7646dafdbe9fa"
    export KEY_NAME=<your key name>
  2. Use the AWS CLI to create the VPC.
    export VPC_ID=$(aws ec2 --region $REGION \
    --output text \
    create-vpc \
    --cidr-block 10.0.0.0/16 \
    --query 'Vpc.VpcId') \
    && echo '\nVPC_ID='$VPC_ID
  3. Create an internet gateway and attach it to the VPC.
    export IGW_ID=$(aws ec2 --region $REGION \
    --output text \
    create-internet-gateway \
    --query 'InternetGateway.InternetGatewayId') \
    && echo '\nIGW_ID='$IGW_ID
    aws ec2 --region $REGION \
    attach-internet-gateway \
    --vpc-id $VPC_ID \
    --internet-gateway-id $IGW_ID
  4. Add the carrier gateway.
    export CAGW_ID=$(aws ec2 --region $REGION \
    --output text \
    create-carrier-gateway \
    --vpc-id $VPC_ID \
    --query 'CarrierGateway.CarrierGatewayId') \
    && echo '\nCAGW_ID='$CAGW_ID
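Before moving on, you can optionally confirm that the resources exist and that the environment variables captured their IDs. This is just a quick sanity check using describe calls (describe-carrier-gateways requires a CLI version recent enough to support Wavelength):

aws ec2 --region $REGION describe-vpcs --vpc-ids $VPC_ID
aws ec2 --region $REGION describe-internet-gateways --internet-gateway-ids $IGW_ID
aws ec2 --region $REGION describe-carrier-gateways --carrier-gateway-ids $CAGW_ID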

Deploy the security groups

In this section, you add three security groups:

  • Bastion SG allows SSH traffic from your local machine to the bastion host as well as HTTP web traffic from the Internet
  • API SG allows SSH traffic from the Bastion SG and opens up port 5000 to accept incoming API requests
  • Inference SG allows SSH traffic from the Bastion host and communications on ports 8080 and 8081 (the ports used by the inference server) from the API SG.

 

  1. Create the bastion security group and add the ingress SSH rule. Note: SSH access is only allowed from your current IP address. You can adjust this if needed by changing the --cidr parameter in the second command.
    export BASTION_SG_ID=$(aws ec2 --region $REGION \
    --output text \
    create-security-group \
    --group-name bastion-sg \
    --description "Security group for bastion host" \
    --vpc-id $VPC_ID \
    --query 'GroupId') \
    && echo '\nBASTION_SG_ID='$BASTION_SG_ID
    
    aws ec2 --region $REGION \
    authorize-security-group-ingress \
    --group-id $BASTION_SG_ID \
    --protocol tcp \
    --port 22 \
    --cidr $(curl https://checkip.amazonaws.com)/32
    
    aws ec2 --region $REGION \
    authorize-security-group-ingress \
    --group-id $BASTION_SG_ID \
    --protocol tcp \
    --port 80 \
    --cidr 0.0.0.0/0
    
  2. Create the API security group along with two ingress rules: one for SSH from the bastion security group and one opening up the port the API server communicates on (5000).
    export API_SG_ID=$(aws ec2 --region $REGION \
    --output text \
    create-security-group \
    --group-name api-sg \
    --description "Security group for API host" \
    --vpc-id $VPC_ID \
    --query 'GroupId') \
    && echo '\nAPI_SG_ID='$API_SG_ID
    
    aws ec2 --region $REGION \
    authorize-security-group-ingress \
    --group-id $API_SG_ID \
    --protocol tcp \
    --port 22 \
    --source-group $BASTION_SG_ID
    
    aws ec2 --region $REGION \
    authorize-security-group-ingress \
    --group-id $API_SG_ID \
    --protocol tcp \
    --port 5000 \
    --cidr 0.0.0.0/0
  3. Create the security group for the inference server along with three ingress rules: one for SSH from the bastion security group, and two opening the ports the inference server communicates on (8080 and 8081) to the API security group.
    export INFERENCE_SG_ID=$(aws ec2 --region $REGION \
    --output text \
    create-security-group \
    --group-name inference-sg \
    --description "Security group for inference host" \
    --vpc-id $VPC_ID \
    --query 'GroupId') \
    && echo '\nINFERENCE_SG_ID='$INFERENCE_SG_ID
    
    aws ec2 --region $REGION \
    authorize-security-group-ingress \
    --group-id $INFERENCE_SG_ID \
    --protocol tcp \
    --port 22 \
    --source-group $BASTION_SG_ID
    
    aws ec2 --region $REGION \
    authorize-security-group-ingress \
    --group-id $INFERENCE_SG_ID \
    --protocol tcp \
    --port 8080 \
    --source-group $API_SG_ID
    
    aws ec2 --region $REGION \
    authorize-security-group-ingress \
    --group-id $INFERENCE_SG_ID \
    --protocol tcp \
    --port 8081 \
    --source-group $API_SG_ID
    

Add the subnets and routing tables

In the following steps you’ll create two subnets along with their associated routing tables and routes.

  1. Create the subnet for the Wavelength Zone
    export WL_SUBNET_ID=$(aws ec2 --region $REGION \
    --output text \
    create-subnet \
    --cidr-block 10.0.0.0/24 \
    --availability-zone $WL_ZONE \
    --vpc-id $VPC_ID \
    --query 'Subnet.SubnetId') \
    && echo '\nWL_SUBNET_ID='$WL_SUBNET_ID
    
  2. Create the route table for the Wavelength subnet
    export WL_RT_ID=$(aws ec2 --region $REGION \
    --output text \
    create-route-table \
    --vpc-id $VPC_ID \
    --query 'RouteTable.RouteTableId') \
    && echo '\nWL_RT_ID='$WL_RT_ID
    
  3. Associate the route table with the Wavelength subnet, and add a route that sends traffic to the carrier gateway, which in turn routes it to the carrier’s mobile network.
    aws ec2 --region $REGION \
    associate-route-table \
    --route-table-id $WL_RT_ID \
    --subnet-id $WL_SUBNET_ID
    
    aws ec2 --region $REGION create-route \
    --route-table-id $WL_RT_ID \
    --destination-cidr-block 0.0.0.0/0 \
    --carrier-gateway-id $CAGW_ID
    

Next, repeat the same process to create the subnet and routing for the bastion subnet.

  1. Create the bastion subnet
    export BASTION_SUBNET_ID=$(aws ec2 --region $REGION \
    --output text \
    create-subnet \
    --cidr-block 10.0.1.0/24 \
    --vpc-id $VPC_ID \
    --query 'Subnet.SubnetId') \
    && echo '\nBASTION_SUBNET_ID='$BASTION_SUBNET_ID
    
  2. Deploy the bastion subnet route table and a route to direct traffic to the internet gateway
    export BASTION_RT_ID=$(aws ec2 --region $REGION \
    --output text \
    create-route-table \
    --vpc-id $VPC_ID \
    --query 'RouteTable.RouteTableId') \
    && echo '\nBASTION_RT_ID='$BASTION_RT_ID
    
    aws ec2 --region $REGION \
    create-route \
    --route-table-id $BASTION_RT_ID \
    --destination-cidr-block 0.0.0.0/0 \
    --gateway-id $IGW_ID
    
    aws ec2 --region $REGION \
    associate-route-table \
    --subnet-id $BASTION_SUBNET_ID \
    --route-table-id $BASTION_RT_ID
    
  3. Modify the bastion’s subnet to assign public IPs by default
    aws ec2 --region $REGION \
    modify-subnet-attribute \
    --subnet-id $BASTION_SUBNET_ID \
    --map-public-ip-on-launch

Create the Elastic IPs and networking interfaces

The final step before deploying the actual instances is to create two carrier IPs, which are IP addresses associated with the carrier network. These IP addresses are assigned to two Elastic Network Interfaces (ENIs), and the ENIs are assigned to the API and inference servers (the bastion host has its public IP assigned at creation by the bastion subnet).

  1. Create two carrier IPs, one for the API server and one for the inference server
    export INFERENCE_CIP_ALLOC_ID=$(aws ec2 --region $REGION \
    --output text \
    allocate-address \
    --domain vpc \
    --network-border-group $NBG \
    --query 'AllocationId') \
    && echo '\nINFERENCE_CIP_ALLOC_ID='$INFERENCE_CIP_ALLOC_ID
    
    export API_CIP_ALLOC_ID=$(aws ec2 --region $REGION \
    --output text \
    allocate-address \
    --domain vpc \
    --network-border-group $NBG \
    --query 'AllocationId') \
    && echo '\nAPI_CIP_ALLOC_ID='$API_CIP_ALLOC_ID
    
  2. Create two elastic network interfaces (ENIs)
    export INFERENCE_ENI_ID=$(aws ec2 --region $REGION \
    --output text \
    create-network-interface \
    --subnet-id $WL_SUBNET_ID \
    --groups $INFERENCE_SG_ID \
    --query 'NetworkInterface.NetworkInterfaceId') \
    && echo '\nINFERENCE_ENI_ID='$INFERENCE_ENI_ID
    
    export API_ENI_ID=$(aws ec2 --region $REGION \
    --output text \
    create-network-interface \
    --subnet-id $WL_SUBNET_ID \
    --groups $API_SG_ID \
    --query 'NetworkInterface.NetworkInterfaceId') \
    && echo '\nAPI_ENI_ID='$API_ENI_ID
    
  3. Associate the carrier IPs with the ENIs
    aws ec2 --region $REGION associate-address \
    --allocation-id $INFERENCE_CIP_ALLOC_ID \
    --network-interface-id $INFERENCE_ENI_ID   
    
    aws ec2 --region $REGION associate-address \
    --allocation-id $API_CIP_ALLOC_ID \
    --network-interface-id $API_ENI_ID
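You will need the carrier (public) IPs and the private IPs of these interfaces later, for example when configuring Torchserve and when pointing the client app at the API server. Rather than looking them up in the console, you can query them now; treat these queries, and the CarrierIp field in particular, as a sketch:

aws ec2 --region $REGION describe-addresses \
  --allocation-ids $API_CIP_ALLOC_ID $INFERENCE_CIP_ALLOC_ID \
  --query 'Addresses[].[AllocationId,CarrierIp]' --output table

aws ec2 --region $REGION describe-network-interfaces \
  --network-interface-ids $API_ENI_ID $INFERENCE_ENI_ID \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,PrivateIpAddress]' --output table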

Deploy the API and inference instances

With the VPC and underlying networking and security deployed, you can now move on to deploying your API and Inference instances. The API server is a t3.medium instance based on a standard Ubuntu 18.04 AMI. The Inference server is a g4dn.2xlarge running the AWS Deep Learning AMI. You install and configure the software components in subsequent steps.

 

  1. Deploy the API instance
    aws ec2 --region $REGION \
    run-instances \
    --instance-type t3.medium \
    --network-interface '[{"DeviceIndex":0,"NetworkInterfaceId":"'$API_ENI_ID'"}]' \
    --image-id $API_IMAGE_ID \
    --key-name $KEY_NAME
    
  2. Deploy the inference instance
    aws ec2 --region $REGION \
    run-instances \
    --instance-type g4dn.2xlarge \
    --network-interface '[{"DeviceIndex":0,"NetworkInterfaceId":"'$INFERENCE_ENI_ID'"}]' \
    --image-id $INFERENCE_IMAGE_ID \
    --key-name $KEY_NAME

Deploy the bastion / web server

You must deploy a bastion server in order to SSH into your application instances. Remember that the carrier gateway in a Wavelength Zone only allows ingress from the carrier’s 5G network. This means that in order to SSH into the API and inference servers you need to first SSH into the bastion host, and then from there SSH into your Wavelength instances.

You are also going to install the client front end application onto the bastion host. You can use the webserver to test the application if you don’t want to install the React Native version of the application onto a mobile device. Remember that even though you’re not using the native application, the website must still be accessed from a device on the carrier’s 5G network.

  1. Issue the command below to create your bastion host
    aws ec2 --region $REGION run-instances \
    --instance-type t3.medium \
    --associate-public-ip-address \
    --subnet-id $BASTION_SUBNET_ID \
    --image-id $BASTION_IMAGE_ID \
    --security-group-ids $BASTION_SG_ID \
    --key-name $KEY_NAME
    

Note: It takes a few minutes for your instances to be ready. Even when the status check in the EC2 console reads 2/2 checks passed, it may still be a few minutes before the instance finishes installing additional software packages and configuring itself. If you receive a lock error while running apt-get, wait several minutes and try again.

 

Configure the bastion host / web server

The last server you deployed serves two purposes. It acts as the bastion host allowing you to SSH into your other two servers, and it serves the client web app. In this section you’ll install that web app.

  1. SSH into the bastion host (the user name is bitnami). Note: In order to be able to easily SSH from the bastion host to the inference server, you should use the -A (agent forwarding) parameter when starting your SSH session, e.g.:
    ssh -i /path/to/key.pem -A bitnami@<bastion ip address>
  2. Clone the GitHub repo with the React code
    git clone https://github.com/mikegcoleman/react-wavelength-inference-demo.git
  3. Install the dependencies
    cd react-wavelength-inference-demo && npm install
  4. Build the webpage
    npm run build
  5. Copy the page into the web server's root directory
    cp -r ./build/* /home/bitnami/htdocs
  6. Test that the web app is running correctly by navigating to the public IP address of your bastion instance

 

Configure the inference server

In this section you deploy a Torchserve server running on EC2. Torchserve is configured with the fasterrcnn model. It receives the image from the API server, runs the inference, and returns the labels and bounding boxes for the items found in the image.

I’m not going to spend time going into the inner workings of Torchserve in this post. However, if you’re interested in learning more, check out my colleague Shashank’s blog.

  1. SSH into the bastion host and then SSH into the inference server instance. Note: In order to be able to easily SSH from the bastion host to the inference server, you will want to use the -A (agent forwarding) parameter when starting your SSH session with the bastion host, e.g.:
    ssh -i /path/to/key.pem -A bitnami@<bastion public ip>

    To SSH from the bastion host to the inference server you do not need the -i or -A parameters e.g.:

    ssh ubuntu@<inference server private ip>
  2. Update the packages on the server and install the necessary prerequisite packages.
    sudo apt-get update -y \
    && sudo apt-get install -y virtualenv openjdk-11-jdk gcc python3-dev
  3. Create a virtual environment.
    mkdir inference && cd inference
    virtualenv --python=python3 inference
    source inference/bin/activate
  4. Install Torchserve and its related components
    pip3 install \
    torch torchtext torchvision sentencepiece psutil \
    future wheel requests torchserve torch-model-archiver
  5. Install the inference model that the application will use.
    mkdir torchserve-examples && cd torchserve-examples
    
    git clone https://github.com/pytorch/serve.git
    
    mkdir model_store
    
    wget https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
    
    torch-model-archiver --model-name fasterrcnn --version 1.0 \
    --model-file serve/examples/object_detector/fast-rcnn/model.py \
    --serialized-file fasterrcnn_resnet50_fpn_coco-258fb6c6.pth \
    --handler object_detector \
    --extra-files serve/examples/object_detector/index_to_name.json
    
    mv fasterrcnn.mar model_store/
  6. Create a configuration file for Torchserve (config.properties) and configure Torchserve to listen on your instance's private IP. Be sure to substitute the private IP of your instance below; you can find it in the EC2 console, or from instance metadata as sketched below. The contents of config.properties should look as follows:
    inference_address=http://<your instance private IP>:8080
    management_address=http://<your instance private IP>:8081

    For example:

    inference_address=http://10.0.0.253:8080
    management_address=http://10.0.0.253:8081
  7. Start the Torchserve server.
    torchserve --start --model-store model_store --models fasterrcnn=fasterrcnn.mar --ts-config config.properties

    It takes a few seconds for the server to start up. When it's ready, you should see a line that ends with:

    State change WORKER_STARTED -> WORKER_MODEL_LOADED

Leave this SSH session running so you can watch the inference server’s logs to see when it receives requests from the API server.
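If you'd rather not copy the private IP from the console in step 6, you can read it from instance metadata on the inference server and write config.properties in one step. This is a minimal sketch:

PRIVATE_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

cat > config.properties << EOF
inference_address=http://${PRIVATE_IP}:8080
management_address=http://${PRIVATE_IP}:8081
EOF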

 

Configure the API server

In this section, you deploy the Flask-based API server.

  1. SSH into the bastion host and then SSH into the API server instance. Note: In order to be able to easily SSH from the bastion host to the API server, you should use the -A (agent forwarding) parameter when starting your SSH session with the bastion host, e.g.:
    ssh -i /path/to/key.pem -A bitnami@<bastion public ip>

    To SSH from the bastion host to the API server you do not need the -i or -A parameters e.g.:

    ssh ubuntu@<api server private ip>
  2. Test your inference server (being sure to substitute the INTERNAL IP of the inference instance in the second line below):
    curl -O https://s3.amazonaws.com/model-server/inputs/kitten.jpg
    
    curl -X POST \
    http://<your_inf_server_internal_IP>:8080/predictions/fasterrcnn \
    -T kitten.jpg

    You should see something similar to

    [
      {
        "cat": "[(228.7825, 82.63463), (583.77545, 677.3058)]"
      },
      {
        "car": "[(124.427414, 69.34327), (270.15457, 205.53458)]"
      }
    ]
    

    The inference server returns the labels of the objects it detected, and the corner coordinates of boxes that surround those objects.

    Now that you have verified the API server can connect to the inference server, you can configure the API server.

  3. Run the following command to update system package information and install necessary prerequisites.
    sudo apt-get update -y \
    && sudo apt-get install -y \
    libsm6 libxrender1 libfontconfig1 virtualenv
  4. Clone the Python code into the application directory
    mkdir apiserver && cd apiserver
    git clone https://github.com/mikegcoleman/flask_wavelength_api .
  5. Create and activate a virtual environment.
    virtualenv --python=python3 apiserver
    source apiserver/bin/activate
  6. Install necessary Python packages.
    pip3 install opencv-python flask pillow requests flask-cors
  7. Create a configuration file (config_values.txt) with the following line (substituting the INTERNAL IP of your inference server).
    http://<your_inf_server_internal_IP>:8080/predictions/fasterrcnn
  8. Start the application.
    python api.py

    You should see output similar to the following:

    * Serving Flask app "api" (lazy loading)
    * Environment: production
    WARNING: This is a development server. Do not use it in a production
    deployment.
    Use a production WSGI server instead
    * Debug mode: on
    * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
    * Restarting with stat
    * Debugger is active!
    * Debugger PIN: 311-750-351

 

Test the client application

To test the application, you need to have a device on the carrier’s 5G network. From your device’s web browser, navigate to the bastion / web server’s public IP address. In the text box at the top of the app, enter the public IP of your API server.

Next, choose an existing photo from your camera roll, or take a photo with the camera and press the process object button underneath the preview photo (you may need to scroll down).

The client sends the image to the API server, which forwards it to the inference server for detection. The API server then receives the prediction back from the inference server, adds a label and bounding boxes, and returns the marked-up image to the client, where it is displayed.

example screenshot of the image

If the inference server cannot detect any objects in the image, you will receive a message indicating the prediction failed.

Conclusion and next steps

In this blog I covered some of the architectural considerations when deploying applications into Wavelength Zones. You then deployed a sample application designed to give you an idea of how you might architect an inference-at-the-edge solution. I hope this has inspired you to go off and build something new to take advantage of the exciting capabilities that Wavelength and 5G enable. Visit https://aws.amazon.com/wavelength/ to request access and check out documentation and other resources.

 

 

Building a Graylog server to run on an Amazon Lightsail instance

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/building-a-graylog-server-to-run-on-an-amazon-lightsail-instance/

This post is part of a collection by the Amazon Lightsail team to highlight how builders are using Lightsail to get started on AWS building interesting solutions. If you’re interested in contributing a post on how you’re using Lightsail, reach out to us at [email protected]! This post is guest contributed by Amazon Lightsail customer, Richard Gate.

This post reviews how to build a Graylog server on Amazon Lightsail, the easiest way to get started on AWS. Graylog is an open source log management system that allows textual logging data created by network devices, applications, and servers to be centrally stored, searched, and reported on.

This blog is relevant to those working from home with various pieces of network equipment and a need to centralize log data for these devices. My personal networking equipment includes a pfSense gateway managing a couple of broadband lines, routers, and Wi-Fi access points. With Graylog, you can centralize the log data collection for these devices and automate looking for issues raised by them in their log messages.

In this post, I walk you through how I built a Graylog server on a Lightsail instance running Ubuntu 18.04 LTS with the pre-requisite packages, mainly Elasticsearch and MongoDB. This server receives log messages from my pfSense server, routers, and access points. It also takes into account that the devices are inside a private network, NATing out to the internet, but must be uniquely identified in Graylog.

Network design

The following diagram shows where the various parts of the network fit and provides details of the TCP and UDP ports involved at different points in the network. You can see the internal Wi-Fi AP and router behind the pfSense server with its own firewall, outbound NAT (Network Address Translation), and outbound load balancing (over two broadband lines, not shown). Traffic flows over the internet to the Lightsail edge firewall and on into the Lightsail instance running Graylog and the Elasticsearch and MongoDB services.

The following image is a simple diagram of the network.

architecture diagram

Network access to the Ubuntu instance is restricted by the Lightsail firewall, which allows TCP/UDP ports (and PING) to be allowed or blocked. The open ports are TCP:22 (SSH), UDP:51400 (syslog from pfSense), UDP:51401 (syslog from the Wi-Fi AP), and UDP:51402 (syslog from the router). These separate UDP ports are used so that Graylog can have a listener on each of the separate ports and can tag a source on them for the individual devices. This is needed because the source IP is one of the two IPs of the two broadband lines that pfSense routes traffic through (outbound load balancing). The pfSense and other devices are configured to use the public IP of the Ubuntu Lightsail instance as their remote syslog server with the relevant destination UDP port. Recent changes to the Lightsail firewall now allow the source IP address of inbound traffic to be used to limit where the syslog data comes from. This is useful to prevent the whole internet from trying to send syslog data to the Graylog server.

Lightsail instance setup

Now that you have an idea of the network architecture, I can walk through how to set up Graylog on Amazon Lightsail.

The following section details the setup and configuration of the Lightsail instance to be used to run Graylog under the Ubuntu operating system (OS). This gets the instance ready to connect to and to start the process of installing Graylog.

The Lightsail Ubuntu 18.04 LTS instance is a 4-GB RAM instance, based on the minimum server specification given in the Graylog installation guide.

  1. From the Lightsail console, click Create instance.
  2. From Select a platform, choose Linux/Unix.
  3. From Select a blueprint, choose OS Only and then Ubuntu 18.04 LTS.

instance platform and blueprint

  4. From Choose your instance plan, choose the $20 bundle, with 4 GB, 2 vCPUs and 80 GB SSD.
  5. In Identify your instance, enter a unique name for your instance.

instance pricing plans

  6. Then click Create instance.

You are then taken back to the main Lightsail home page with your new instance showing grayed out and in a state of “Pending” until it has been created. Once it is running, the state changes to “Running.”

pending instance

instance running

  1. Click on the three dots at the top right of the new instance’s panel and select Manage.
  2. Then select Networking.
  3. Click Attach static IP in the “IP addresses” box.

create a static ip address to your instance

  4. If you already have a static IP available, select it from the dropdown list and click the green tick icon to the right of the "Select static IP" dropdown list.
  5. If not, click Create static IP, select your new instance, give the IP a unique name, and click Create.
  6. Under the firewall, remove (click) the TCP:80 rule.
    As a best practice, you should restrict any incoming traffic to your Graylog server to the specific IP address (or addresses) that need to access your instance.
  7. Click the SSH (TCP:22) rule and click the edit icon, then check the Restrict to IP address box, enter the IP address of the system you will use to SSH into the instance in the Source IP address box, and click Save.
  8. Click on Add rule, set Application as Custom, Protocol as TCP and Range as 9000 (this is later used for web access to Graylog), specify the IP you will use to access the system as you did in the previous step, and click Create.
  9. Click on Add rule, use Application as Custom, Protocol as UDP and Range as 51400-51402 (one port for each of the devices sending syslog data), specify the IP you will use to access the system as you did in the previous step, and click Create.

add firewall rule

The static IP address used preceding should be assigned to a DNS name ("A" record) on your domain's DNS server. The exact mechanism for doing so depends on where and how your DNS is hosted. This forms the Fully Qualified Domain Name (FQDN) used to connect to the Lightsail instance. However, you can also use the public IP address to connect via SSH, to access the Graylog web interface, and for devices to send logging data.

Access the Lightsail instance to configure and install the software.

Having set up the Lightsail instance, the next step is to connect to the Ubuntu operating system to be able to run commands to configure Ubuntu and install Graylog. The remote command-line connection utility “SSH” is used. This secure (encrypted) connection method requires the security to be set up before use.

The Lightsail browser-based SSH client can also be used to connect and enter the command to install and configure the system without the need to manage the SSH authentication key file. However, I prefer to use a standalone SSH client for two main reasons. Firstly, I have a number of servers in different hosting environments and I prefer to use the same method to connect to them all. Secondly, I automate the installation and configuration using ansible, which connects via SSH and needs access to the authentication key file.

An SSH connection is used to enter commands into the Lightsail instance. Lightsail protects SSH connections using an authentication key (pem). The preceding procedure assumes you are using the default pem for SSH connections to the new Lightsail instance. The pem must be downloaded and saved for SSH use.

  1. From the Lightsail console, click Account, and select Account from the menu.

search in lightsail console

  2. Click SSH keys and Download to the right of the "Default" key.

manage ssh keys in console

  3. Download the pem file as "aws.pem" for later use by SSH.
  4. On UNIX systems, from the command line run chmod 0600 aws.pem.

Test the SSH connection to the Lightsail instance. From the directory where you saved the "aws.pem" file, use the command "ssh -l ubuntu -i aws.pem <FQDN>", where "<FQDN>" is the Fully Qualified Domain Name of the Lightsail instance. Your SSH client may ask for the initial connection to be confirmed, or may reject it if the name or IP of the Lightsail instance already exists in the local SSH ".ssh/known_hosts" file; if so, edit the file and delete the record.

Configuring Ubuntu from the Command Line (SSH)

Now that you created the Lightsail instance, you are ready to connect to your instance using your SSH client of choice. After you connect, there is a small amount of Ubuntu operating system configuration required to make certain the software that is pre-installed on the Lightsail instance is up to date, to set the hostname/timezone and create a swap file (which allows more memory to be used than actually exists by swapping out unused parts until needed again).

Update the operating system to the latest level and reboot:

apt -y update

apt -y upgrade

reboot

Set the hostname (e.g. mygraylog):

hostname mygraylog

Edit “/etc/hosts” and add the new host name to the “127.0.0.1” record

127.0.0.1 localhost mygraylog

Set your local timezone (mine is “Europe/London”):

timedatectl set-timezone Europe/London

Create a swap file, activate, and make available at boot time:

dd if=/dev/zero of=/swap count=8192 bs=1MiB

chmod 600 /swap

mkswap /swap

swapon /swap

Edit "/etc/fstab" and add the following at the end of the file:

/swap swap swap 0 0
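To confirm the swap file is active (after running swapon, or after a reboot), you can check with:

swapon --show

free -h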

Install Graylog and pre-requisites from the Command Line (SSH)

Finally, Graylog itself (and pre-requisite software packages that Graylog uses) can be installed.

Generate secrets to be used by Graylog:

This is required to create an encrypted version of the Graylog login password.

apt -y install pwgen

Save the string created by the next command to be used as <secret> later

pwgen -N 1 -s 96

Save the string created by the next command to be used as <password-sha2> later

<yourpassword> will be the password for the user “admin” for the Graylog web interface

echo -n "<yourpassword>" | sha256sum

The quotes around <yourpassword> are needed.

Install pre-requisite software packages:

These packages are required for the Graylog server to operate.

apt -y install apt-transport-https openjdk-8-jre-headless

apt -y install uuid-runtime curl dirmngr

Set up install for Elasticsearch:

Elasticsearch is used by Graylog to store all the received messages and for searching the stored messages in a flexible way. First, the location to install Elasticsearch from must be configured.

(the following is a single-line command)

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

(the following is a single-line command)

echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list

apt -y update

Install Elasticsearch, enable it to start at boot and start it:

apt -y install elasticsearch

Edit “/etc/elasticsearch/elasticsearch.yml” and change cluster.name: my-application to cluster.name: graylog

systemctl enable elasticsearch

systemctl start elasticsearch
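Before moving on, you can confirm that Elasticsearch is running and has picked up the new cluster name. This assumes the default binding on localhost port 9200:

curl -s "http://localhost:9200/_cluster/health?pretty"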

Set up install for MongoDB:

MongoDB is used by Graylog to store its configuration. First, the location to install MongoDB from must be configured.

(the following is a single-line command)

wget -qO - https://www.mongodb.org/static/pgp/server-4.0.asc | sudo apt-key add -

(the following is a single-line command)

echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list

apt -y update

Install MongoDB, enable it to start at boot and start it:

apt -y install mongodb-org

systemctl enable mongod

systemctl start mongod

Set up install for Graylog:

(the following is a single-line command)

wget https://packages.graylog2.org/repo/packages/graylog-3.2-repository_latest.deb

(the following is a single-line command)

dpkg -i graylog-3.2-repository_latest.deb

apt -y update

Install Graylog:

apt -y install graylog-server

Update the Graylog configuration:

Before starting the Graylog server, a few file updates are required for the network and security environment in which it runs.

Edit “/etc/graylog/server/server.conf” and make the following changes

  • Change "password_secret =" to "password_secret = <secret>" (see preceding)
  • Change "root_password_sha2 =" to "root_password_sha2 = <password-sha2>" (see preceding)
  • Change “elasticsearch_shards = 4” to “elasticsearch_shards = 1”
  • Change “http_bind_address = 127.0.0.1:9000” to “http_bind_address = 0.0.0.0:9000”
  • Change “http_publish_uri = …” to “http_publish_uri = http://<FQDN>:9000” (see preceding)
  • Uncomment “#root_email = ….” and enter your email address
  • Uncomment "#root_timezone = ...." and change it to "root_timezone = UTC"

Edit "/etc/default/graylog-server" and make the following change.

  • Add “-Djava.net.preferIPv4Stack=true” at the start of the “GRAYLOG_SERVER_JAVA_OPTS”

Enable Graylog to start at boot and start it:

systemctl enable graylog-server

systemctl start graylog-server

Connect and log in to Graylog

The Graylog server is now ready to be connected to via its web interface so that the final configuration can be completed.

Assuming all the preceding ran without error, you can now log in to Graylog via:

http://<FQDN>:9000

<FQDN> is the Fully Qualified Domain Name of your Lightsail instance. Log on as the user "admin" with the password that you used to generate the <password-sha2> preceding.

enter username and password in graylog

Graylog basic configuration.

Assuming that the devices that send their syslog records to Graylog have been configured to forward to <FQDN>:51400 (51401 and 51402), Graylog listeners must be set up to receive the syslog records. Repeat the following for each of the ports:

  • From the top menu bar, go to System then Inputs.
  • From the Select input dropdown list, select Syslog UDP.
  • Click Launch new input.

syslog udp

  • On the Launch new input pop-up, tick Global, fill in the Title, Port, Override source (the source name that shows on messages received via this Listener) and click Save.

syslog udp input

Having completed the creation and configuration of a Lightsail instance, configured Ubuntu, installed the Graylog server and additional services, and done a small amount of Graylog configuration, you start to see messages from the devices appearing in Graylog. Additional devices can be added, and the numerous other features of Graylog can be tried out.

Graylog provides an excellent way of bringing all the logging data from various devices into one central management server, allowing you to see the effects of issues within a network in a single timeline and making problem determination a much simpler process.

Author

Richard Gate, CommuniG8 Ltd

Email: [email protected]

Twitter: @communig8

Improving website performance with Lightsail Content Delivery Network

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/improving-website-performance-with-lightsail-content-delivery-network/

This post was written by Mike Coleman, Senior Developer Advocate 

Amazon Lightsail recently announced the release of Lightsail Content Delivery Network (CDN). With this launch customers can now distribute their content more securely to users across the globe. Content is served from the edge location closest to the end user which improves performance while reducing server load.

In this blog, I walk through exactly how to configure a Lightsail distribution to work with both a standard web server and WordPress. I take advantage of the fact that Lightsail CDN offers pre-configured settings optimized for WordPress. I cover creating a new distribution, verifying that the distribution is working correctly, and how to use a custom domain name with Lightsail CDN.

What is a CDN?

A CDN is a globally distributed set of network endpoints that cache your website’s content so it’s closer to your end users. When a user requests content from your site, that request is first routed to one of the CDN endpoints, if the content is available in the cache then it is served from that location. If it’s not available in the cache, then it is retrieved from your web server and presented to the requestor. Additionally, the content is placed in the cache so subsequent requests from that part of the world can be served from the cache without having to make a call to the web server.

Using Lightsail CDN with your websites offer a variety of benefits:

  1. End users access your web content from the closest Lightsail CDN edge location, which greatly reduces response times.
  2. Serving content from the endpoint cache reduces the load on your actual web server since your server won’t need to service as many requests directly.
  3. Lightsail distributions make it easy to deliver content over Hypertext Transfer Protocol Secure (HTTPS) by providing SSL certificates and TLS support.

I am particularly excited about the third point. Before the release of Lightsail distributions, applying an SSL certificate to a standalone website required several manual steps. With Lightsail CDN, you can secure your web traffic with a few clicks.

One final point, Lightsail CDN is designed to cache what’s often called “static content.” Static content is content that is the same regardless of who requests it, or, stated another way, the content is not rendered on a per-user basis. This could include non-dynamic webpages, but also things like CSS stylesheets, images and videos, in addition to files containing JavaScript code.

The rest of this post covers how to set up a Lightsail distribution with either a typical web server or WordPress. Additionally, I talk about how to encrypt the traffic going from your users to your endpoint.

 

Prerequisites

You should have either a standard webserver (for example, Apache or NGINX) or a WordPress server running in your Amazon Lightsail account. Your server should also have a static IP address. Our documentation has you covered if you need some help getting a server deployed.

In order to use WordPress with Lightsail CDN, you’ll need to edit a configuration file from the Linux command line. You should be familiar with both how to SSH into your Lightsail instance in addition to how to use a Linux text editor such as Vim.

Configuring a custom domain requires the ability to manage the DNS for your domain. The DNS does not need to be managed by Lightsail or AWS, but you do need to have the ability to add domain records.

 

Creating the Lightsail distribution

The actual resource that you deploy to manage your web traffic is called a "distribution," and its origin can be either a Lightsail instance running a web server, a Lightsail instance running WordPress, or a Lightsail load balancer. This blog covers the web server and WordPress use cases.

  1. From the Lightsail console, choose Networking.
  2. Click Create distribution.
  3. Under Select your origin choose the server you previously created. Notice that your server is automatically listed in the dropdown.
    lightsail console: select your origin
  4. If your instance does not have a static IP attached to it already, you will need to either assign an existing static IP or create a new one.
    lightsail console: assign an existing static ip
    Note: If you're configuring the Lightsail distribution to work with a WordPress server, you will be prompted to confirm that you wish to use the WordPress preset. By providing smart presets for WordPress instances, Lightsail CDN reduces the time and complexity usually associated with creating a traditional CDN distribution. Click Yes, apply.
  5. Leave caching behavior set to the default (either Best for static content for a typical web server or Best for WordPress if you are using a WordPress server). This setting controls which directories are cached on your distribution's endpoints.
  6. Leave the rest of the settings at their defaults and click Create Distribution.

It takes several minutes for your distribution to become ready.
distribution status updating settings

 

Additional steps for WordPress

In this section, you edit your WordPress configuration file (wp-config.php) to allow HTTPS connections to your server.

  1. SSH into your WordPress server.
  2. Create a backup of your wp-config.php
    sudo cp /opt/bitnami/apps/wordpress/htdocs/wp-config.php /opt/bitnami/apps/wordpress/htdocs/wp-config.php.backup
  3. Open wp-config.php in your text editor of choice.
    sudo vi /opt/bitnami/apps/wordpress/htdocs/wp-config.php
  4. Delete the following two lines.
    define('WP_SITEURL', 'http://' . $_SERVER['HTTP_HOST'] . '/');
    define('WP_HOME', 'http://' . $_SERVER['HTTP_HOST'] . '/');
  5. Copy and paste the following into your wp-config.php
    define('WP_SITEURL', 'https://' . $_SERVER['HTTP_HOST'] . '/');
    define('WP_HOME', 'https://' . $_SERVER['HTTP_HOST'] . '/');
    if (isset($_SERVER['HTTP_CLOUDFRONT_FORWARDED_PROTO'])
        && $_SERVER['HTTP_CLOUDFRONT_FORWARDED_PROTO'] === 'https') {
        $_SERVER['HTTPS'] = 'on';
    }
  6. Save the file.
  7. Restart the Apache web server.

sudo /opt/bitnami/ctlscript.sh restart apache

After the server restarts, you can test to ensure that the Lightsail distribution is configured correctly.

 

Testing your distribution

Behind the scenes, Lightsail distributions use Amazon CloudFront. Any static content from your site will be served up by the CloudFront network of edge locations. You can verify this behavior with your browser’s developer tools. In the following steps, I use Google Chrome, but the steps are similar for other browsers.

  1. In your web browser, navigate to the URL of the distribution you just created. You can find the URL at the top of the details page for your distribution.                                                                                                                                      distribution default domain
  2. Open the developer tools console by clicking on the three-dot menu at the end of address bar and choosing More tools and then Developer tools.
    developer tools
  3. Click the Sources tab and notice that cloudfront.net is listed as the source for the web site content. This shows you that your website traffic is now being served via the Lightsail distribution.
    sources in cloudfront.net
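You can also check from the command line. CloudFront normally adds Via and X-Cache response headers, so a quick header inspection (substituting your distribution's default domain) looks something like this:

curl -sI https://<your distribution domain>/ | grep -iE "x-cache|via"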

(Optional) Adding a custom domain

At this point, your website is accessed via a randomly generated URL (for example,  d3b09eq0j1fbdq.cloudfront.net). In a production deployment, you’d want to use your own registered domain name (for example, www.example.com). In this next section, you configure Lightsail distribution to work with a custom domain by creating an SSL certificate for your domain, and a DNS CNAME record that maps your domain to the distribution URL.

As mentioned previously, your DNS does not need to be managed by Lightsail to perform the steps, but you do need to have the ability to create records for the domain on whichever provider you’re currently using.

 

  1. Select Domains and HTTPS from your distribution’s menu.
  2. Click +Create certificate
  3. Under Primary domain enter the fully qualified domain name (FQDN) you want to use for your server, and click Create.
    creating a certificate in ls console
  4. You’ll be prompted to create a DNS CNAME record to validate that you own the requested domain. Use the values in the dialog below to populate the record. If you need more assistance with this step, check out the documentation. Note that the text is truncated on the page, but the entire string will be copied if you highlight the fragment.
    certificate validation pending
  5. It can take several minutes for the domain validation to occur. Once the validation has finalized, the certificate status changes to Valid, not in use. Click the Custom domains are disabled slider to activate the new certificate.

disable custom domains

Wait several minutes until the distribution status is Enabled before moving to the final step.

status is enabled

The last step is to create a CNAME record that maps your domain name to the URL for the distribution. If you're using Lightsail to manage your DNS, follow the steps below. If your domain name is managed by a third party, consult their documentation.

  1. From the Lightsail home page click Networking on the horizontal menu.
  2. Click on the name of the DNS zone you wish to use.
  3. Click + Add record.
  4. Enter the subdomain you want to use (e.g. www or @ for an apex record). Click in the Resolves to text box, and notice that Lightsail automatically populates the name of your distribution. Click on your distribution name.
    lightsail console: DNS screenshot
  5. Click the green check mark to save your DNS record.

 

At this point, you should be able to access your domain by entering your FQDN into your browser.
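If the page doesn't load right away, DNS propagation may still be in progress. You can verify the CNAME record from a terminal with a lookup such as the following (the domain name is an example):

dig +short CNAME www.example.com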

Conclusion

So that’s all there is to accelerating and securing the delivery of your website content with Lightsail Content Delivery Network. If you’ve already got a web server running on Lightsail, why not take advantage of the one-year free tier and configure it to work with a Lightsail distribution? If you need more information on Lightsail distributions, be sure to check out the documentation.

Must-know best practices for Amazon EBS encryption

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/must-know-best-practices-for-amazon-ebs-encryption/

This blog post covers common encryption workflows on Amazon EBS. Examples of these workflows are: setting up permissions policies, creating encrypted EBS volumes, running Amazon EC2 instances, taking snapshots, and sharing your encrypted data using customer-managed CMK.

Introduction

Amazon Elastic Block Store (Amazon EBS) service provides high-performance block-level storage volumes for Amazon EC2 instances. Customers have been using Amazon EBS for over a decade to support a broad range of applications including relational and non-relational databases, containerized applications, big data analytics engines, and many more. For Amazon EBS, security is always our top priority. One of the most powerful mechanisms we provide you to secure your data against unauthorized access is encryption.

Amazon EBS offers a straightforward encryption solution for data at rest, data in transit, and all volume backups. Amazon EBS encryption is supported by all volume types, and includes built-in key management infrastructure without requiring you to build, maintain, and secure your own keys. We use AWS Key Management Service (AWS KMS) envelope encryption with customer master keys (CMK) for your encrypted volumes and snapshots. We also offer an easy way to ensure all your newly created Amazon EBS resources are always encrypted by simply selecting encryption by default. This means you no longer need to write IAM policies to require the use of encrypted volumes. All your new Amazon EBS volumes are automatically encrypted at creation.

You can choose from two types of CMKs: AWS managed and customer managed. AWS managed CMK is the default on Amazon EBS (unless you explicitly override it), and does not require you to create a key or manage any policies related to the key. Any user with EC2 permission in your account is able to encrypt/decrypt EBS resources encrypted with that key. If your compliance and security goals require more granular control over who can access your encrypted data, customer-managed CMK is the way to go.

In the following section, I dive into some best practices with your customer-managed CMK to accomplish your encryption workflows.

Defining permissions policies

To get started with encryption using your own customer-managed CMK, you first need to create the CMK and set up the policies needed. For simplicity, I use a fictitious account ID 111111111111 and an AWS KMS customer master key (CMK) named with the alias cmk1 in Region us-east-1.
As you go through this post, be sure to change the account ID and the AWS KMS CMK to match your own.

  1. Log on to the AWS Management Console with an admin user. Navigate to the AWS KMS service, and create a new KMS key in the desired Region.

kms console screenshot

  2. Go to the AWS Identity and Access Management (IAM) console and navigate to the Policies console. In the Create policy wizard, click on the JSON tab, and add the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "kms:GenerateDataKeyWithoutPlaintext",
                "kms:ReEncrypt*",
                "kms:CreateGrant"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<111111111111>:key/<key-id of cmk1>"
            ]
        }
    ]
}
  3. Go to IAM Users, click on Add permissions and Attach existing policies directly. Select the policy you created preceding along with the AmazonEC2FullAccess policy.

You now have all the necessary policies to start encrypting data with you own CMK on Amazon EBS.

Enabling encryption by default

Encryption by default allows you to ensure that all new EBS volumes created in your account are always encrypted, even if you don't specify the encrypted=true request parameter. You have the option to choose the default key to be AWS managed or a key that you create. If you use IAM policies that require the use of encrypted volumes, you can use this feature to avoid launch failures that would occur if unencrypted volumes were inadvertently referenced when an instance is launched. Before turning on encryption by default, make sure to go through some of the limitations in the considerations section at the end of this blog.

Use the following steps to opt in to encryption by default:

  1. Log on to the EC2 console in the AWS Management Console.
  2. Click on Settings, Amazon EBS encryption on the right side of the Dashboard console (note: settings are specific to individual AWS Regions in your account).
  3. Check the box Always Encrypt new EBS volumes.
  4. By default, AWS managed key is used for Amazon EBS encryption. Click on Change the default key and select your desired key. In this blog, the desired key is cmk1.
  5. You’re done! Any new volume created from now on will be encrypted with the KMS key selected in the previous step.
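The same settings can also be applied with the AWS CLI. The following is a sketch using the cmk1 alias from earlier; remember that these settings are per Region, so adjust the Region and key to match your setup:

aws ec2 enable-ebs-encryption-by-default --region us-east-1

aws ec2 modify-ebs-default-kms-key-id --region us-east-1 --kms-key-id alias/cmk1

aws ec2 get-ebs-encryption-by-default --region us-east-1

aws ec2 get-ebs-default-kms-key-id --region us-east-1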

Creating encrypted Amazon EBS volumes

To create an encrypted volume, simply go to Volumes under Amazon EBS in your EC2 console, and click Create Volume.

Then, select your preferred volume attributes and mark the encryption flag. Choose your designated master key (CMK) and voila, your volume is encrypted!

If you turned on encryption by default in the previous section, the encryption option is already selected and grayed out. Similarly, in the AWS CLI, your volume is always encrypted regardless of whether you set encrypted=true, and you can override the default encryption key by specifying a different one, as the following image shows:

encryption and master key
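The console flow above maps to a single CLI call. Here is a sketch that creates an encrypted gp2 volume with cmk1; the Availability Zone and size are examples:

aws ec2 create-volume \
  --region us-east-1 \
  --availability-zone us-east-1a \
  --size 8 \
  --volume-type gp2 \
  --encrypted \
  --kms-key-id alias/cmk1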

Launching instances with encrypted volumes

When launching an EC2 instance, you can easily specify encryption with your CMK even if the Amazon Machine Image (AMI) you selected is not encrypted.

Follow the steps in the Launch Wizard under EC2 console, and select your CMK in the Add Storage section. If you previously set encryption by default, you see your selected default key, which can be changed to any other key of your choice as the following image shows:

adding encrypted storage to instance
Alternatively, using the RunInstances API/CLI, you can provide the kmsKeyID for encrypting the volumes that are created from the AMI by specifying encryption in the block device mapping (BDM) object. If you don't specify the kmsKeyID in the BDM but set the encryption flag to "true", then your default encryption key is used for encrypting the volume. If you turned on encryption by default, any RunInstances call results in encrypted volumes, even if you haven't set the encryption flag to "true."
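As a sketch, an equivalent RunInstances call with a block device mapping looks like the following. The AMI ID, instance type, and root device name (/dev/xvda here) are placeholders; match them to the AMI you actually launch:

aws ec2 run-instances \
  --region us-east-1 \
  --image-id ami-EXAMPLE \
  --instance-type t3.micro \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"Encrypted":true,"KmsKeyId":"alias/cmk1"}}]'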

For more detailed information on launch encrypted EBS-backed EC2 instances see this blog.

Auto Scaling Groups and Spot Instances

When you specify a customer-managed CMK, you must give the appropriate service-linked role access to the CMK so that EC2 Auto Scaling / Spot Instances can launch instances on your behalf (AWSServiceRoleForEC2Spot / AWSServiceRoleForAutoScaling). To do this, you must modify the CMK’s key policy. For more information, click here.

Creating and sharing encrypted snapshots

Now that you've launched an instance and have some encrypted EBS volumes, you may want to create snapshots to back up the data on your volumes. Whenever you create a snapshot from an encrypted volume, the snapshot is always encrypted with the same key you provided for the volume. Other than the create-snapshot permission, users do not need any additional key policy settings for creating encrypted snapshots.

Sharing encrypted snapshots

If you want another account in your organization to create a volume from that snapshot (for use cases such as test/dev accounts, disaster recovery (DR), and so on), you can share that encrypted snapshot with different accounts. To do that, you need to create policy settings for the source (111111111111) and target (222222222222) accounts.

In the source account, complete the following steps:

  1. Select Snapshots in the EC2 console.
  2. Choose Actions, then Modify Permissions.
  3. Add the AWS account number of your target account.
  4. Go to the AWS KMS console and select the KMS key associated with your snapshot.
  5. In the Other AWS accounts section, choose Add other AWS accounts and add the target account (a CLI sketch of the snapshot-sharing step follows this list).
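If you prefer to script the snapshot-sharing step, the CLI equivalent of steps 2 and 3 is shown below; the snapshot ID is a placeholder, and you still need to grant the target account access to cmk1 in the key policy, as described in steps 4 and 5.

# run in the source account (111111111111)
aws ec2 modify-snapshot-attribute \
    --snapshot-id snap-0123456789abcdef0 \
    --attribute createVolumePermission \
    --operation-type add \
    --user-ids 222222222222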

Target account:
Users in the target account have several options with the shared snapshot. They can launch an instance directly or copy the snapshot to the target account. You can use the same CMK as in the original account (cmk1), or re-encrypt it with a different CMK.

I recommend that you re-encrypt the snapshot using a CMK owned by the target account. This protects you if the original CMK is compromised, or if the owner revokes permissions, which could cause you to lose access to any encrypted volumes that you created using the snapshot.
When re-encrypting with a different CMK (cmk2 in this example), you only need the ReEncryptFrom permission on cmk1 (the source key). Also, make sure you have the required permissions on cmk2 in your target account.

The following JSON policy documents show an example of these permissions, first for cmk1 in the source account and then for cmk2 in the target account:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kms:ReEncryptFrom"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<111111111111>:key/<key-id of cmk1>"
            ]
        }
    ]
}

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kms:GenerateDataKeyWithoutPlaintext",
                "kms:ReEncrypt*",
                "kms:CreateGrant"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<222222222222>:key/<key-id of cmk2>"
            ]
        }
    ]
}

You can now select snapshots at the EC2 console in the target account. Locate the snapshot by ID or description.

If you want to copy the snapshot, you must also allow the "kms:DescribeKey" action in the policy. Keep in mind that changing the encryption status of a snapshot during a copy operation results in a full (not incremental) copy, which might incur greater data transfer and storage charges.
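A copy that re-encrypts the shared snapshot under the target account's key can be sketched as follows; the snapshot ID is a placeholder, and cmk2 refers to the target account's key, as in the policy above.

# run in the target account (222222222222), in the destination Region
aws ec2 copy-snapshot \
    --source-region us-east-1 \
    --source-snapshot-id snap-0123456789abcdef0 \
    --encrypted \
    --kms-key-id arn:aws:kms:us-east-1:<222222222222>:key/<key-id of cmk2> \
    --description "Copy of shared snapshot, re-encrypted with cmk2"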

 

The same sharing capabilities can be applied to sharing AMIs. Check out this blog post for more information.

Considerations

  • A few old instance types don’t support Amazon EBS encryption. You won’t be able to launch new instances in the C1, M1, M2, or T1 families.
  • You won’t be able to share encrypted AMIs publicly, and any AMIs you share across accounts need access to your chosen KMS key.
  • You won’t be able to share snapshots / AMI if you encrypt with AWS managed CMK
  • Amazon EBS snapshots will encrypt with the key used by the volume itself.
  • The default encryption settings are per-region. As are the KMS keys.
  • Amazon EBS does not support asymmetric CMKs. For more information, see Using Symmetric and Asymmetric Keys

Conclusion

In this blog post, I discussed several best practices for using Amazon EBS encryption with your customer-managed CMK, which gives you more granular control to meet your compliance goals. I started with the policies needed, then covered how to create encrypted volumes, launch encrypted instances, create encrypted backups, and share encrypted data. Now that you are an encryption expert, go ahead and turn on encryption by default so that you'll have the peace of mind that your new volumes are always encrypted on Amazon EBS. To learn more, visit the Amazon EBS landing page.
If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon EC2 forum or contact AWS Support.

How to enable X11 forwarding from Red Hat Enterprise Linux (RHEL), Amazon Linux, SUSE Linux, Ubuntu server to support GUI-based installations from Amazon EC2

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-enable-x11-forwarding-from-red-hat-enterprise-linux-rhel-amazon-linux-suse-linux-ubuntu-server-to-support-gui-based-installations-from-amazon-ec2/

This post was written by Sivasamy Subramaniam, AWS Database Consultant.

In this post, I discuss enabling X11 forwarding from Red Hat Enterprise Linux (RHEL), Amazon Linux, SUSE Linux, and Ubuntu servers running on Amazon EC2. This is helpful for system and database administrators, and application teams that want to perform software installations on Amazon EC2 using a GUI. This blog provides detailed steps around SSH and X11 tools, various network and operating system (OS) level settings, and best practices to achieve X11 forwarding on Amazon EC2 when installing databases such as Oracle using a GUI.

There are several techniques to connect to Amazon EC2 instances to manage OS-level configurations. Typically, you use an SSH client (such as PuTTY or the OpenSSH client) to establish the connection from Windows OS-based bastion or jump servers to Amazon EC2 instances running a Linux-based OS. Most commonly, database administrators use a shared database management host, bastion host, or jump server to connect to database servers, instead of connecting to the database servers directly from their laptops. That way, they can install all the needed tools on one server to perform database administration or support activities. During application installation or configuration, you might need to install software such as an Oracle database or a third-party database using GUI methods. This blog covers the steps required to forward the X11 screen to your highly secure Windows OS-based bastion hosts. You can also consider NICE DCV as an alternative option for running GUI-based applications; refer to the prior link for more details and steps to enable NICE DCV.

Prerequisites

To complete this walkthrough, the following is required:

  • Ensure that you have a bastion host running on Amazon EC2 with a Windows OS for this blog. This host must have access to the EC2 instances running Linux distributions such as RHEL, Amazon Linux, SUSE Linux, and Ubuntu Server. If not, configure a bastion host running a Windows operating system with SSH access over port 22 to the EC2 instances running Linux-based operating systems. You can use any OS as a bastion host, as long as you have the corresponding client tools installed and X11 is supported by that OS.
  • I recommend having bastion hosts in the same Availability Zone or Region as the EC2 Linux hosts that you plan to connect to and forward X11 from. This avoids high latency in X11 forwarding during your application installations.
  • Install tools such as PuTTY and Xming on the Windows-based bastion host from which you want to SSH to the Linux EC2 host and perform X11 forwarding.
  • In order to securely configure or install PuTTY, refer to the section Configuring ssh-agent on Windows in the blog post Securely Connect to Linux Instances Running in a Private Amazon VPC.
  • You may need sudo permission to run X11 forwarding commands as a root user in order to complete the setup.

Solution

Connect to your EC2 instance using an SSH client, and perform the following setup as needed.

Step 1: Install required X11 packages

Install X11 packages with the following commands, based on your operating system release and version:

Installing the xclock or xterm packages is optional; they are installed in this post only to test X11 forwarding using the xclock or xterm commands.

 

Amazon Linux 2:

To install X11 related packages:

$ sudo yum install xorg-x11-xauth

To install X11 testing tools:

$ sudo yum install xclock xterm

 

Red Hat Enterprise Linux 8:

To install X11 related packages:

$ sudo yum install xorg-x11-xauth

To install X11 testing tools:

$ sudo yum install xterm

Note: The xorg-x11-apps package, which contains xclock, is provided in the CodeReady Linux Builder repository for RHEL 8. So, I skipped installing this package and used only xterm to test X11 forwarding.

 

SUSE Linux Enterprise Server 15 SP1:

To install X11 related packages:

$ sudo zypper install xauth

To install X11 testing tools:

$ sudo zypper install xclock

 

Ubuntu Server 18:

To install X11 related packages and tools:

$ sudo apt install x11-apps

Step 2: Configure X11 forwarding

To enable X11 forwarding, use the vi editor to set the "X11Forwarding" parameter to "yes" in the /etc/ssh/sshd_config file if it is either commented out or set to no.

$ sudo vi /etc/ssh/sshd_config

 

To verify the X11Forwarding parameter:

$ sudo cat /etc/ssh/sshd_config |grep -i X11Forwarding

You should see similar output as the following:

X11Forwarding yes

To restart ssh service if you changed the value in /etc/ssh/sshd_config:

 

Amazon Linux 2, RHEL 8 and SUSE Linux OS:

$ sudo service sshd restart

 

Ubuntu Servers:

$ sudo service ssh restart

 

Step 3: Configure PuTTY and Xming to perform X11 forwarding, then connect and verify X11 forwarding

Log in to your Windows bastion host. Then, open a fresh PuTTY session, and use private key or password-based authentication per your organization's setup. Finally, test the xclock or xterm command to see X11 forwarding in action.

  • Launch the Xming utility you installed on the Windows bastion host and leave it running.

click on xming icon

  • Select Session from the Category pane on the left. Set Host Name to your instance's private IP, Port to 22, and Connection type to SSH. Note that you use the private IP of the EC2 instance here because you connect from inside the VPC/network.

putty configuration details

  • Go to Connection, and click Data. Then, set Auto-login username to ec2-user, ubuntu (for Ubuntu OS), or whichever user you are allowed to log in as.
  • Go to Connection, select SSH, and then click Auth. Then, click Browse to select the private key generated earlier, if you are using key-based authentication.
  • Go to Connection, select SSH, and then click X11. Then, select Enable X11 forwarding.
  • Set X display location to localhost:0.0.

putty configuration screenshot 2

  • Go back to Session and click on Save after creating a session name in Saved session.

 

Now that you have set up PuTTY and Xming and configured the X11 settings, you can click the Load button and then the Open button. This opens a new SSH terminal with X11 forwarding enabled. Now, I move on to testing X11 forwarding.
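If your bastion host runs Linux or macOS instead of Windows, you don't need PuTTY or Xming; the OpenSSH client can request forwarding directly, provided a local X server is available (for example, XQuartz on macOS). A minimal sketch, assuming your key file path and the instance's private IP:

$ ssh -X -i /path/to/private-key.pem ec2-user@<instance-private-ip>

The -X flag enables X11 forwarding for that session, and the DISPLAY variable is set automatically on the remote side.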

Test X11 forwarding as the user you logged in with:

Example:

$ xauth list

$ export DISPLAY=localhost:10.0

$ xclock or xterm

You should see sample output and an xclock or xterm window open, similar to the following image. This means your X11 forwarding setup is working as expected, and you can start GUI-based application installation or configuration by running the installer or configuration tools.

sample output xclock

Step 4: Configure the EC2 Linux session to forward X11 if you are switching to a different user after login to run GUI-based installations or commands

In this example, ec2-user is the user logged in with SSH, who then switches to the oracle user.

From the logged-in user, identify the xauth details:

$ xauth list

$ env|grep DISPLAY

$ xauth list | grep unix`echo $DISPLAY | cut -c10-12` > /tmp/xauth

Switch to the user where you want to run GUI-based installation or tools:

$ sudo su - oracle

$ xauth add `cat /tmp/xauth`

$ xauth list

$ env|grep DISPLAY

$ export DISPLAY=localhost:10.0

$ xclock

You should see sample output and an xclock or xterm window open, similar to the following image. This means your X11 forwarding setup is working as expected even after switching to a different user. You can start using GUI-based applications, such as installers or configuration tools.

  sample output success

Conclusion

In this blog, I demonstrated how to configure Amazon EC2 instances running various Linux-based operating systems to forward X11 to a Windows OS-based bastion host. This is helpful for any application installation that requires GUI-based installation methods. It is also helpful for bastion hosts that provide highly secure, low-latency environments for SSH-related operations, including GUI-based installations, because it does not require any additional network configuration other than opening port 22 for standard SSH access. Please try this tutorial for yourself, and leave any comments below!

Using WebSockets and Load Balancers Part two

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/using-websockets-and-load-balancers-part-two/

This post was written by Robert Zhu, Principal Developer Advocate at AWS. 

This article continues a blog I posted earlier about using Load Balancers on Amazon Lightsail. In this article, I demonstrate a few common challenges and solutions when combining stateful applications with load balancers. I start with a simple WebSocket application in Amazon Lightsail that counts the number of seconds the client has been connected. Then, I add a Lightsail Load Balancer, and show you how the application performs routing and retries. Let’s get started.

WebSockets

WebSockets are persistent, duplex sockets that enable bi-directional communication between a client and server. Applications often use WebSockets to provide real-time functionality such as chat and gaming. Let’s start with some sample code for a simple WebSocket server:

const WebSocket = require("ws");
const name = require("./randomName");
const server = require("http").createServer();
const express = require("express");
const app = express();

console.log(`This server is named: ${name}`);

// serve files from the public directory
server.on("request", app.use(express.static("public")));

// tell the WebSocket server to use the same HTTP server
const wss = new WebSocket.Server({
  server,
});

wss.on("connection", function connection(ws, req) {
  const clientId = req.url.replace("/?id=", "");
  console.log(`Client connected with ID: ${clientId}`);

  let n = 0;
  const interval = setInterval(() => {
    ws.send(`${name}: you have been connected for ${n++} seconds`);
  }, 1000);

  ws.on("close", () => {
    clearInterval(interval);
  });
});

const port = process.env.PORT || 80;
server.listen(port, () => {
  console.log(`Server listening on port ${port}`);
});

We serve static files from the public directory and WebSocket connection requests on the same port. An incoming HTTP request from a browser loads public/index.html, and a WebSocket connection initiated from the client triggers the wss.on("connection", …) code. Upon receiving a WebSocket connection, I set up a recurring callback where I tell the client how long it has been connected. Now, let's take a look at the client code:

buttonConnect.onclick = async () => {
  const serverAddress = inputServerAddress.value;
  messages.innerHTML = "";
  instructions.parentElement.removeChild(instructions);

  appendMessage(`Connecting to ${serverAddress}`);

  try {
    let retries = 0;
    while (retries < 50) {
      appendMessage(`establishing connection... retry #${retries}`);
      await runSession(serverAddress);
      await sleep(1500);
      retries++;
    }

    appendMessage("Reached maximum retries, giving up.");
  } catch (e) {
    appendMessage(e.message || e);
  }
};

async function runSession(address) {
  const ws = new WebSocket(address);

  ws.addEventListener("open", () => {
    appendMessage("connected to server");
  });

  ws.addEventListener("message", ({ data }) => {
    console.log(data);
    appendMessage(data);
  });

  return new Promise((resolve) => {
    ws.addEventListener("close", () => {
      appendMessage("Connection lost with server.");
      resolve();
    });
  });
}

I use the WebSocket DOM API to connect to the server. Once connected, I append any received messages to console and on screen via the custom appendMessage function. If the client loses connectivity, it will try to reconnect up to 50 times. Let’s run it:

"slimy-cardinal" is a randomly generated server name

Now, suppose I am running a very demanding real-time application, and need to scale the server capacity beyond a single host. How would I do this? I create two Ubuntu 18.04 instances. Once the instances are up, I SSH to each one, and run the following commands:

sudo apt-get update
sudo apt-get -y install nodejs npm
git clone https://github.com/robzhu/ws-time
cd ws-time && npm install
node server.js

During installation, select Yes when presented with the prompt:

Select "Yes" when prompted to install libssl for npm

Keep these SSH sessions open; you need them shortly. Next, you create the load balancer in Amazon Lightsail and attach the instances:

screenshot of target instances for load balancers

Note: the Lightsail load balancer only works for port 80, which is part of the reason I use the same port for HTTP and WebSocket requests.

Copy the DNS name for the load balancer, open it in a new browser tab, and paste it into the WebSocket server address with the format:

ws://<DNSName>

screenshot of what the correct server address should look like

Make sure the server address does not accidentally start with “ws://http://…”

Next, locate the SSH session that accepted the connection. It looks like this:

accepted ssh section

The server logs the client ID when it receives a connection.

If you kill this process, the client disconnects and runs its retry logic, hopefully causing the load balancer to route the client to a healthy node. Next, hit connect from the client. After a few seconds, kill the process on the server, and you should see the client reconnect to a healthy instance:

what you should see when you connect to a healthy instance

The client retry was routed to a healthy instance on the first attempt. This is due to the round-robin algorithm that the Lightsail load balancer uses. In production, you should not expect the load balancer to detect an unhealthy node immediately. If the load balancer continues to route incoming connections to an unhealthy node, the client will need more retry attempts before reconnecting. If this is a large scale system, we will want to implement an exponential backoff on the retry intervals to avoid overwhelming other nodes in the cluster (aka the thundering herd problem).

Notice that the message “you have been connected for X seconds” reset X to 0 after the client reconnected. What if you want to make the failover transparent to the user? The problem is that the connection duration (X) is stored in the NodeJS process that we killed. That state is lost if the process dies or if the host goes down. The solution is unsurprising: move the state off the WebSocket server and into a distributed cache, such as redis.

Deep health checks

When you attached your instances to the load balancer, your health checks passed because the Lightsail load balancer issues an HTTP request for the default path (where you serve index.html). However, if you expect most of your server load to come from I/O on the WebSocket connections, the ability to serve your index.html file is not a good health check. You might implement a better health check like so:

app.get('/healthcheck', (req, res) => {
  const serverHasCapacity = getAverageNetworkUsage() < 0.6;
  if (serverHasCapacity) return res.status(200).send("ok");
  res.status(400).send("server is overloaded");
});

This causes the load balancer to consider a node "unhealthy" when the target node's network usage reaches a threshold value. In response, the load balancer stops routing new incoming connections to that node. However, note that the load balancer will not terminate existing connections to an over-subscribed node.
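You can check what the load balancer sees by requesting the health check path yourself; the instance address below is a placeholder:

curl -i http://<instance-public-ip>/healthcheck

A 200 response keeps the node in rotation, while a 400 response tells the load balancer to stop sending it new connections.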

When working with persistent connections or sticky sessions, always leave some capacity buffer. For example, do not mark the server as unhealthy only when it reaches 100% capacity. This is because existing connections or sticky clients will continue to generate traffic for that node, and some workloads may increase server usage beyond its threshold (e.g. a chat room that suddenly gets very busy).

 

Conclusion

I hope this post has given you a clear idea of how to use load balancers to improve scalability for stateful applications, and how to implement such a solution using Amazon Lightsail instances and load balancers. Please feel free to leave comments, and try this solution for yourself.

 

Low-latency computing with AWS Local Zones – Part 1

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/low-latency-computing-with-aws-local-zones-part-1/

This post was contributed by: Pranav Chachra, Rob Chen, Alan Goodman

AWS launched a Local Zone in Los Angeles (LA), California, in late 2019 at re:Invent. Since the launch, we have seen a lot of interest from you, and have worked to bring additional features and services based on your feedback.

In this blog series, we share best practices and recommendations for hosting your applications in Local Zones based on what has worked for customers. Part one focuses on sharing more about Local Zones and how customers are already using the LA Local Zone. In this blog, we also address key questions asked by customers since the launch of the LA Local Zone. In the upcoming parts of this blog series, we do a deep dive into specific use cases for which our customers are using the LA Local Zone. We also list best practices to host your applications in Local Zones and plan to provide related architectural considerations.

AWS Local Zones introduction

local zone base picture

Today, there are a significant number of applications that can run in the AWS Cloud. Most applications work well in our Regions. However, for workloads that require low-latency or local data processing, you need us to bring AWS infrastructure closer to you. Without local AWS presence, you have to procure, operate, and maintain IT infrastructure in your own data center or colocation facility for such workloads. And, you end up building and running such application components with a different set of APIs and tools than the other parts of your applications running in the AWS Cloud. This results in a lot of extra effort and expenses.

And that’s why, based on your feedback, we launched AWS Local Zones, a new type of AWS infrastructure deployment that places compute, storage, and other select services closer to large cities. We designed Local Zones as a powerful construct that extends our existing Regions into new locations. This gives you the ability to run applications on AWS that require single-digit millisecond latencies to your end-users or on-premises installations in the local area. Just like Regions, Local Zones are fully managed and supported by AWS, giving you the elasticity, scalability, and security benefits of running on AWS. The first Local Zone is generally available in LA, and is an extension of the US West (Oregon) Region (the parent Region).

Use cases

Like Regions, customers leverage the LA Local Zone for many different use cases. The top customer use cases include:

  1. Media and Entertainment (M&E) content creation: M&E customers are migrating expensive on-premises workstations to the LA Local Zone. They do this to accelerate content creation by getting rid of capacity constraints while improving security and operational efficiency. For artist workstations, latency is the key to having a jitter-free experience on a remote instance. These customers typically require less than five millisecond latency from their offices to virtual instances. With Direct Connect, most of these customers are able to achieve as low as 1-2 millisecond latency from their animation hubs in LA to the Local Zone, and run latency-sensitive workloads, such as live production, video editing.
  2. Enterprise migration with hybrid architecture: Enterprises have workloads running in their existing on-premises data centers in the LA metro area. These customers use the Local Zone to migrate complex legacy on-premises applications to AWS without expensive revamp of their architecture. Customers have told us that it can be daunting to migrate a portfolio of interdependent applications to the cloud. Now with a Direct Connect to the LA Local Zone, customers can establish a hybrid environment that provides ultra-low latency communication between applications running in the LA Local Zone and on-premises installations. In turn, this enables customers to migrate applications incrementally, simplifying migrations drastically and enabling on-going hybrid deployments in the LA area.
  3. Real-time multiplayer gaming: This category includes gaming companies that deploy game servers for multiplayer sessions all over the world to be closer to gamers. A latency of 20 milliseconds or less is considered ideal for good gameplay experience. Until now, these customers were using on-premises installations in the LA area to supplement AWS presence. However, now, customers are deploying latency-sensitive game servers in the LA Local Zone to run real-time and interactive multiplayer game sessions, enabling them to provide end users in the Southern California area with a great experience.

In the next parts of the blog series, we further dive into these use cases, cover respective best practices, and review architectural guidance for you.

Services and features

AWS launched the LA Local Zone with support for seven Nitro-based Amazon EC2 instance types (T3, C5, M5, R5, R5d, I3en, and G4), two EBS volume types (io1 and gp2), Amazon FSx for Windows File Server, Amazon FSx for Lustre, Application Load Balancer, Amazon VPC, and AWS Direct Connect. Since launch, AWS has added support for Amazon RDS, Amazon EMR, AWS Shield, and Amazon EC2 Dedicated Hosts, and is adding more services based on your feedback.

Other parts of AWS, like AWS CloudFormation templates, CloudWatch, IAM resources, and Organizations, will continue to work as expected, providing you a consistent experience. You can also leverage the full suite of services like Amazon S3 in the parent Region, US West (Oregon), over AWS’s private network backbone with ~20–30 milliseconds latency.

Getting started and using the Local Zone

Now that we reviewed some common use cases and services of Local Zones, we want to get you started using the Local Zone. First, enable the LA Local Zone from the new “Zone settings” section of the EC2 console, as shown in the following image:

console Zone Settings

Once enabled, the Local Zone looks and behaves similarly to an Availability Zone (AZ). You can access Local Zones through the parent Region’s console and API endpoints. The following image shows that the LA Local Zone is visible as us-west-2-lax-1a along with the other AZs in the EC2 console:

service health of zone

Once the LA Local Zone is enabled, you can extend your existing VPC from the parent Region to a Local Zone by creating a new VPC subnet assigned to the LA Local Zone:

creating subnet
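If you prefer the CLI, the opt-in and subnet creation steps can be sketched as follows; the VPC ID and CIDR block are placeholders for your own values.

# opt in to the LA Local Zone group from the parent Region
aws ec2 modify-availability-zone-group --region us-west-2 --group-name us-west-2-lax-1 --opt-in-status opted-in

# create a subnet assigned to the Local Zone
aws ec2 create-subnet --region us-west-2 --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.128.0/20 --availability-zone us-west-2-lax-1a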

Once a VPC subnet is established for the LA Local Zone, simply select the subnet while creating local resources. For example, you can launch an EC2 instance in the LA Local Zone by selecting the local subnet as the following:

configuring instance details

Local resources are then ready within seconds. You can manage these resources in the LA Local Zone just like resources in AZs:

shows the local zone created

Just like Regions, you can set up an Internet Gateway to access your local resources in the LA Local Zone over the internet. Or you can use AWS Direct Connect to route your traffic over a private network connection from any Direct Connect location. To get the best latency performance, you should use one of the Direct Connect locations available in LA:

  • T5 at El Segundo, Los Angeles, CA (recommended for lowest latency to the LA Local Zone)
  • CoreSite LA1, Los Angeles, CA
  • Equinix LA3, El Segundo, CA

Pricing

From a pricing perspective, instances and other local AWS resources running in a Local Zone have their own prices, which might differ from those in the parent Region. Billing reports include a prefix, “LAX1,” or location name, “US West (Los Angeles),” that is specific to a group of Local Zones in LA. EC2 instances in Local Zones are available in On-Demand and Spot form, and you can also purchase Savings Plans. For pricing information, you can visit the pricing section of the respective services and filter pricing information by choosing the Local Zone location as “US West (Los Angeles)” in the dropdown.

Data transfer charges in AWS Local Zones are the same as in the AZs in the parent Region today. For example, data transferred between Amazon EC2 instances in the LA Local Zone and Amazon S3 in the parent Region, US West (Oregon), is free. Similarly, data transferred in and out between Amazon EC2 in the Local Zone and Amazon EC2 in the parent Region is charged at $0.01/GB in each direction. You can learn more about data transfer prices “in” and “out” of Amazon EC2 here.

Thinking ahead

Later this year, AWS plans to open a second Local Zone in LA (us-west-2-lax-1b). The two Local Zones in LA will be interconnected with high-bandwidth, low-latency AWS networking allowing you to architect your low-latency applications for high availability and fault tolerance. Based on your feedback, we are also working on adding Local Zones in other locations along with the availability of the additional services, including Amazon ECS, Amazon Elastic Kubernetes Service, Amazon ElastiCache, Amazon ES, and Amazon Managed Streaming for Apache Kafka.

Conclusion

Now that we have provided the AWS Local Zones that you requested, we are really looking forward to seeing what you can do with them. AWS would love to get your advice on locations, additional local services and features, or other interesting use cases, so feel free to leave us your comments!

OpenFOAM on Amazon EC2 C6g Arm-based Graviton2 Instances – up to 37% better price/performance

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/c6g-openfoam-better-price-performance/

This post is contributed to by: Neil Ashton (AWS) – Principal CFD Specialist SA, Karthik Raman (AWS) – Senior HPC Specialist SA, Oliver Perks (Arm) – Principal HPC Engineer

Over the past 30 years, Computational Fluid Dynamics (CFD) has become a key part of many engineering design processes. From aircraft design to modelling the blood flow inside our bodies, the ability to understand the behavior of fluids has enabled countless innovations and improved the time to market for many products. Typically, the modelling of the underlying fluid flow equations remains the major limit to accuracy; however, the scale and availability of HPC resources is arguably the main bottleneck for the typical CFD user. The need for more HPC resources has accelerated over recent years due to the move to higher-fidelity approaches, in addition to the growing use of optimization techniques and machine learning driven workflows. These workflows require the simulation of thousands or even millions of small jobs to explore a design space or train an improved turbulence model. AWS enables engineers to overcome this HPC bottleneck by allowing the quick deployment of a supercomputer. This can scale to practically any size and with the right hardware to match your needs, for example, CPUs, GPUs.

CFD users have a broad choice of instance types on AWS. Typically, the majority of users select the compute-optimized Amazon EC2 C5 family of instances. These offer better price/performance than the Amazon EC2 M5 family of instances for the majority of CFD cases due to a need for memory bandwidth over total memory. This broad set of available instances allows users to easily incorporate Amazon EC2 M5 or Amazon EC2 R5 instances when higher memory is needed, for example, for pre- or post-processing.

There are typically two competing needs for the typical CFD user: turn-around time and cost. Depending on the underlying numerical methods, the speed of a simulation does not increase linearly with an increased core count. At some point, the cost of network communication causes scaling to become non-linear, which means the cost of increasing the number of cores is not matched by a similar decrease in runtime. Therefore, the user typically faces a choice between the fastest possible runtime and the most efficient cost for the simulation.

The recently announced Amazon EC2 C6g instances powered by AWS-designed Arm-based Graviton2 processors bring in a new instance option. So, the logical question for any AWS customer is whether this is suitable for their CFD workload. In this blog, we provide benchmarking results for the open-source CFD code OpenFOAM that can be easily compiled and run with the Amazon EC2 C6g instance. We demonstrate that you can achieve 37% better price/performance for both single-node and multi-node cases on more than 200M cells on thousands of cores.

 

AWS Graviton2-based Instances

AWS Graviton2 processors are custom built by AWS using the 64-bit Arm Neoverse cores to deliver great price performance for your cloud workloads running in Amazon EC2. The Graviton2-powered EC2 instances were announced at re:Invent 2019. The new Amazon EC2 C6g compute-optimized instances are available as part of the sixth generation EC2 offering. These instances are powered by AWS Graviton2 processors that use 64-bit Arm Neoverse N1 cores and custom silicon designed by AWS, built using advanced 7-nanometer manufacturing technology.

Graviton2 processor cores feature 64 KB of L1 cache and 1 MB of L2 cache, and include dual SIMD units that double the floating-point performance versus first-generation Graviton processors. This targets high performance computing workloads and also supports int8/fp16 number formats to accelerate machine learning inference workloads. Every vCPU is a physical core (that is, there is no simultaneous multithreading, or SMT). The instances are single socket, so there are no NUMA concerns, since every core sees the same path to memory and other cores. There are eight DDR4 memory channels running at 3200 MT/s, delivering over 200 GB/s of peak memory bandwidth.

The compute-optimized Amazon EC2 C6g instances are available in nine sizes: 1, 2, 4, 8, 16, 32, 48, or 64 vCPUs, or as bare metal instances. The compute-optimized Amazon EC2 C6g instances support configurations with up to 128 GB of DDR4 memory, or 2 GB per vCPU. The instances support up to 25 Gbps of network bandwidth and 19 Gbps of Amazon EBS bandwidth. These instances are powered by the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor.

Benchmarking Results

Single-Node Performance

In order to demonstrate the performance of the Amazon EC2 C6g instance, we have taken two test cases that reflect the range of cases that CFD users run. For the first case, we simulate a 4 million cell motorbike case that is part of the standard OpenFOAM tutorial suite. It might seem that these cases do not represent the complexity of some production CFD workflows. However, these were chosen to allow readers to easily replicate the results. This case uses the OpenFOAM simpleFoam solver, which uses a steady-state incompressible segregated SIMPLE approach combined with a multigrid method for the linear solver. The k-ω SST model is used with second-order upwinding for both the momentum and turbulence equations. The Scotch domain decomposition approach is used to ensure the best parallel efficiency. However, for this first case the focus is on single-node performance, given the low cell count.

Figures 1 and 2 show the time taken to run 5000 iterations as well as the cost for that simulation based upon on-demand pricing. It can be seen that the C6g.16xlarge offers a clear cost-per-simulation benefit over both the C5n.18xlarge (37%) and C5.24xlarge (29%) instances. There is, however, an increase in total run time compared to the C5.24xlarge (33%) and C5n.18xlarge (13%), which may mean that the C5.24xlarge or C5n.18xlarge will still be preferred by those prioritizing turn-around time.

 

Figure 1 - Single-node OpenFOAM cost per simulation

Figure 1 – Single-node OpenFOAM cost per simulation

Figure 2 - Single-node OpenFOAM run-time performance

Figure 2 – Single-node OpenFOAM run-time performance

Single-Node Performance Profile

Profiling performed on the simpleFoam solver (used in the motorbike example) showed that it is limited by the sustained memory bandwidth on the instance. We therefore measured the peak sustainable memory bandwidth performance (Figure 3) on the three instance types (C5n.18xlarge, C6g.16xlarge, C5.24xlarge) using STREAM Triad, and then calculated the corresponding bandwidth efficiencies, which are shown in Figure 4.

 

Figure 3 - Peak memory bandwidth using STREAM

Figure 3 – Peak memory bandwidth using STREAM

Figure 4 - Memory bandwidth efficiency using STREAM Triad

Figure 4 – Memory bandwidth efficiency using STREAM Triad

The C5n.18xlarge and C5.24xlarge instances have higher peak memory bandwidth due to more memory channels (dual socket) compared to the C6g.16xlarge instance. However, the Graviton2-based processor can achieve a higher percentage of peak (~86% bandwidth efficiency) compared to the x86 processors (~79% on C5.24xlarge). This, in addition to the cost benefits (47% vs. C5.24xlarge, 44% vs. C5n.18xlarge), is the reason for the improved cost per simulation with the C6g.16xlarge instance, as shown in Figure 1.

Multi-Node Performance

A further, larger mesh of 222 million cells was studied on the same motorbike geometry with the same underlying numerical methods. We did this to assess whether the same performance trends continue for much larger multi-node simulations. Figure 5 shows the number of iterations possible per minute for different numbers of cores for the C5.24xlarge, C5n.18xlarge, and C6g.16xlarge instances. You can see in Figure 5 that the scaling of the C5n.18xlarge instances is much better than that of the C5.24xlarge and C6g.16xlarge instances due to the use of the Elastic Fabric Adapter (EFA), which enables optimum scaling thanks to low-latency, high-bandwidth communication. However, while the scaling is much improved for C5n.18xlarge instances with EFA, the cost per simulation (Figure 6) shows up to 37% better price/performance for the C6g.16xlarge instances over the C5n.18xlarge and C5.24xlarge instances.

Figure 5 - Multi-node OpenFOAM Scaling Performance

Figure 5 – Multi-node OpenFOAM Scaling Performance

Figure 6 - Multi-node OpenFOAM cost per simulation

Figure 6 – Multi-node OpenFOAM cost per simulation

Conclusions

This blog has demonstrated the ease with which both single-node and multi-node CFD simulations can be ported to the new Amazon EC2 C6g instances powered by Arm-based AWS Graviton2 processors. With no code modifications, we have been able to port an x86 workload to the new C6g instances and achieve competitive single node and multi-node performance with up to 37% better price/performance over other C family instances. We encourage you to test out your applications for yourself and reach out to us if you have any questions!

OpenFOAM compilation on AWS Graviton2 Instances

OpenFOAM v1912 builds on modern Arm-based systems, with support in place for both the GCC compiler and the Arm Compiler for Linux (ACfL). No source code modifications are required to obtain a working OpenFOAM binary. The process for building OpenFOAM on Arm-based systems is equivalent to that on x86 systems: you specify the desired compiler in the OpenFOAM bashrc file.

The only external dependency is a working MPI library. For this experiment, we used Open MPI version 4.0.3, built with UCX 1.8 and GCC 9.2. We note that we have not applied any additional hardware-specific performance optimizations. Deploying OpenFOAM on the Amazon EC2 C6g instances was trivial, and there is documentation on the Arm community GitLab pages covering the installation of different OpenFOAM versions with different compilers.
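As an illustration, a minimal build outline on a Graviton2 (arm64) instance might look like the following; the compiler and MPI selections shown are assumptions based on a default OpenFOAM v1912 source tree, so check the official build instructions for your exact setup.

# from the top of an OpenFOAM-v1912 source tree on the Graviton2 instance,
# select the compiler and MPI in etc/bashrc (illustrative settings):
#   WM_COMPILER=Gcc
#   WM_MPLIB=SYSTEMOPENMPI
source etc/bashrc
./Allwmake -j $(nproc)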

For the C5n.18xlarge and C5.24xlarge simulations, OpenFOAM v1912 was compiled using GCC 8.2 and IntelMPI 2019.7. For all simulations, Amazon Linux 2 was the operating system. The HPC environment for all tests was created using AWS ParallelCluster, which will soon include official support for AWS's latest Graviton2 instances, including C6g. You can stay up to date on the latest information and releases of AWS ParallelCluster on GitHub.

Instance Details:

Model          vCPU   Memory (GiB)   Instance Storage   Network Bandwidth (Gbps)   EBS Bandwidth (Mbps)
C5n.18xlarge   72     192            EBS-only           100                        19,000
C6g.16xlarge   64     128            EBS-only           25                         19,000
C5.24xlarge    96     192            EBS-only           25                         19,000

 

Proactively Monitoring System Performance on Amazon Lightsail Instances

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/proactively-monitoring-system-performance-on-amazon-lightsail-instances/

This post is contributed by Mike Coleman, AWS Senior Developer Advocate – Lightsail

I commonly hear from customers that they want to be able to proactively identify issues that could affect system performance before they become a problem. For instance, they want the ability to be alerted before an instance becomes unresponsive to a burst in traffic due to exhaustion of all its CPU burst capacity. Burst capacity can be consumed either by a workload that needs to operate in the burstable zone for long periods of time, or by unexpected CPU consumption from system processes. In either case, you'd want to be notified so you could take corrective action, such as moving to a larger instance or stopping errant processes. To that end, today Amazon Lightsail launched a new feature allowing you to set up custom alarms to be notified when your burst capacity is running low.

 

Amazon Lightsail instances use burstable CPUs. These CPUs operate in two different zones: the sustainable zone and the burstable zone. The sustainable zone is based on the CPU's baseline performance. As long as the CPU utilization stays below this baseline, the system performs with no impact on system responsiveness. The burstable zone is entered whenever the CPU utilization climbs above the baseline. The instance can only operate in this zone for a finite amount of time (more on that below) before system performance is negatively impacted.

 

Earlier this year, we announced the release of resource monitoring, alarms, and notifications for Amazon Lightsail instances. This feature introduced a graph that shows when an instance operates in the CPU's burstable zone. However, this was only a partial solution, since there was no easy way for you to know how long your system could effectively operate in the burstable zone.

With today’s new feature, Lightsail has augmented that functionality by allowing you to see how much burstable capacity is available to your system at any given time. Additionally, you can create alarms and be notified when that burstable capacity has dropped to a critical level. This allows users to be proactively notified that there is a potential performance issue developing.

 

The remainder of this blog runs through how to configure a burstable capacity alert so you can prevent system performance issues before they impact your users.

 

Burstable capacity overview

Before I begin, it’s important to understand how burst capacity minutes are calculated. One minute of the CPU running at 100% is one minute of CPU burst capacity. By the same token, one minute of the CPU running at 50% is 30 seconds of CPU burst capacity. So, if a system has 72 minutes of available burst capacity and it’s running at 50% CPU, it can run in that state for 144 minutes before system performance is impacted.

CPU burst capacity can be displayed two ways: percentage available and minutes available, with the default being percentage. CPU burst capacity percentage is calculated by dividing the available burst capacity (minutes) by the maximum available burst capacity (minutes). For example, an instance with 36 minutes of burst capacity left out of a maximum of 72 minutes would be at 50%.

 

Configuring a Burstable Capacity Alert

Let’s take a look at how to configure an alert for CPU burst capacity (percentage).

 

Note: If you’re not familiar with the general concepts around how to configure alerts and notifications in Lightsail, you can read this blog post.

 

  1. From the Lightsail home page click on the instance you wish to create the alert for.
  2. From the horizontal menu, choose Metrics.
    metrics tab of Lightsail console
  3. You should now see the graphs for both CPU utilization and burst capacity.

Notice how the CPU utilization graph shows the sustainable and the burstable zones.

The burst capacity graph shows you the current percentage of burst capacity (or minutes) you have remaining before system performance is impacted.

In the following graphs, you can see that the CPU is operating at about 15% and has just over 90% of remaining burst capacity.
cpu graphs

  1. Scroll down and click CPU burst capacity (percentage).
    cpu alarm option
  2. Click +Add alarm.
  3. For my alarm, I choose to be notified whenever the percentage of available CPU burst capacity drops below 25% for 10 consecutive minutes.
    cpu percentage alarm option
  4. At this point, you can enable notifications via email or SMS message (instructions on how to do this can be found in the blog post I linked earlier). Regardless of whether you choose to enable notifications, you’ll receive a banner in the Lightsail console whenever your alarm threshold is breached.
  5. Click Create to create your alarm.

Lightsail is now configured with a single alarm. I consider it a best practice to configure two alarms. The first is a warning level, and the second is the critical level. This ensures that you have additional time to respond to developing problems. If you’d like to create a second alarm, just repeat the previous steps.
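You can also script alarm creation with the Lightsail CLI. The following sketch mirrors the console settings above; treat the exact parameter values, especially how evaluation periods map to minutes and the contact protocol, as assumptions to verify against the put-alarm documentation for your account.

aws lightsail put-alarm \
    --alarm-name cpu-burst-capacity-warning \
    --monitored-resource-name <your-instance-name> \
    --metric-name BurstCapacityPercentage \
    --comparison-operator LessThanThreshold \
    --threshold 25 \
    --evaluation-periods 2 \
    --contact-protocols Email \
    --notification-enabled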

 

Conclusion

In this blog, I covered the concepts behind Lightsail's burstable CPUs, and how you can create alarms to respond before your system runs out of burst capacity. If you see that your system is frequently running out of burst capacity, you should investigate the processes running on your instance, looking for any that are consuming extra CPU, or consider upgrading your instance to a larger plan. Read more about this latest feature here, and check out our Getting Started page for more tutorials and resources.

Enhancing site security with new Lightsail firewall features

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/enhancing-site-security-with-new-lightsail-firewall-features/

This post is contributed by Mike Coleman, AWS Senior Developer Advocate – Lightsail

Amazon Lightsail provides an easy way to get started with AWS for many customers. The service balances ease of use, security, and flexibility. The Lightsail firewall now offers additional features to help customers secure their Lightsail instances. This update offers three new capabilities:

  • The ability to specify source IP addresses for firewall rules
  • Explicitly allowing or disallowing remote access to instances via Lightsail’s web-based console
  • Support for PING

This blog explores each of these new features in detail, starting with source IP addresses.

Before this update, any open ports in the Lightsail firewall were open to the internet. In many cases, this is a reasonable approach. For example, for new WordPress servers, you likely need broad public access.

However, in some cases you want to restrict access to an instance. If you are staging a new website and it’s not ready for publication, you may want to limit access. One way to ensure that only certain people can visit the site is to only allow certain IP addresses to connect.

Another common use case is limiting remote access to an instance. With the new changes to the Lightsail firewall, you would be able to limit SSH or RDP access by source IP address. Additionally, you can now enable or disable remote access via Lightsail’s built-in web client.

Access can be restricted from one or more IP addresses (for example, the IP address for your home computer) or a continuous range of IP addresses (such as the address range for your corporate network).

Next, I review how you configure these options to restrict remote access via SSH to a single source IP address.

Finding your IP address

Most computers do not have an internet routable IP address assigned. Internet routable IP addresses are scarce and usually assigned to your internet gateway device. The devices on the network are assigned private IP addresses. To communicate between the private IP network and the internet, the network router typically uses network address translation (NAT).

This tutorial assumes you are using NAT. This means the IP address used to restrict SSH access is the IP routable address of your network gateway device (usually your wireless router). Consequently, this limits access to all devices on the network behind this IP address.

There are many ways to find your internet routable IP address. You can log into your network gateway device and find it there (consult your device’s user manual for more details). Alternatively, use one of several public services to determine your IP address – search online for “what is my IP” to list several options.

Restricting SSH access to a single IP address

  1. Start by creating a new Lightsail instance – you can select any blueprint.
  2. Once the instance state shows Running, choose the name of the instance to open the Instance details page.
    firewall test instance
  3. Choose Networking from the menu.
    networking tab
  4. Scroll down to find the current firewall settings. Under Allow connections from, it lists Any IP address for all of the applications. To change this, choose the edit icon for the SSH rule.
    IP address
  5. Check the box next to Restrict to IP address and enter your internet routable IP address under Source IP or range.restrict IP address
    Note: The next section shows how to restrict access from Lightsail’s browser-based SSH client. Currently, Allow Lightsail browser SSH box is checked.
  6. Choose Save.

Now, SSH into your Lightsail instance from your local machine. You can learn more about how to connect to your Lightsail instance using SSH from our documentation.

You should be able to connect to your instance successfully. Next, you test the connection from a different IP address. You do this by restricting a different IP address, and attempting to connect again:

  1. Edit the SSH firewall rule again, following the instructions above. This time, under Source IP or range, enter 192.168.2.150.
  2. Choose Save.

Attempt to connect to your instance once more. The connection fails because your IP address does not match an IP address in the range.
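If you manage firewall rules from the CLI instead of the console, the same restriction can be sketched as follows; the instance name and CIDR are placeholders, and the exact interaction with any existing rule for that port is worth verifying against the Lightsail documentation.

aws lightsail open-instance-public-ports \
    --instance-name firewall-test-instance \
    --port-info '{"fromPort":22,"toPort":22,"protocol":"TCP","cidrs":["203.0.113.25/32"]}'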

Restricting access from the Lightsail browser-based SSH client

The browser-based SSH client makes it easy to access instances without needing to manage SSH keys locally. However, there may be cases where you must disable browser-based access.

To do this:

  1. Navigate to the firewall rules for the instance you created earlier.
  2. Choose the edit icon for the SSH rule.edit IP address
  3. Uncheck the box next to Allow Lightsail browser SSH. Choose Save.
  4. From the menu, choose Connect, then choose Connect using SSH. The browser window opens, but you are not connected to the instance.

PING Support

There is now support for PING, a command line utility used to check if a computer is reachable over the network. PING sends a packet to a remote computer, which sends a simple response back. Before this release, you could not PING Lightsail instances.

To activate this feature, add a firewall rule:

  1. Navigate to the networking page for your instance. add rule to firewall
  2. Under the firewall section, choose +Add rule.
  3. From the application list, choose PING (ICMP). Choose Save.
  4. From a terminal window on your local machine, send a ping command to your Lightsail instance's IP address. You can find the IP address from the Connect tab of the instance details page or from the instance card on the Lightsail home page.
ping -c 5 192.168.2.143

You see a response similar to:

ping -c 5 192.168.2.143
PING 192.168.2.143 (192.168.2.143): 56 data bytes
64 bytes from 192.168.2.143: icmp_seq=0 ttl=54 time=19.383 ms
64 bytes from 192.168.2.143: icmp_seq=1 ttl=54 time=16.821 ms
64 bytes from 192.168.2.143: icmp_seq=2 ttl=54 time=16.363 ms
64 bytes from 192.168.2.143: icmp_seq=3 ttl=54 time=27.335 ms
64 bytes from 192.168.2.143: icmp_seq=4 ttl=54 time=19.429 ms

--- 192.168.2.143 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 16.363/19.866/27.335/3.943 ms

Conclusion

In this blog, I covered how you can increase the security of your Lightsail instances by taking advantage of three new features: source IP restrictions, limiting access to the Lightsail browser SSH and RDP clients, and the addition of PING (ICMP) as an application type. These new features provide you with an extra level of flexibility and security when deploying applications on Lightsail.

To learn more about the Lightsail firewall, see the documentation. Additionally, there are Getting Started tutorials for Lightsail, including launching a LAMP stack application or .NET application.

Using AWS CodeDeploy and AWS CodePipeline to Deploy Applications to Amazon Lightsail

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/using-aws-codedeploy-and-aws-codepipeline-to-deploy-applications-to-amazon-lightsail/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail | Twitter: @mikegcoleman

Introduction

Amazon Lightsail is the easiest way to get started in the cloud, allowing you to get your application running on your own virtual server in a matter of minutes. But, what do you do if you want to update that running application?

In order to automate the process of both deploying and updating software, many developers are turning to automated workflows. AWS has a full complement of tools that allow you to build deployment pipelines to cover a wide array of use cases.

This blog post provides guidance on how to configure Lightsail to work with AWS CodePipeline and AWS CodeDeploy to automatically deploy (or update) an application every time you push a change to GitHub. Even though this tutorial provides detailed, step-by-step instructions, you may still want to read some of the docs (CodeDeploy and CodePipeline) if you’re unfamiliar with deployment pipelines.

After completing this walkthrough you’ll be on your way to implementing more complex pipelines that might include build and test steps.

 

Prerequisites

You need the following to complete this walkthrough:

  • A GitHub account in which to fork the demo code
  • git installed on your local machine, and a basic understanding on how to use it
  • An AWS account with sufficient privileges to create AWS Identity and Access (IAM) users, policies, and service roles as well as to create resources with the following services: Amazon S3, CodePipeline, CodeDeploy and Lightsail
  • The AWS CLI installed and configured on your local machine

 

Document Important Values

As you go through this tutorial, you'll need to take note of a few key values. Open a text editor of your choice, and copy and paste the template below into a new document. As you go through the guide, when instructed, copy the values into your text document.

S3 Bucket Name:

Access Key ID:

Secret Key:

IAM User ARN:

Note: Both the Access key ID and Secret access key should be protected in the same manner you protect any sensitive username / password pair.

 

Solution Overview

The rest of this blog guides you through the process of setting up your deployment pipeline. You start by creating a service role for CodeDeploy, an Amazon S3 bucket, and an IAM user. After deploying these services, you create a Lightsail instance. You also install and configure the CodeDeploy agent, as well as registering the instance with CodeDeploy. Finally, you create an application in CodeDeploy, and configure CodePipeline to kick off a new deployment whenever you push changes to GitHub.

 

Create a service role

In AWS, a service role is a role that an AWS service assumes to perform actions on your behalf. The policies that you attach to the service role determine which AWS resources the service can access and what it can do with those resources. Below, you use AWS IAM to create a service role for CodeDeploy that has the necessary permissions to work with Lightsail; a CLI sketch of the same steps follows the console walkthrough below.

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane, choose Roles, and then choose Create role.
  3. On the Create role page, choose AWS service, and from the Choose the service that will use this role list, choose CodeDeploy.
  4. Near the bottom of the screen under Select your use case, choose CodeDeploy and click Next: Permissions.
  5. On the Attached permissions policy page, the default permission policy (AWSCodeDeployRole) is displayed. You can click on the policy name if you'd like to review the details of the policy. Click Next: Tags, and then click Next: Review.
  6. On the Create Role page, in Role name, enter a name for the service role (for example, CodeDeployServiceRole).
  7. Click Create role.
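If you prefer the AWS CLI over the console, a sketch like the following creates an equivalent service role. The trust policy file name is just an example; AWSCodeDeployRole is the AWS managed policy that the console attaches for this use case.

# Write a trust policy that lets CodeDeploy assume the role
cat <<EOT > codedeploy-trust.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "codedeploy.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOT

# Create the role and attach the managed CodeDeploy policy
aws iam create-role --role-name CodeDeployServiceRole --assume-role-policy-document file://codedeploy-trust.json
aws iam attach-role-policy --role-name CodeDeployServiceRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSCodeDeployRole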

 

Create an S3 bucket

In this section, you create an Amazon S3 bucket to store the deployment artifact created by CodeDeploy. This artifact is a compressed file containing your source code files, and any scripts that need to run as part of the installation or update process.

  1. Sign in to the AWS Management Console and open the S3 console, and click Create Bucket.
  2. Enter a name under Bucket name. The name must be unique across all of S3.
    Be sure to copy the S3 bucket name into your text document.
  3. Ensure Block all public access is checked.
  4. Click Create bucket.
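As an alternative to the console, here is a hedged CLI sketch of the same bucket setup. The bucket name and Region are placeholders, and buckets created outside us-east-1 also need a LocationConstraint.

# Create the bucket (add --create-bucket-configuration LocationConstraint=<Desired Region> outside us-east-1)
aws s3api create-bucket --bucket <S3 Bucket Name> --region <Desired Region>

# Block all public access on the bucket
aws s3api put-public-access-block --bucket <S3 Bucket Name> --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true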

 

Create IAM policy

An IAM policy is a set of rules that, when attached to an identity or resource, defines their permissions. For this use case, you want a policy that only allows the AWS CodeDeploy agent to read the S3 bucket you just created. In the next set of steps, you define the policy. Then, in the subsequent section, you apply that policy to an IAM user. That user is ultimately associated with the CodeDeploy agent running on your Lightsail instance.

  1. Sign in to the AWS Management Console and open the IAM console.
  2. In the navigation pane, choose Policies, and then choose Create policy.
  3. Click on the JSON tab.

Erase the content in the editor window, and paste in the code from below.

NOTE: Be sure to replace <S3 Bucket Name> with the name of the S3 bucket you created in the previous step.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::<S3 Bucket Name>/*"
      ]
    }
  ]
}


  4. Click Review policy.
  5. Enter CodeDeployS3BucketPolicy for the policy name.
  6. Click Create policy.
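If you saved the JSON above to a local file, the equivalent CLI call is a one-liner. The file name CodeDeployS3BucketPolicy.json is only an example.

aws iam create-policy --policy-name CodeDeployS3BucketPolicy --policy-document file://CodeDeployS3BucketPolicy.json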

Create an IAM user

Because you cannot assign an IAM role to a Lightsail instance, you need to create an IAM user with the appropriate permissions. In this case, the user needs to be able to read and list the contents of the S3 bucket you just created.

  1. Stay in the IAM console.
  2. In the navigation pane, choose Users, and then choose Add user.
  3. Enter LightSailCodeDeployUser for the User name and click Programmatic access under Select AWS access type (you use programmatic access since this user account never needs to log into the console). Click Next: Permissions.
  4. Click Attach existing policies directly. Enter CodeDeployS3BucketPolicy in the search box, and check the box next to the CodeDeployS3BucketPolicy policy.
  5. Click Next: Tags. Click Next: Review. Click Create user.
  6. Copy the Access key ID and Secret access key into your text document. You will need to click Show to display the secret access key. Note: If you do not copy these values now, you cannot go back and retrieve them from the console; you will need to create a new set of credentials.
  7. Click Close.
  8. Click on the user you just created and copy the User ARN into your document.
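The console steps above can also be approximated with the CLI. This sketch assumes the CodeDeployS3BucketPolicy policy exists in your own account; replace <account-id> with your AWS account ID.

aws iam create-user --user-name LightSailCodeDeployUser

# Attach the bucket policy you created earlier
aws iam attach-user-policy --user-name LightSailCodeDeployUser --policy-arn arn:aws:iam::<account-id>:policy/CodeDeployS3BucketPolicy

# Generate the access key pair and print the user ARN; record all three values in your text document
aws iam create-access-key --user-name LightSailCodeDeployUser
aws iam get-user --user-name LightSailCodeDeployUser --query "User.Arn" --output text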

At this point you have created an S3 bucket where CodeDeploy can store your build artifact, as well as the IAM components you need (a service role, an IAM policy, and an IAM user) to configure the CodeDeploy agent. In the next step, you deploy a Lightsail instance with the CodeDeploy agent, and then you register that instance with CodeDeploy.

Create a Lightsail instance and install the CodeDeploy agent

In this section you create the Lightsail instance where you want your code to run. In order for the instance to work with CodeDeploy you must install the CodeDeploy agent. The agent installation is done by providing a startup script that runs when the instance is first created.

  1. Log in to the AWS Management Console, and navigate to the Lightsail home page.
  2. Click Create instance.
  3. Ensure that you're creating your instance in the correct AWS Region.
  4. Under Pick your instance image click on Linux/Unix. Click on OS Only. Select Amazon Linux.
  5. Scroll down and click + Add launch script. In the code below paste in your Access key ID, Secret access key, and IAM User ARN from your text document. Also replace <Desired Region> with the Region that you deployed the instance into (e.g. us-west-2).

After you edit the code below, paste it into the launch script edit window. The configuration below allows the CodeDeploy agent to run with the permissions you assigned to the IAM user earlier. These permissions allow the CodeDeploy agent to download the deployment artifact created by the CodeDeploy service from the S3 bucket where it is stored. Additionally, the agent uses the information in the artifact to deploy or update your application.

mkdir /etc/codedeploy-agent/
mkdir /etc/codedeploy-agent/conf
cat <<EOT >> /etc/codedeploy-agent/conf/codedeploy.onpremises.yml
---
aws_access_key_id: <Access Key ID>
aws_secret_access_key: <Secret Access Key>
iam_user_arn: <IAM User ARN>
region: <Desired Region>
EOT
wget https://aws-codedeploy-us-west-2.s3.us-west-2.amazonaws.com/latest/install
chmod +x ./install
sudo ./install auto

     6. Enter codedeploy for the instance name under Identify your instance. Click Create Instance.

Verify the CodeDeploy agent

Wait about 5-10 minutes for the instance to boot up and run the startup script.

In this section you verify that the CodeDeploy agent is up and running, register your instance with CodeDeploy, and, finally, tag the instance.

Note: You will be using the command line for both your Lightsail instance and on your local machine. Pay careful attention to the instructions to ensure you’re issuing the commands on the right command line.

  1. Start an SSH session by clicking on the terminal icon next to the name of your instance.
  2. On the command line of the Lightsail terminal session enter the command below to verify the CodeDeploy agent is running.

sudo service codedeploy-agent status

You should see a response similar to the one below (the PID will be different):

The AWS CodeDeploy agent is running as PID 2783

      3. Enter the command below using the AWS CLI in a terminal session on your local machine to register your Lightsail instance with CodeDeploy.

NOTE: Replace <IAM User ARN> with the value in your document and the <Desired Region> with the appropriate Region.
NOTE: If you did not name your Lightsail instance codedeploy you will need to adjust the --instance-name parameter accordingly.
NOTE: The command does not provide any output.

aws deploy register-on-premises-instance --instance-name codedeploy --iam-user-arn <IAM User ARN> --region <Desired Region>

  4. Enter the following command using the AWS CLI in a terminal session on your local machine to tag your Lightsail instance in CodeDeploy. The tag will be used by CodeDeploy to know where to install your code.
    NOTE: If you did not name your Lightsail instance codedeploy you will need to adjust the --instance-name parameter accordingly.
    NOTE: Replace <Desired Region> with the appropriate Region.
    NOTE: The command does not provide any output.

aws deploy add-tags-to-on-premises-instances --instance-names codedeploy --tags Key=Name,Value=CodeDeployLightsailDemo --region <Desired Region>

      5. Enter the command below using the AWS CLI in a terminal session on your local machine to verify your machine was successfully registered:
          NOTE: Replace <Desired Region> with the appropriate Region.

aws deploy list-on-premises-instances --region <Desired Region>

You should see output similar to:

{
    "instanceNames": [
        "codedeploy"
    ]
}

At this point you are ready to set up your actual code deployment using CodePipeline and CodeDeploy.

Setup the application in CodeDeploy

  1. Navigate to the CodeDeploy console, make sure you're in the correct Region, and click Create application.
  2. Enter CodeDeployLightsailDemo for the Application name and select EC2/On-premises under Compute platform. Click Create application.
  3. In the Deployment groups section, click Create deployment group.
  4. Enter CodeDeployLightsailDemoDeploymentGroup for the Deployment group name.
  5. Click in the text box for Enter a service role and select the service role you created earlier (CodeDeployServiceRole).
  6. Under Environment configuration check the box for On-premises instances. Under Key enter Name and under Value enter CodeDeployLightsailDemo.
  7. Under Load balancer uncheck Enable load balancing.
  8. Click Create deployment group.
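The same application and deployment group can be created from the CLI. The sketch below assumes the service role ARN from earlier and the Name=CodeDeployLightsailDemo tag you applied to the instance.

aws deploy create-application --application-name CodeDeployLightsailDemo --compute-platform Server --region <Desired Region>

aws deploy create-deployment-group --application-name CodeDeployLightsailDemo --deployment-group-name CodeDeployLightsailDemoDeploymentGroup --service-role-arn <CodeDeployServiceRole ARN> --on-premises-instance-tag-filters Key=Name,Value=CodeDeployLightsailDemo,Type=KEY_AND_VALUE --region <Desired Region>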

 

Fork the GitHub Repo

In this section, you connect your GitHub account to CodePipeline so that whenever you push a change to GitHub your new code will automatically be deployed. For this tutorial, I placed a small demo application in a GitHub repository. These next steps guide you through forking that code into your own GitHub account.

  1. Sign in to GitHub.
  2. Navigate to the demo repository: http://github.com/mikegcoleman/codedeploygithubdemo
  3. To the right of the repository name at the top, click Fork.
  4. Click on the account that you want to fork the repository into.
    After a few seconds the fork process completes, and you are redirected to the new repo in your account.

Setup CodePipeline

As the name implies, CodePipeline allows you to create an automated set of steps for your application deployment, for instance, build and test processes that must happen before a final deployment step. In this example you build a very simple pipeline that redeploys your application when a change is pushed to the associated GitHub repository.

  1. Navigate to the CodePipeline console, ensure you’re in the correct Region, and click Create pipeline.
  2. Enter CodeDeployLightsailDemoPipeline for the Pipeline name.
  3. Click on Advanced Settings. Under Artifact store click the radio button next to Custom Location. Click into the Bucket text box and select the S3 bucket you created earlier. Click Next.
  4. From the Source provider drop down, choose GitHub. Click Connect to GitHub and follow any prompts to authorize CodePipeline to access your GitHub account.
    1. Note: If you've connected GitHub previously there will not be any additional prompts.
  5. Click in the Repository text box and select the repository you forked earlier. The name should be <your github username>/codedeploygithubdemo.
  6. Click in the Branch box and choose master. Click Next. Since there isn't a build stage click Skip build stage and confirm by clicking Skip.
  7. Choose AWS CodeDeploy from the Deploy provider drop down. Ensure the appropriate Region is selected. Choose CodeDeployLightsailDemo from the Application name list. Choose CodeDeployLightsailDemoDeploymentGroup from the Deployment group list.
    1. Click Next.
    2. Click Create pipeline.
  8. You'll be taken to the details page for your pipeline, and can watch the status of the pipeline update. Once the Deploy step has a status of Succeeded, feel free to move on to the next section.
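If you'd rather watch the pipeline from a terminal instead of the console, you can poll its state with the CLI (the pipeline name matches the one you created above):

aws codepipeline get-pipeline-state --name CodeDeployLightsailDemoPipeline --region <Desired Region>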

 

Test and Update the Application

In this final section, you verify that the application deployed, then you make an update to the application to kick off a new deployment. Finally, you verify that the update was successfully pushed to your server.

  1. Navigate in your web browser to the IP address of your Lightsail instance. You can find the IP address on the card for your instance on the Lightsail home page. You should see a simple webpage displayed.
  2. Move to the command line of your local machine, and clone the GitHub repository, being sure to insert your GitHub username.
    NOTE: If you are using SSH to authenticate to GitHub, adjust the git command accordingly.
    git clone https://github.com/<your github username>/codedeploygithubdemo
    You should see output similar to the following:

    Cloning into 'codedeploygithubdemo'...
    remote: Enumerating objects: 49, done.
    remote: Counting objects: 100% (49/49), done.
    remote: Compressing objects: 100% (33/33), done.
    remote: Total 49 (delta 25), reused 36 (delta 12), pack-reused 0
    Unpacking objects: 100% (49/49), done.

  3. Change into the directory with the website code:
    cd codedeploygithubdemo
  4. Using an editor of your choice, edit the index.html file by changing the background color to purple:
    background-color: purple;
  5. Push the changes to GitHub by issuing each of the following commands one at a time:
    git add index.html
    git commit -m "new background color"
    git push origin master
  6. Navigate back to the CodePipeline console and click on the name of your pipeline. You should see that the change from GitHub has been picked up and the pipeline is deploying your website. Once the pipeline has successfully completed move to the next step.
  7. Reload the demo website to see that the background color has changed.

Conclusion

Congratulations on finishing this tutorial. You should now have an understanding of how to automate the deployment and updating of your application running on Lightsail using CodeDeploy and CodePipeline. As a next step, you might want to try deploying a more complex application to Lightsail, including adding a build step. You can find information on how to do that in the AWS CodeBuild documentation.

 

 

 

Using Load Balancers on Amazon Lightsail

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/using-load-balancers-on-amazon-lightsail/

This post was written by Robert Zhu, Principal Developer Advocate at AWS. 

As your web application grows, it needs to scale up to maintain performance and improve availability. A load balancer is an important tool in any developer's tool belt. In this post, I show how to load balance a simple Node.js web application using the Amazon Lightsail Load Balancer. I chose Node.js as an example, but the concepts and techniques I discuss work with any other web application stack, such as PHP, Django, .NET, or even plain old HTML. To demonstrate these concepts, I deploy a load balancer that splits traffic between two Amazon Lightsail instances using round-robin load balancing. Compared to other load balancing solutions on AWS, such as our Application Load Balancer and Network Load Balancer, the Lightsail Load Balancer is simpler, easier to use, and has a fixed cost of $18/month, regardless of how much bandwidth you use or how many connections are open.

Launch the Cluster

For the demo, first create a cluster consisting of two Lightsail instances. To do so, navigate to the Lightsail console, and launch two micro Lightsail instances using the Ubuntu 18.04 image.

After you select your Region and OS, select an instance plan and identify your instance. Make sure that your instances have enough memory and compute to run your application. For this demo, the smallest instance size suffices.

Instance plan and name

 

Once your instances are ready, click each instance, open an SSH session, and run the following commands:

 

sudo apt-get update

sudo apt-get -y install nodejs npm

git clone https://github.com/robzhu/colornode

cd colornode && npm install

 

The above commands update the APT package registry and install Node.js and npm, which are needed to run the application. When installing npm, select “yes” when you see the following prompt:

npm prompt

Once installation is complete, you can run “node -v” and “npm -v” to make sure everything works. While this is an older version of Node.js, you should be able to use it without any issues. Remember to run these steps for both instances.

The application

Let’s take a tour of the application. Here’s the main.js source file:

 

const color = require("randomcolor")();
const name = require("./randomName");
const app = require("express")();

app.get("/", (_, res) => {
  res.send(`
    <html>
      <body bgcolor=${color}>
        <h1>${name}</h1>
      </body>
    </html>
  `);
});

app.get("/health", (_, res) => {
  res.status(200).send("I'm OK");
});

app.listen(80, () => console.log(`${name} started on port 80`));

 

Upon startup, the application generates a random name and color. It then uses express-js to serve responses to requests at two routes: “/” and “/health”. When a client requests the “/” route, the application responds with an HTML snippet that contains its name and color. When a client requests the “/health” route, it responds with HTTP status code 200. Later, I use the health route to enable Lightsail to determine the health of a node and whether the load balancer should continue routing requests to that node.

 

You can start the application by running “sudo node main.js”. Do this for both instances, and you should see two distinct webpages at each instance’s IP address:

images of two instance urls

 

Launch the Load Balancer

To create a load balancer, open the Network tab in the Lightsail console and click Create load balancer.

Networking tab of Lightsail console

On the next page, name your load balancer and click Create load balancer. After that, a page appears where you can add target instances to the load balancer:

target instances

From the dropdown menu under Target instances, select each instance and attach it to the load balancer, like so:

 

target instances 2

image of first loadbalancer

Once both instances are attached, copy the DNS name for your load balancer and paste it into a new browser session. When the load balancer receives the browser request, it routes the request to one of the two instances. As you refresh the browser, you should see the returned response alternate between the two instances. This should clearly illustrate the functionality of the load balancer.
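You can also see the round-robin behavior from a terminal. This is a quick sketch using curl against the load balancer's DNS name; each response should show one of the two random instance names.

for i in 1 2 3 4 5 6; do
  curl -s http://<load balancer DNS name>/ | grep "<h1>"
done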

Health Checks

Over time, your instances inevitably exhibit faulty behavior due to application errors, network partitions, reboots, etc. Load balancers need to detect faulty nodes in order to avoid routing traffic to them. The simplest way to implement a load balancer health check is to expose an HTTP endpoint on your nodes. By default, the Lightsail Load Balancer performs health checks by issuing an HTTP GET request against the node IP address directly. However, recall that the application serves a simple HTTP 200 response at the “/health” route. You can configure the Load Balancer to use this route to perform health checks on our nodes instead.

 

Within your Lightsail console, open your load balancer, click Customize health checking, and enter health for the route:

health check screenshot

 

Now, the Lightsail load balancer uses http://{instanceIP}/health to perform the health check. You can further customize your load balancer by using a custom domain and adding SSL termination.
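Before relying on the load balancer's check, you can hit the route yourself from any machine that can reach the instance. A simple verification sketch:

curl -i http://<instance IP>/health
# Expect an HTTP/1.1 200 OK status line followed by the body "I'm OK"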

 

The custom health check feature allows us to implement deeper health check logic. For example, if you want to make sure that the entire application stack is running, you can write a test value to the database and read it back. That gives your application a much higher assurance of functionality than simply returning status code 200 from an HTTP endpoint.

 

Conclusion

In summary, you created a Lightsail Load Balancer and used it to route traffic between two Lightsail instances serving our demo application. If one of these instances crashes or becomes unreachable, the load balancer will automatically route traffic to the remaining healthy nodes in the cluster based on a customizable HTTP health check. In comparison to other AWS load balancing solutions, the Lightsail Load Balancer is simpler, easier, and uses a fixed-pricing model. Please leave feedback in the comments or reach out to me on Twitter or email!

 

In a follow-up post, I walk through some more advanced concepts:

  • SSL termination
  • Custom Domains
  • Deep health check
  • Websocket connections

 

About the Author

Robert Zhu is a principal developer advocate at AWS.

Email: [email protected]

Twitter: @rbzhu

Building a Photo Diary Ghost on Amazon Lightsail

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/building-a-photo-diary-ghost-on-amazon-lightsail/

This post was written by Robert Zhu, a principal technical evangelist at AWS and a member of the GraphQL Working Group. 

Ghost is a simple and flexible alternative to WordPress. With Ghost, you can build a beautiful company website, personal portfolio, photo diary, or anything in between.

In this post, I show you how to start a photo diary using Ghost on Amazon Lightsail, AWS’ easiest solution for hosting virtual private servers. Compared to Amazon EC2, Amazon Lightsail takes care of advanced concepts like VPCs, Security Groups, and IAM policies for you until you want to re-engage those services. Amazon Lightsail also bundles monthly network transfer with the instance, whereas Amazon EC2 charges separately for network transfer.

There are two easy ways to get up and running with Ghost on Lightsail:

1. Using the Ghost Blueprint

Lightsail includes a number of common instance images known as “Blueprints.” When you launch a Blueprint, the selected software is installed and preconfigured, along with any dependencies. This is especially handy for Ghost because it saves you from having to install and configure MySQL. You can find our click-to-launch stacks in the Amazon Lightsail console, as shown in the following image.

Lightsail has pre-built Blueprints for you to launch entire applications with your VPS.

To launch a Blueprint instance, navigate to the Lightsail console. Then, select “Apps + OS” on the “Create Instance” page, and choose “Ghost.” Select the $5/month instance (this is the cheapest instance that meets the minimum requirements for running Ghost and its database on a single box).

 

Once the instance is up and running, find its IP address and open it in your browser. It may take a few minutes for Ghost to start running, during which you may see an error in the browser. As you can see, it is easy to start running Ghost on Amazon Lightsail with the Blueprint.
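If you would rather script the launch, the Lightsail CLI can do the same thing. The blueprint and bundle IDs below are placeholders; list the current values first, since they change as new Ghost releases and plans appear.

# Find the current Ghost blueprint ID and the bundle ID for the $5/month plan
aws lightsail get-blueprints --query "blueprints[?contains(blueprintId, 'ghost')].blueprintId"
aws lightsail get-bundles --query "bundles[].{id:bundleId,price:price}"

# Launch the instance with the IDs you found (names and zone are examples)
aws lightsail create-instances --instance-names ghost-blog --availability-zone us-east-1a --blueprint-id <ghost blueprint id> --bundle-id <bundle id>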

 

2. Running Ghost in a Container

As an alternative to launching a Ghost Blueprint, you can also run Ghost within a Docker container. In order to run Ghost within a container, you must install Docker on the Lightsail instance. Compared to the Blueprint approach, there are two advantages:

  1. Ability to run other applications on your Lightsail instance, such as forums, static websites, APIs, etc.
  2. Painless upgrades to new Ghost versions. When a new version of Ghost is released, you can upgrade with just a single command.

To run Ghost in a container complete the following steps.

  1. Create a Lightsail Instance (select Ubuntu 18.04).
  2. Once the instance is up and running, connect to it using Lightsail's browser-based SSH client and run the following commands to install Docker on your Lightsail instance:

sudo apt-get update && sudo apt-get install docker.io -y

sudo systemctl start docker

sudo systemctl enable docker

       3. To make sure Docker is installed and running, run:

sudo docker run hello-world 

You should see the “Hello from Docker!” message confirming that Docker is working correctly.

      4. Start the ghost container by running:

sudo mkdir /var/lib/ghost

sudo mkdir /var/lib/ghost/content

sudo docker run -d --name blog -p 80:2368 -v /var/lib/ghost/content:/var/lib/ghost/content ghost

The last command runs the “ghost” container from the public Docker registry. It also maps port 80 (HTTP) on the host to port 2368 inside the container where Ghost is listening for HTTP requests. The -v argument maps a path on the host to a path inside the container. This way, any blog content is persisted into /var/lib/ghost/content and doesn’t disappear when you stop/delete/upgrade the container. The -d argument tells Docker to run the container in detached mode.
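Because the content volume lives on the host, upgrading to a new Ghost release is just a matter of pulling a fresh image and recreating the container. A rough sketch, assuming the container name and paths used above:

sudo docker pull ghost
sudo docker stop blog
sudo docker rm blog
sudo docker run -d --name blog -p 80:2368 -v /var/lib/ghost/content:/var/lib/ghost/content ghost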

 

Now you can test your new blog.

  1. Find the public IP address of your Lightsail instance by selecting Manage from the instance menu.
  2. Paste the IP address into a browser, and you should see your brand new Ghost blog.

 

Installing a Custom Theme

You can customize the look and feel of your Ghost blog by installing a custom theme. There are plenty of free and paid themes online. If you’re a designer, you can even build your own theme from scratch.

For my photo diary, I use the London theme. Download the theme by clicking “Clone or download” > “Download ZIP.”


 

Next, we need to log into the Ghost admin console. If you launched your Ghost instance using the Lightsail blueprint, follow these steps:

  1. SSH into your Ghost instance
  2. You should see the Bitnami welcome message and a bitnami_credentials file under your home directory.
  3. Display the default admin user name and password by typing cat bitnami_credentials.
  4. Use your credentials to administer your ghost instance at http://INSTANCEIPADDRESS/ghost.

If you launched Ghost within a Docker container, then you can administer Ghost by going to http://INSTANCEIPADDRESS/admin.

Once you are logged into the Ghost admin console, click on Settings > Design > Upload a theme.


You then see a pop-up prompt to upload the theme file. Drag and drop the theme file you downloaded above. Once the upload is complete, click “activate.”

The london theme is perfect for showcasing your photographs

Refresh your ghost blog IP address in another tab, and you should see a bold, new look and feel.

The Ghost admin console is also used for managing content, plugins, users, and more. For example, you can use header and footer scripts to inject code to add features like analytics and comments. By combining these features, you can make your Ghost blog stand out and fit your workflow.

Managing DNS with Lightsail

If you want to add a custom domain for your blog, you can manage that domain within Lightsail. By managing all your DNS records in one place, you can simplify your workflows.

For example, Lightsail offers the ability to assign Static IP addresses to instances, and then automatically associate an A record to a static IP address. This is a quality-of-life improvement that helps you avoid typing the wrong IP address.

Here are the steps to manage your domain with Lightsail:

  1. Register your domain with a domain registrar, such as Route 53 or GoDaddy.
  2. Once registered, create a DNS Zone within Lightsail and note the name servers.
  3. In your domain registrar, replace the domain name servers with your Lightsail DNS Zone name servers. If you’re using Route 53, do not replace the name servers in the default hosted zone. Here’s a video that makes everything clear.
  4. After creating your DNS Zone in Lightsail, you can now associate A (address) records with your Lightsail instances.
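If you want to script these steps, a rough CLI sketch follows. The domain, instance, and static IP names are placeholders, and note that Lightsail domain resources are managed through the us-east-1 Region.

# Attach a static IP to the instance so the A record never goes stale
aws lightsail allocate-static-ip --static-ip-name ghost-blog-ip
aws lightsail attach-static-ip --static-ip-name ghost-blog-ip --instance-name ghost-blog

# Create the DNS zone and point an A record at the static IP
aws lightsail create-domain --domain-name example.com --region us-east-1
aws lightsail create-domain-entry --domain-name example.com --domain-entry name=example.com,target=<static IP address>,type=A --region us-east-1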

Once you have a custom domain, you'll want to consider adding SSL. Here are step-by-step instructions for issuing an SSL certificate using Let's Encrypt and terminating SSL using NGINX.

Conclusion

While Ghost is already a very flexible platform, the Ghost community constantly adds new plugins, themes, extensions, and features. Hosting Ghost yourself is a great way to learn some basic server workflows and get the most out of your Ghost blog. And with Lightsail, it’s simple and cheap! Please feel free to contact me with comments and questions. Go ahead and start running Ghost on Amazon Lightsail.

 

About the Author: 
Robert Zhu is a Principal Developer Evangelist at Amazon Web Services. He focuses on APIs, Web, Mobile, and Gaming. Prior to joining AWS, he worked on GraphQL at Facebook. While at Microsoft, he worked on the .net Framework, Windows Server, and Microsoft Game Studios. In his spare time, he loves learning about history, economics, and psychology. You can reach him @rbzhu on twitter or directly via telepathy.