Tag Archives: Compute

ICYMI: Serverless Q4 2019

Post Syndicated from Rob Sutter original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2019/

Welcome to the eighth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

The three months comprising the fourth quarter of 2019

AWS re:Invent

AWS re:Invent 2019

re:Invent 2019 dominated the fourth quarter at AWS. The serverless team presented a number of talks, workshops, and builder sessions to help customers increase their skills and deliver value more rapidly to their own customers.

Serverless talks from re:Invent 2019

Chris Munns presenting 'Building microservices with AWS Lambda' at re:Invent 2019

We presented dozens of sessions showing how customers can improve their architecture and agility with serverless. Here are some of the most popular.

Videos

Decks

You can also find decks for many of the serverless presentations and other re:Invent presentations on our AWS Events Content page.

AWS Lambda

For developers needing greater control over performance of their serverless applications at any scale, AWS Lambda announced Provisioned Concurrency at re:Invent. This feature enables Lambda functions to execute with consistent start-up latency, making them ideal for building latency-sensitive applications.
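
If you use AWS SAM, Provisioned Concurrency can be configured on a function alias. A minimal sketch (the function name and value are illustrative, assuming an AWS SAM template):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: index.handler
    Runtime: nodejs12.x
    AutoPublishAlias: live            # Provisioned Concurrency applies to an alias or version
    ProvisionedConcurrencyConfig:
      ProvisionedConcurrentExecutions: 50
```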

As shown in the graph below, Provisioned Concurrency reduces tail latency, directly impacting response times and providing a more responsive end user experience.

Graph showing performance enhancements with AWS Lambda Provisioned Concurrency

Lambda rolled out enhanced VPC networking to 14 additional Regions around the world. This change brings dramatic improvements to startup performance for Lambda functions running in VPCs due to more efficient usage of elastic network interfaces.

Illustration of AWS Lambda VPC to VPC NAT

New VPC to VPC NAT for Lambda functions

Lambda now supports three additional runtimes: Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new version-specific features and benefits, which are covered in the linked release posts. Like the Node.js 10 runtime, these new runtimes are all based on an Amazon Linux 2 execution environment.

Lambda released a number of controls for both stream and async-based invocations:

  • You can now configure error handling for Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams. It’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities help you deal with potential “poison pill” records that would previously cause stream processing to pause.
  • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.
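
If you deploy with AWS SAM, these asynchronous invocation controls can be expressed with the EventInvokeConfig property. A minimal sketch (the queue name is a placeholder):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ...
    EventInvokeConfig:
      MaximumEventAgeInSeconds: 600   # discard events older than 10 minutes
      MaximumRetryAttempts: 1
      DestinationConfig:
        OnFailure:
          Type: SQS
          Destination: !GetAtt FailureQueue.Arn
```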

AWS Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set separate destinations for success and failure. This unlocks new patterns for distributed event-based applications and can replace custom code previously used to manage routing results.

Illustration depicting AWS Lambda Destinations with success and failure configurations

Lambda Destinations

Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Kinesis Data Streams and DynamoDB Streams. This enables faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.

Illustration of multiple AWS Lambda invocations per Kinesis Data Streams shard

Lambda Parallelization Factor diagram

Lambda introduced Amazon SQS FIFO queues as an event source. “First in, first out” (FIFO) queues guarantee the order of record processing, unlike standard queues. FIFO queues support message batching via a MessageGroupId attribute, which enables parallel Lambda consumers of a single FIFO queue and high-throughput record processing by Lambda.
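
Conceptually, ordering is preserved within each message group while separate groups can be processed in parallel. A minimal sketch of that grouping logic (illustrative only, not the Lambda service implementation):

```python
from collections import defaultdict

def split_by_group(messages):
    """Partition SQS FIFO messages by MessageGroupId, preserving arrival order within each group."""
    groups = defaultdict(list)
    for message in messages:
        groups[message["MessageGroupId"]].append(message["Body"])
    return dict(groups)

batch = [
    {"MessageGroupId": "order-1", "Body": "created"},
    {"MessageGroupId": "order-2", "Body": "created"},
    {"MessageGroupId": "order-1", "Body": "shipped"},
]
# Each group keeps its own order; different groups may be handled by parallel consumers.
```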

Lambda now supports Environment Variables in the AWS China (Beijing) Region and the AWS China (Ningxia) Region.

You can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics show the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, discover outliers, and find hard-to-spot situations that affect customer experience for a subset of your users.
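
For example, a p99 duration can be computed from a set of samples with the nearest-rank method (a sketch of what CloudWatch calculates for you):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

durations_ms = [12, 14, 15, 16, 18, 20, 25, 40, 95, 450]
# p50 reflects the typical invocation; p99 exposes the slow tail that averages hide.
```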

Amazon API Gateway

Screen capture of creating an Amazon API Gateway HTTP API in the AWS Management Console

Amazon API Gateway announced the preview of HTTP APIs. In addition to significant performance improvements, most customers see an average cost savings of 70% when compared with API Gateway REST APIs. With HTTP APIs, you can create an API in four simple steps. Once the API is created, additional configuration for CORS and JWT authorizers can be added.

AWS SAM CLI

Screen capture of the new 'sam deploy' process in a terminal window

The AWS SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

AWS Step Functions

One powerful feature of AWS Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. In Q4, Step Functions expanded its integration with Amazon SageMaker to simplify machine learning workflows. Step Functions also added a new integration with Amazon EMR, making EMR big data processing workflows faster to build and easier to monitor.

Screen capture of an AWS Step Functions step with Amazon EMR

Step Functions step with EMR

Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor trends and react to usage on your AWS account.

You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.

Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates. This increases the limits to 1,500 state transitions per second and a default start rate of 300 state machine executions per second in US East (N. Virginia), US West (Oregon), and Europe (Ireland). Click the above link to learn more about the limit increases in other Regions.

Screen capture of choosing Express Workflows in the AWS Management Console

Step Functions released AWS Step Functions Express Workflows. With the ability to support event rates greater than 100,000 per second, this feature is designed for high-performance workloads at a reduced cost.

Amazon EventBridge

Illustration of the Amazon EventBridge schema registry and discovery service

Amazon EventBridge announced the preview of the Amazon EventBridge schema registry and discovery service. This service allows developers to automate the discovery and cataloging of event schemas for use in their applications. Additionally, once a schema is stored in the registry, you can generate and download a code binding that represents the schema as an object in your code.

Amazon SNS

Amazon SNS now supports the use of dead letter queues (DLQ) to help capture unhandled events. By enabling a DLQ, you can catch events that are not processed and resubmit or analyze them to locate processing issues.
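
A DLQ is attached to an SNS subscription through its RedrivePolicy attribute, which points at an SQS queue (the ARN below is a placeholder):

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-subscription-dlq"
}
```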

Amazon CloudWatch

Amazon CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

Screenshot of Amazon CloudWatch ServiceLens in the AWS Management Console

CloudWatch ServiceLens

CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables an outside-in view of your customers’ experiences and your service’s availability from their point of view.

CloudWatch introduced Embedded Metric Format, which helps you ingest complex high-cardinality application data as logs and easily generate actionable metrics. You can publish these metrics from your Lambda function by using the PutLogEvents API or using an open source library for Node.js or Python applications.
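
The format embeds metric metadata in a structured JSON log line. A sketch of building one by hand, following the EMF structure (the namespace, metric, and dimension names are made up for illustration):

```python
import json
import time

def emf_record(namespace, metric_name, value, unit, **dimensions):
    """Build one Embedded Metric Format log line as a JSON string."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),   # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,        # the metric value lives at the top level
    }
    record.update(dimensions)      # dimension values live at the top level too
    return json.dumps(record)

line = emf_record("MyApp", "ProcessingLatency", 97, "Milliseconds", Service="checkout")
```

Writing a line like this to CloudWatch Logs (for example, by printing it from a Lambda function) lets CloudWatch extract the metric automatically.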

Finally, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

AWS X-Ray

AWS X-Ray announced trace maps, which enable you to map the end-to-end path of a single request. Identifiers show issues and how they affect other services in the request’s path. These can help you to identify and isolate service points that are causing degradation or failures.

X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. With this support, CloudWatch Synthetics canary scripts can be traced throughout the application, providing metrics on performance and application issues.

Screen capture of AWS X-Ray Service map in the AWS Management Console

X-Ray Service map with CloudWatch Synthetics

Amazon DynamoDB

Amazon DynamoDB announced support for customer-managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows you to bring your own key (BYOK), giving you full control over how you encrypt and manage the security of your DynamoDB data.

It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.

Another new DynamoDB capability to identify frequently accessed keys and database traffic trends is currently in preview. With this, you can now more easily identify “hot keys” and understand usage of your DynamoDB tables.

Screen capture of Amazon CloudWatch Contributor Insights for DynamoDB in the AWS Management Console

CloudWatch Contributor Insights for DynamoDB

DynamoDB also released adaptive capacity. Adaptive capacity helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This helps reduce cost by enabling you to provision throughput for a more balanced workload instead of over-provisioning for uneven data access patterns.

Amazon RDS

Amazon Relational Database Service (RDS) announced a preview of Amazon RDS Proxy to help developers manage RDS connection strings for serverless applications.

Illustration of Amazon RDS Proxy

The RDS Proxy maintains a pool of established connections to your RDS database instances. This pool enables you to support a large number of application connections so your application can scale without compromising performance. It also increases security by enabling IAM authentication for database access and enabling you to centrally manage database credentials using AWS Secrets Manager.

AWS Serverless Application Repository

The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge appears next to your name in the SAR and links to your GitHub profile.

Screen capture of SAR Verified developer badge in the AWS Management Console

SAR Verified developer badges

AWS Developer Tools

AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has passed through specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.

AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

Screen capture of AWS CodeBuild

CodeBuild test trends in the AWS Management Console

Amazon CodeGuru

AWS announced a preview of Amazon CodeGuru at re:Invent 2019. CodeGuru is a machine learning based service that makes code reviews more effective and aids developers in writing code that is more secure, performant, and consistent.

AWS Amplify and AWS AppSync

AWS Amplify added iOS and Android as supported platforms. Now developers can build iOS and Android applications using the Amplify Framework with the same category-based programming model that they use for JavaScript apps.

Screen capture of 'amplify init' for an iOS application in a terminal window

The Amplify team has also improved offline data access and synchronization by announcing Amplify DataStore. Developers can now create applications that allow users to continue to access and modify data, without an internet connection. Upon connection, the data synchronizes transparently with the cloud.

For a summary of Amplify and AppSync announcements before re:Invent, read: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

Illustration of AWS AppSync integrations with other AWS services

Q4 serverless content

Blog posts

October

November

December

Tech talks

We hold AWS Online Tech Talks covering serverless topics throughout the year. These are listed in the Serverless section of the AWS Online Tech Talks page.

Here are the ones from Q4:

Twitch

October

There are also a number of other helpful video series covering Serverless available on the AWS Twitch Channel.

AWS Serverless Heroes

We are excited to welcome some new AWS Serverless Heroes to help grow the serverless community. We look forward to some amazing content to help you with your serverless journey.

AWS Serverless Application Repository (SAR) Apps

In this edition of ICYMI, we are introducing a section devoted to SAR apps written by the AWS Serverless Developer Advocacy team. You can run these applications and review their source code to learn more about serverless and to see examples of suggested practices.

Still looking for more?

The Serverless landing page has much more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. We’re also kicking off a fresh series of Tech Talks in 2020 with new content providing greater detail on everything new coming out of AWS for serverless application developers.

Throughout 2020, the AWS Serverless Developer Advocates are crossing the globe to tell you more about serverless, and to hear more about what you need. Follow this blog to keep up on new launches and announcements, best practices, and examples of serverless applications in action.

You can also follow all of us on Twitter to see the latest news, follow conversations, and interact with the team.

Chris Munns: @chrismunns
Eric Johnson: @edjgeek
James Beswick: @jbesw
Moheeb Zara: @virgilvox
Ben Smith: @benjamin_l_s
Rob Sutter: @rts_rob
Julian Wood: @julian_wood

Happy coding!

Orchestrating a security incident response with AWS Step Functions

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/orchestrating-a-security-incident-response-with-aws-step-functions/

In this post I will show how to implement the callback pattern of an AWS Step Functions Standard Workflow. This is used to add a manual approval step into an automated security incident response framework. The framework could be extended to remediate automatically, according to the individual policy actions defined. For example, applying alternative actions, or restricting actions to specific ARNs.

The application uses Amazon EventBridge to trigger a Step Functions Standard Workflow on an IAM policy creation event. The workflow compares the policy action against a customizable list of restricted actions. It uses AWS Lambda and Step Functions to roll back the policy temporarily, then notify an administrator and wait for them to approve or deny.

Figure 1: High-level architecture diagram.

Important: the application uses various AWS services, and there are costs associated with these services after the Free Tier usage. Please see the AWS pricing page for details.

You can deploy this application from the AWS Serverless Application Repository. You then create a new IAM Policy to trigger the rule and run the application.

Deploy the application from the Serverless Application Repository

  1. Find the “Automated-IAM-policy-alerts-and-approvals” app in the Serverless Application Repository.
  2. Complete the required application settings
    • Application name: an identifiable name for the application.
    • EmailAddress: an administrator’s email address for receiving approval requests.
    • restrictedActions: the IAM Policy actions you want to restrict.

      Figure 2 Deployment Fields

  3. Choose Deploy.

Once the deployment process completes, 21 new resources are created, including:

  • Five Lambda functions that contain the business logic.
  • An Amazon EventBridge rule.
  • An Amazon SNS topic and subscription.
  • An Amazon API Gateway REST API with two resources.
  • An AWS Step Functions state machine.

To receive Amazon SNS notifications as the application administrator, you must confirm the subscription to the SNS topic. To do this, choose the Confirm subscription link in the verification email that was sent to you when deploying the application.

EventBridge receives new events in the default event bus. Here, the event is compared with associated rules. Each rule has an event pattern defined, which acts as a filter to match inbound events to their corresponding rules. In this application, a matching event rule triggers an AWS Step Functions execution, passing in the event payload from the policy creation event.
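
For illustration, an event pattern matching IAM CreatePolicy calls delivered via CloudTrail might look like the following (the actual rule is created for you when the application deploys):

```json
{
  "source": ["aws.iam"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["iam.amazonaws.com"],
    "eventName": ["CreatePolicy"]
  }
}
```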

Running the application

Trigger the application by creating a policy either via the AWS Management Console or with the AWS Command Line Interface.

Using the AWS CLI

First install and configure the AWS CLI, then run the following command:

aws iam create-policy --policy-name my-bad-policy1234 --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketObjectLockConfiguration",
                "s3:DeleteObjectVersion",
                "s3:DeleteBucket"
            ],
            "Resource": "*"
        }
    ]
}'

Using the AWS Management Console

  1. Go to Services > Identity and Access Management (IAM) dashboard.
  2. Choose Create policy.
  3. Choose the JSON tab.
  4. Paste the following JSON:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketObjectLockConfiguration",
                    "s3:DeleteObjectVersion",
                    "s3:DeleteBucket"
                ],
                "Resource": "*"
            }
        ]
    }
  5. Choose Review policy.
  6. In the Name field, enter my-bad-policy.
  7. Choose Create policy.

Either of these methods creates a policy with the permissions required to delete Amazon S3 buckets. Deleting an S3 bucket is one of the restricted actions set when the application is deployed:

Figure 3 default restricted actions

This sends the event to EventBridge, which then triggers the Step Functions state machine. The Step Functions state machine holds each state object in the workflow. Some of the state objects use the Lambda functions created during deployment to process data.

Others use Amazon States Language (ASL) enabling the application to conditionally branch, wait, and transition to the next state. Using a state machine decouples the business logic from the compute functionality.

After triggering the application, go to the Step Functions dashboard and choose the newly created state machine. Choose the current running state machine from the executions table.

Figure 4 State machine executions.

You see a visual representation of the current execution, with the workflow paused at the AskUser state.

Figure 5 Workflow Paused

These are the states in the workflow:

ModifyData
State Type: Pass
Re-structures the input data into an object that is passed throughout the workflow.

ValidatePolicy
State type: Task. Services: AWS Lambda
Invokes the ValidatePolicy Lambda function that checks the new policy document against the restricted actions.
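
The check itself can be sketched as a simple intersection between the policy’s actions and the restricted list (an illustrative sketch, not the deployed function’s code):

```python
def find_restricted_actions(policy_document, restricted_actions):
    """Return every action in the policy document that appears in the restricted list."""
    found = []
    for statement in policy_document.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):   # "Action" may be a single string or a list
            actions = [actions]
        found.extend(action for action in actions if action in restricted_actions)
    return found

policy = {"Statement": [{"Effect": "Allow",
                         "Action": ["s3:GetBucketObjectLockConfiguration", "s3:DeleteBucket"],
                         "Resource": "*"}]}
```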

ChooseAction
State type: Choice
Branches depending on input from ValidatePolicy step.

TempRemove
State type: Task. Service: AWS Lambda
Creates a new default version of the policy with only permissions for Amazon CloudWatch Logs and deletes the previously created policy version.

AskUser
State type: Task. Service: AWS Lambda
Sends an approval email to the user via Amazon SNS, with the task token that initiates the callback pattern.

UsersChoice
State type: Choice
Branches based on the user’s action to approve or deny.

Denied
State type: Pass
Ends the execution with no further action.

Approved
State type: Task. Service: AWS Lambda
Restores the initial policy document by creating it as a new version.

AllowWithNotification
State type: Task. Services: AWS Lambda
With no restricted actions detected, the user is still notified of the change (via an email from SNS) before execution ends.

The callback pattern

An important feature of this application is the ability for an administrator to approve or deny a new policy. The Step Functions callback pattern makes this possible.

The callback pattern allows a workflow to pause during a task and wait for an external process to return a task token. The task token is generated when the task starts. When the AskUser function is invoked, it is passed a task token. The task token is published to the SNS topic along with the API resources for approval and denial. These API resources are created when the application is first deployed.

When the administrator clicks the approve or deny link, the token is passed with the API request to the receiveUser Lambda function. This Lambda function uses the incoming task token to resume the AskUser state.

The lifecycle of the task token as it transitions through each service is shown below:

Figure 6 Task token lifecycle

  1. To invoke this callback pattern, the AskUser state definition is declared using the .waitForTaskToken identifier, with the task token passed into the Lambda function as a payload parameter:
    "AskUser":{
     "Type": "Task",
     "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
     "Parameters":{  
     "FunctionName": "${AskUser}",
     "Payload":{  
     "token.$":"$$.Task.Token"
      }
     },
      "ResultPath":"$.taskresult",
      "Next": "UsersChoice"
      },
  2. The AskUser Lambda function can then access this token within the event object:
    exports.handler = async (event,context) => {
        let approveLink = `${process.env.APIAllowEndpoint}?token=${JSON.stringify(event.token)}`
        let denyLink = `${process.env.APIDenyEndpoint}?token=${JSON.stringify(event.token)}`
    //code continues
  3. The task token is published to an SNS topic along with the message text parameter:
        let params = {
     TopicArn: process.env.Topic,
     Message: `A restricted Policy change has been detected Approve:${approveLink} Or Deny:${denyLink}` 
    }
     let res = await sns.publish(params).promise()
    //code continues
  4. The administrator receives an email with two links, one to approve and one to deny. The task token is appended to these links as a request query string parameter named token:

    Figure 7 Approve / deny email.

  5. Using the Amazon API Gateway proxy integration, the task token is passed directly to the receiveUser Lambda function from the API resource, and is accessible within the function code as part of the event’s queryStringParameters object:
    exports.handler = async(event, context) => {
    //some code
        let taskToken = event.queryStringParameters.token
    //more code
    
  6. The token is then sent back to the AskUser state via an API call from within the receiveUser Lambda function. This API call also defines the next course of action for the workflow to take.
    //some code 
    let params = {
            output: JSON.stringify({"action":NextAction}),
            taskToken: taskTokenClean
        }
    let res = await stepfunctions.sendTaskSuccess(params).promise()
    //code continues
    

Each Step Functions execution can last for up to a year, allowing for long wait periods for the administrator to take action. There is no extra cost for a longer wait time as you pay for the number of state transitions, and not for the idle wait time.

Conclusion

Using EventBridge to route IAM policy creation events directly to AWS Step Functions reduces the need for unnecessary communication layers. It helps promote good use of compute resources, ensuring Lambda is used to transform data, and not transport or orchestrate.

Using Step Functions to invoke services sequentially has two important benefits for this application. First, you can identify the use of restricted policies quickly and automatically. Also, these policies can be removed and held in a ‘pending’ state until approved.

Step Functions Standard Workflow’s callback pattern can create a robust orchestration layer that allows administrators to review each change before approving or denying.

For the full code base see the GitHub repository https://github.com/bls20AWS/AutomatedPolicyOrchestrator.

For more information on other Step Functions patterns, see our documentation on integration patterns.

Continued support for Python 2.7 on AWS Lambda

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/continued-support-for-python-2-7-on-aws-lambda/

The Python Software Foundation (PSF), which is the Python language governing body, ended support for Python 2.7 on January 1, 2020. Additionally, many popular open source software packages also ended their support for Python 2.7 then.

What is AWS Lambda doing?

We recognize that Python 2 and 3 differ across several core language aspects, and the application binary interface is not always compatible. We also recognize that these differences can make migration challenging. To allow you additional time to prepare, AWS Lambda will continue to provide critical security patches for the Python 2.7 runtime until at least December 31, 2020. Lambda’s scope of support includes the Python interpreter and standard library, but does not extend to third-party packages.

What do I need to do?

There is no action required to continue using your existing Python 2.7 Lambda functions. Function creates and updates for Python 2.7 will continue to work as normal. However, we highly recommend that you migrate your Lambda functions to Python 3. AWS Lambda supports Python 3.6, 3.7, and 3.8. As always, you should test your functions for Python 3 language compatibility before applying changes to your production functions.
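
Two changes that commonly break Python 2.7 code during migration are true division and the text/bytes split. For example:

```python
# In Python 3, / always performs true division; use // for floor division.
average = 7 / 2      # 3 under Python 2, 3.5 under Python 3
floored = 7 // 2     # 3 under both

# In Python 3, str and bytes are distinct types and must be converted explicitly.
payload = "hello".encode("utf-8")
text = payload.decode("utf-8")
```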

The Python community offers helpful guides and tools to help you port Python 2 code to Python 3:

What if I have issues or need help?

Please contact us through AWS Support or the AWS Developer Forums with any questions.

Happy coding!

Running ANSYS Fluent on Amazon EC2 C5n with Elastic Fabric Adapter (EFA)

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-ansys-fluent-on-amazon-ec2-c5n-with-elastic-fabric-adapter-efa/

Written by: Nicola Venuti, HPC Specialist Solutions Architect 

In July 2019 I published “Best Practices for Running Ansys Fluent Using AWS ParallelCluster.” That post demonstrated how to launch ANSYS Fluent on AWS using AWS ParallelCluster. In this post, I discuss a new AWS service, the Elastic Fabric Adapter (EFA), and walk you through an example of using EFA for tightly coupled workloads. Finally, I demonstrate how EFA can accelerate your tightly coupled (MPI) workloads, lowering your cost per job.

 

EFA is a new network interface for Amazon EC2 instances designed to accelerate tightly coupled (MPI) workloads on AWS. AWS announced EFA at re:Invent in 2018. If you want to learn more about EFA, you can read Jeff Barr’s blog post and watch this technical video led by our Principal Engineer, Brian Barrett.

 

After reading our first blog post, readers asked for benchmark results and the cost associated with each job. So, in addition to a step-by-step guide on how to run your first ANSYS Fluent job with EFA, this blog post also shows the results (in terms of rating and scaling curve) of a common ANSYS Fluent benchmark, the Formula 1 Race Car (140M cell mesh), at up to 5,000 cores, along with a cost-per-job comparison among the most suitable Amazon EC2 instance types.

 

Create your HPC Cluster

In this part of the blog, I walk you through the following: setting up the AWS ParallelCluster configuration file, setting up the post-install script, and deploying your HPC cluster.

 

Set up AWS ParallelCluster
I use AWS ParallelCluster in this example because it simplifies the deployment of HPC clusters on AWS. This AWS-supported, open source tool manages and deploys HPC clusters in the cloud. Additionally, AWS ParallelCluster is already integrated with EFA, which eliminates extra effort to run your preferred HPC applications.

 

The latest release (2.5.1) of AWS ParallelCluster simplifies cluster deployment in three main ways. First, the updates remove the need for custom AMIs. Second, important components (particularly NICE DCV) now run on AWS ParallelCluster. Finally, hyperthreading can be shut down using a new parameter in the configuration file.

 

Note: If you need additional instructions on how to install AWS ParallelCluster and get started, read this blog post, and/or the AWS ParallelCluster documentation.

 

The first few steps of this blog post differ from the previous post’s because of these updates. This means that the AWS ParallelCluster configuration file is different. In particular, here are the additions:

  1. enable_efa = compute in the [cluster] section
  2. the new [dcv] section and the dcv_settings parameter in the [cluster] section
  3. the new parameter disable_hyperthreading in the [cluster] section

These additions to the configuration file enable automatic functionalities that previously needed to be enabled manually.

 

Next, in your preferred text editor paste the following code:

aws_region_name = <your-preferred-region>

[global]
sanity_check = true
cluster_template = fluentEFA
update_check = true

[vpc my-vpc-1]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<Subnet-ID>

[cluster fluentEFA]
key_name = <Key-Name>
vpc_settings = my-vpc-1
compute_instance_type=c5n.18xlarge
master_instance_type=c5n.2xlarge
initial_queue_size = 0
max_queue_size = 100
maintain_initial_size = true
scheduler=slurm
cluster_type = ondemand
s3_read_write_resource=arn:aws:s3:::<Your-S3-Bucket>*
post_install = s3://<Your-S3-Bucket>/fluent-efa-post-install.sh
placement_group = DYNAMIC
placement = compute
base_os = centos7
tags = {"Name" : "fluentEFA"}
disable_hyperthreading = true
fsx_settings = parallel-fs
enable_efa = compute
dcv_settings = my-dcv

[dcv my-dcv]
enable = master

[fsx parallel-fs]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://<Your-S3-Bucket>
imported_file_chunk_size = 1024
export_path = s3://<Your-S3-Bucket>/export

 

Now that the AWS ParallelCluster configuration file is set up, you are ready for the second step: the post-install script.

 

Edit the Post-Install Script

Below is an example of a post-install script. Make sure it is saved in the S3 bucket defined by the parameter post_install = s3://<Your-S3-Bucket>/fluent-efa-post-install.sh in the configuration file above.

 

To upload the post-install script into your S3 bucket, run the following command:

aws s3 cp fluent-efa-post-install.sh s3://<Your-S3-Bucket>/fluent-efa-post-install.sh

#!/bin/bash

#this will disable the ssh host key checking
#usually not needed, but Fluent might require this setting.
cat <<\EOF >> /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
EOF

# set higher ulimits,
# useful when running Fluent (and HPC applications in general) on multiple instances via MPI
cat <<\EOF >> /etc/security/limits.conf
* hard memlock unlimited
* soft memlock unlimited
* hard stack 1024000
* soft stack 1024000
* hard nofile 1024000
* soft nofile 1024000
EOF

#stop and disable the firewall
systemctl disable firewalld
systemctl stop firewalld

Now you have all the pieces of AWS ParallelCluster in place, and you are ready to deploy your HPC cluster.

 

Deploy HPC Cluster

Run the following command to create your HPC cluster that is EFA enabled:

pcluster create -c fluent.config fluentEFA -t fluentEFA -r <your-preferred-region>

Note: The "*" at the end of the s3_read_write_resource parameter line is needed in order to let AWS ParallelCluster access your S3 bucket correctly. So, for example, if your S3 bucket is called "ansys-download," it would look like:

s3_read_write_resource=arn:aws:s3:::ansys-download*

 

You should have your HPC cluster up and running after following the three main steps in this section. Now you can install ANSYS Fluent.

Install ANSYS Fluent

The steps in the previous section should take about 10 minutes to complete and produce the following output:

Status: parallelcluster-fluentEFA - CREATE_COMPLETE
MasterPublicIP: 3.212.243.33
ClusterUser: centos
MasterPrivateIP: 10.6.1.153

Once you receive that successful output, you can move on to install ANSYS Fluent. Enter one of the following commands to connect to the master node of your new cluster via SSH or DCV:

  1. via SSH: pcluster ssh fluentEFA -i ~/my_key.pem
  2. via DCV: pcluster dcv connect fluentEFA --key-path ~/my_key.pem

 

Once you are logged in, become root (sudo su - or sudo -i ), and install the ANSYS suite under the /fsx directory. You can install it manually, or you can use the sample script.

 

Note: I defined the import_path = s3://<Your-S3-Bucket> in the Amazon FSx section of the configuration file. This tells Amazon FSx to preload all the data from <Your-S3-Bucket>. I recommend copying the ANSYS installation files, and any other file or package you need, to S3 in advance. This step ensures that your files are available under the /fsx directory of your cluster.

 

The example below uses the ANSYS iso installation files. You can use either the tar or the iso file. You can download both from the ANSYS Customer Portal under Download → Current Release.

 

Run this sample script to install ANSYS:

#!/bin/bash

#check the installation directory
if [ ! -d "${1}" -o -z "${1}" ]; then
echo "Error: please check the install dir"
exit -1
fi

ansysDir="${1}/ansys_inc"
installDir="${1}/"

ansysDisk1="ANSYS2019R3_LINX64_Disk1.iso"
ansysDisk2="ANSYS2019R3_LINX64_Disk2.iso"

# mount the Disks
disk1="${installDir}/AnsysDisk1"
disk2="${installDir}/AnsysDisk2"
mkdir -p "${disk1}"
mkdir -p "${disk2}"

echo "Mounting ${ansysDisk1} ..."
mount -o loop "${installDir}/${ansysDisk1}" "${disk1}"

echo "Mounting ${ansysDisk2} ..."
mount -o loop "${installDir}/${ansysDisk2}" "${disk2}"

# Install ANSYS Workbench
echo "Installing ANSYS Workbench ..."
"${disk1}/INSTALL" -silent -install_dir "${ansysDir}" -media_dir2 "${disk2}"

echo "Ansys installed"

umount -l "${disk1}"
echo "${ansysDisk1} unmounted..."

umount -l "${disk2}"
echo "${ansysDisk2} unmounted..."

echo "Cleaning up temporary install directory"
rm -rf "${disk1}"
rm -rf "${disk2}"

echo "Installation process completed!"

 

Congrats, now you have successfully installed ANSYS Workbench!

 

Adapt the ANSYS Fluent mpi_wrapper

Now that your HPC cluster is running and ANSYS Workbench is installed, you can patch ANSYS Fluent. ANSYS Fluent does not currently support EFA out of the box, so you need to make a few modifications to get your app running properly.

 

Complete the following steps to make the proper modifications:

 

Open mpirun.fl (an MPI wrapper script) with your preferred text editor:

vim /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl 

 

Comment out line 465:

# For better performance, suggested by Intel
FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv I_MPI_ADJUST_REDUCE 2 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_ADJUST_BCAST 1"

 

In addition to that, line 548:

FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv LD_PRELOAD $INTEL_ROOT/lib/libmpi_mt.so"

should be modified as follows:

FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv LD_PRELOAD $INTEL_ROOT/lib/release_mt/libmpi.so"

The library file location and name changed for Intel 2019 Update 5. Fixing this will remove the following error message:

ERROR: ld.so: object '/opt/intel/parallel_studio_xe_2019/compilers_and_libraries_2019/linux/mpi/intel64//lib/libmpi_mt.so' from LD_PR ELOAD cannot be preloaded: ignored

 

I recommend backing up the MPI wrapper script before any modification:

cp /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl.ORIG

Once these steps are completed, your ANSYS Fluent installation is properly modified to support EFA.

 

Run your first ANSYS Fluent job using EFA

You are almost ready to run your first ANSYS Fluent job using EFA. You can use the same submission script used previously. Export INTELMPI_ROOT or OPENMPI_ROOT to specify the custom MPI library to use.

 

The following script demonstrates this step:

#!/bin/bash

#SBATCH -J Fluent
#SBATCH -o Fluent."%j".out

module load intelmpi
export INTELMPI_ROOT=/opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/

export ANSYSLI_SERVERS=<port>@<your-license-server>
export ANSYSLMD_LICENSE_FILE=<port>@<your-license-server>

basedir="/fsx"
workdir="${basedir}/$(date "+%d-%m-%Y-%H-%M")-${SLURM_NPROCS}-$RANDOM"
mkdir "${workdir}"
cd "${workdir}"
cp "${basedir}/f1_racecar_140m.tar.gz" .
tar xzvf f1_racecar_140m.tar.gz
rm -f f1_racecar_140m.tar.gz
cd bench/fluent/v6/f1_racecar_140m/cas_dat

srun -l /bin/hostname | sort -n | awk '{print $2}' > hostfile
${basedir}/ansys_inc/v195/fluent/bin/fluentbench.pl f1_racecar_140m -t${SLURM_NPROCS} -cnf=hostfile -part=1 -nosyslog -noloadchk -ssh -mpi=intel -cflush

Save this snippet as fluent-run-efa.sh under /fsx and run it as follows:

sbatch -n 2304 /fsx/fluent-run-efa.sh

 

Note 1: The number of cores, 2304, is an example; this command tells AWS ParallelCluster to spin up 64 c5n.18xlarge instances. Feel free to change it and run it as you wish.

Note 2: You may want to copy the benchmark file f1_racecar_140m.tar.gz, or any other dataset you want to use, to S3 so that it is preloaded on Amazon FSx and ready for you to use.
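The arithmetic behind Note 1 is straightforward; here is a quick sketch (assuming hyperthreading is disabled, so each c5n.18xlarge exposes 36 physical cores):

```python
# Each c5n.18xlarge has 72 vCPUs; with disable_hyperthreading = true,
# 36 physical cores are usable per instance.
PHYSICAL_CORES_PER_INSTANCE = 36

def instances_needed(total_cores: int) -> int:
    """Number of c5n.18xlarge instances AWS ParallelCluster must launch."""
    # Round up in case the core count is not an exact multiple of 36.
    return -(-total_cores // PHYSICAL_CORES_PER_INSTANCE)

print(instances_needed(2304))  # -> 64
```

The same formula gives the 84 instances used for the 3024-core benchmark runs discussed below.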

 

Performance and cost considerations

Now I will show benchmark results (in terms of rating and scaling efficiency) and cost per job (only EC2 instance costs are considered). The following graph shows the scaling curve of C5n.18xlarge + EFA vs. C5.18xlarge vs. ideal scalability.

 

The Formula-1 race car used for this benchmark is a 140-M cell mesh. The range of 70k-100k cells per core optimizes cost for performance. Improvement in turnaround time continues up to 40,000 cells per core with an acceptable cost for the performance. C5n.18xlarge + EFA shows ~89% scaling efficiency at 3024 cores. This is a great improvement compared to the C5.18xlarge scaling (48% at 3024 cores). In both cases, I ran with hyperthreading disabled, using up to 84 instances in total.

Scaling vs number of cores graph
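As a sanity check on those cells-per-core figures, the decomposition arithmetic for the 140-M cell mesh works out as follows:

```python
MESH_CELLS = 140_000_000  # Formula-1 race car benchmark mesh

def cells_per_core(cores: int) -> float:
    """Cells assigned to each core when the mesh is decomposed."""
    return MESH_CELLS / cores

# At 3024 cores the mesh decomposes to ~46,300 cells per core,
# still above the ~40,000 cells-per-core threshold quoted above.
print(round(cells_per_core(3024)))  # -> 46296

# Largest core count that keeps at least 40,000 cells per core:
print(MESH_CELLS // 40_000)  # -> 3500
```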

ANSYS has published some results of this benchmark here. The plot below shows the "Rating" of a Cray XC50 and C5n.18xlarge + EFA. In ANSYS' own words, the rating is defined as: "the primary metric used to report performance results of the Fluent Benchmarks. It is defined as the number of benchmarks that can be run on a given machine (in sequence) in a 24 hour period. It is computed by dividing the number of seconds in a day (86,400 seconds) by the number of seconds required to run the benchmark. A higher rating means faster performance."
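That definition translates into a one-liner; the benchmark times below are hypothetical, chosen only to illustrate the formula:

```python
SECONDS_PER_DAY = 86_400

def fluent_rating(benchmark_seconds: float) -> float:
    """ANSYS Fluent benchmark rating: benchmark runs per 24-hour period."""
    return SECONDS_PER_DAY / benchmark_seconds

# A hypothetical run that completes in one hour rates 24;
# halving the runtime doubles the rating.
print(fluent_rating(3600))  # -> 24.0
print(fluent_rating(1800))  # -> 48.0
```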

 

The plot below shows that C5n.18xlarge + EFA has a higher rating than the XC50 up to ~2400 cores, and is on par with it up to ~3800 cores.

In addition to turnaround time improvements, EFA brings another primary advantage: cost reduction. At the moment, C5n.18xlarge costs 27% more than C5.18xlarge (EFA is available at no additional cost). This price difference reflects the 4x higher network performance (100 Gbps vs 25 Gbps) and 33% larger memory footprint (192 GB vs 144 GB). The following chart shows the cost comparison between C5.18xlarge and C5n.18xlarge + EFA as I scale out for the ANSYS benchmark run.

 

cost per run vs number of cores

Please note that the chart above shows the cost per job using the On-Demand (OD) price in US East (N. Virginia). For short jobs (lasting minutes or even hours), you may want to consider using the EC2 Spot price instead. Spot Instances offer spare EC2 instances at steep discounts. At the moment I am writing this blog post, the C5n.18xlarge Spot price in N. Virginia is 70% lower than the On-Demand price: a significant price reduction.
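To make the cost comparison concrete, here is a sketch of the arithmetic. The On-Demand prices below are the us-east-1 figures consistent with the 27% premium quoted above; treat them, and the 70% Spot discount, as illustrative snapshots rather than current prices:

```python
C5_OD_PRICE = 3.06    # USD/hour, c5.18xlarge, us-east-1 (illustrative)
C5N_OD_PRICE = 3.888  # USD/hour, c5n.18xlarge, us-east-1 (illustrative)
SPOT_DISCOUNT = 0.70  # Spot discount quoted above (illustrative)

def job_cost(price_per_hour: float, instances: int, hours: float) -> float:
    """EC2-only cost of a job running on a fixed fleet for a fixed time."""
    return price_per_hour * instances * hours

# A hypothetical one-hour job on 64 instances:
print(round(job_cost(C5N_OD_PRICE, 64, 1.0), 2))                        # On-Demand
print(round(job_cost(C5N_OD_PRICE * (1 - SPOT_DISCOUNT), 64, 1.0), 2))  # Spot
print(round(C5N_OD_PRICE / C5_OD_PRICE - 1, 2))                         # -> 0.27
```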

 

Conclusions

This blog post reviewed best practices for running ANSYS Fluent with EFA, and walked through the performance and cost benefits of EFA running a 140-M cell ANSYS Fluent benchmark. Computational Fluid Dynamics and tightly coupled workloads involve an iterative process of tuning, refining, testing, and benchmarking. Many variables can affect performance of these workloads, so the AWS HPC team is committed to documenting best practices for MPI workloads on AWS.

I would also love to hear from you. Please let us know about other applications you would like us to test and features you would like to request.

Orchestrating an application process with AWS Batch using AWS CloudFormation

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/orchestrating-an-application-process-with-aws-batch-using-aws-cloudformation/

This post is written by Sivasubramanian Ramani

In many real-world applications, you can use custom Docker images with AWS Batch and AWS CloudFormation to execute complex jobs efficiently.

 

This post provides a file processing implementation using Docker images and Amazon S3, AWS Lambda, Amazon DynamoDB, and AWS Batch. In this scenario, the user uploads a CSV file into an Amazon S3 bucket, which is processed by AWS Batch as a job. These jobs can be packaged as Docker containers and are executed using Amazon EC2 and Amazon ECS.

 

The following steps provide an overview of this implementation:

  1. AWS CloudFormation template launches the S3 bucket that stores the CSV files.
  2. The Amazon S3 file event notification executes an AWS Lambda function that starts an AWS Batch job.
  3. AWS Batch executes the job as a Docker container.
  4. A Python-based program reads the contents of the S3 bucket, parses each row, and updates an Amazon DynamoDB table.
  5. Amazon DynamoDB stores each processed row from the CSV.
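A minimal sketch of the Lambda function in step 2, which maps the S3 event to an AWS Batch job submission. The job queue and job definition names match the stack resources listed later in this post; the environment variable names passed to the container are assumptions for illustration, and the actual boto3 submit_job call is left commented out so the snippet stays self-contained:

```python
def build_submit_args(event: dict) -> dict:
    """Map an S3 put-event to AWS Batch submit_job parameters."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {
        "jobName": "batch-processing-job",
        "jobQueue": "BatchProcessingJobQueue",  # created by the stack
        "jobDefinition": "BatchJobDefinition",  # created by the stack
        "containerOverrides": {
            # Passed to the container as environment variables so the
            # Python program knows which CSV file to read.
            # Variable names are illustrative assumptions.
            "environment": [
                {"name": "BucketName", "value": bucket},
                {"name": "FileName", "value": key},
            ]
        },
    }

def handler(event, context):
    args = build_submit_args(event)
    # import boto3
    # boto3.client("batch").submit_job(**args)
    return args
```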

 Prerequisites 

 

Walkthrough

The following steps outline this walkthrough. Detailed steps are given through the course of the material.

  1. Run the CloudFormation template (command provided) to create the necessary infrastructure.
  2. Set up the Docker image for the job:
    1. Build a Docker image.
    2. Tag the build and push the image to the repository.
  3. Drop the CSV file into the S3 bucket (copy and paste the provided contents into a sample CSV file).
  4. Confirm that the job runs and performs the operation based on the pushed container image. The job parses the CSV file and adds each row into DynamoDB.
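Step 4, parsing CSV rows into DynamoDB items, reduces to a few lines of Python. The column names below are placeholders, not the sample file's actual header:

```python
import csv
import io

def rows_to_items(csv_text: str) -> list:
    """Turn CSV content into a list of DynamoDB-ready item dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One item per row; DictReader uses the header row for the keys.
    return [dict(row) for row in reader]

# Placeholder columns for illustration:
sample = "id,name,score\n1,alpha,10\n2,beta,20\n"
items = rows_to_items(sample)
print(len(items))        # -> 2
print(items[0]["name"])  # -> alpha
```

In the real job, each of these dicts would be written to the DynamoDB table with a put_item call.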

 

Points to consider

  • The provided AWS CloudFormation template has all the services (refer to the upcoming diagram) needed for this walkthrough in a single template. In an ideal production scenario, you might split them into different templates for easier maintenance.

 

  • As part of this walkthrough, you use the optimal instance allocation for AWS Batch. The a1.medium instance is a less expensive instance type introduced for batch operations, but you can use any AWS Batch-capable instance type according to your needs.

 

  • To handle a higher volume of CSV file contents, you can do multithreaded or multiprocessing programming to complement the AWS Batch performance scale.

 

Deploying the AWS CloudFormation template

 

When deployed, the AWS CloudFormation template creates the following infrastructure.

An application process using AWS Batch

 

You can download the source from the GitHub location. The following steps use the downloaded code, which contains the CloudFormation template that spins up the infrastructure, a Python application (.py file), and a sample CSV file. You can optionally use the git command below to clone the repository. This becomes your SOURCE_REPOSITORY.

$ git clone https://github.com/aws-samples/aws-batch-processing-job-repo

$ cd aws-batch-processing-job-repo

$ aws cloudformation create-stack --stack-name batch-processing-job --template-body file://template/template.yaml --capabilities CAPABILITY_NAMED_IAM

 

When the preceding CloudFormation stack is created successfully, take a moment to identify the major components.

 

The CloudFormation stack spins up the following resources, which can be viewed in the AWS Management Console.

  1. CloudFormation Stack Name – batch-processing-job
  2. S3 Bucket Name – batch-processing-job-<YourAccountNumber>
    1. After the sample CSV file is dropped into this bucket, the process should kick start.
  3. JobDefinition – BatchJobDefinition
  4. JobQueue – BatchProcessingJobQueue
  5. Lambda – LambdaInvokeFunction
  6. DynamoDB – batch-processing-job
  7. Amazon CloudWatch Log – This is created when the first execution is made.
    1. /aws/batch/job
    2. /aws/lambda/LambdaInvokeFunction
  8. CodeCommit – batch-processing-job-repo
  9. CodeBuild – batch-processing-job-build

 

Once the above CloudFormation stack is complete in your account, you need to containerize the sample Python application and push it to Amazon ECR. This can be done in one of two ways:

 

Option A: CI/CD implementation.

As you may notice, CodeCommit and CodeBuild resources were created as part of the stack above. The CodeCommit URL can be found in the AWS Console under CodeCommit. With this option, you copy the contents of the downloaded source repository into your CodeCommit repository; a deployment is triggered as soon as the code is checked in. Your CodeCommit repository will be similar to "https://git-codecommit.us-east-1.amazonaws.com/v1/repos/batch-processing-job-repo".

 

The following steps clone your CodeCommit repository and push the changes to it:

  1. $ git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/batch-processing-job-repo
  2. $ cd batch-processing-job-repo
  3. Copy all the contents from SOURCE_REPOSITORY (from step 1) and paste them inside this folder
  4. $ git add .
  5. $ git commit -m "commit from source"
  6. $ git push

 

As soon as the code is checked in to your CodeCommit repo, a build is triggered, and the Docker image built from the Python source is pushed to Amazon ECR.

 

Option B: Pushing the Docker image to your repository manually in your local desktop

 

Optionally, you can build the Docker image locally and push it to the repository. The following commands build the Docker image from the provided Python code file and push the image to your Amazon ECR repository. Make sure to replace <YourAccountNumber> with your information. The following sample CLI command uses the us-west-2 Region. If you change the Region, make sure to replace the Region values in the get-login, docker tag, and push commands, also.

 

#Get login credentials by copying and pasting the following into the command line

$ aws ecr get-login --region us-west-2 --no-include-email

# Build the Docker image.

$ docker build -t batch_processor .

# Tag the image to your repository.

$ docker tag batch_processor <YourAccountNumber>.dkr.ecr.us-west-2.amazonaws.com/batch-processing-job-repository

# Push your image to the repository.

$ docker push <YourAccountNumber>.dkr.ecr.us-west-2.amazonaws.com/batch-processing-job-repository

 

Navigate to the AWS Management Console, and verify that you can see the image in the Amazon ECR section (on the AWS Management Console screen).

 

Testing

  • In AWS Console, select “CloudFormation”. Select the S3 bucket that was created as part of the stack. This will be something like – batch-processing-job-<YourAccountNumber>
  • Drop the sample CSV file provided as part of the SOURCE_REPOSITORY into the bucket

 

Code cleanup

To clean up, delete the contents of the Amazon S3 bucket and Amazon ECR repository.

 

In the AWS Management Console, navigate to your CloudFormation stack “batch-processing-job” and delete it.

 

Alternatively, run this AWS CLI command to delete the stack:

$ aws cloudformation delete-stack --stack-name batch-processing-job

 

Conclusion 

You were able to launch an application process involving AWS Batch to integrate with various AWS services. Depending on your application's scalability needs, AWS Batch can process jobs both faster and more cost efficiently. I also provided a Python script, CloudFormation template, and a sample CSV file in the corresponding GitHub repo that takes care of all the preceding CLI arguments for you to build out the job definitions.

 

I encourage you to test this example and see for yourself how this overall orchestration works with AWS Batch. Then, it is just a matter of replacing your Python (or any other programming language framework) code, packaging it as a Docker container, and letting AWS Batch handle the process efficiently.

 

If you decide to give it a try, have any doubt, or want to let me know what you think about the post, please leave a comment!

 


About the Author

Sivasubramanian Ramani (Siva Ramani) is a Sr Cloud Application Architect at AWS. His expertise is in application optimization, serverless solutions and using Microsoft application workloads with AWS.

New – Provisioned Concurrency for Lambda Functions

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-provisioned-concurrency-for-lambda-functions/

It’s really true that time flies, especially when you don’t have to think about servers: AWS Lambda just turned 5 years old and the team is always looking for new ways to help customers build and run applications in an easier way.

As more mission critical applications move to serverless, customers need more control over the performance of their applications. Today we are launching Provisioned Concurrency, a feature that keeps functions initialized and hyper-ready to respond in double-digit milliseconds. This is ideal for implementing interactive services, such as web and mobile backends, latency-sensitive microservices, or synchronous APIs.

When you invoke a Lambda function, the invocation is routed to an execution environment to process the request. When a function has not been used for some time, when you need to process more concurrent invocations, or when you update a function, new execution environments are created. The creation of an execution environment takes care of installing the function code and starting the runtime. Depending on the size of your deployment package, and the initialization time of the runtime and of your code, this can introduce latency for the invocations that are routed to a new execution environment. This latency is usually referred to as a “cold start”. For most applications this additional latency is not a problem. For some applications, however, this latency may not be acceptable.

When you enable Provisioned Concurrency for a function, the Lambda service will initialize the requested number of execution environments so they can be ready to respond to invocations.

Configuring Provisioned Concurrency
I create two Lambda functions that use the same Java code and can be triggered by Amazon API Gateway. To simulate a production workload, these functions are repeating some mathematical computation 10 million times in the initialization phase and 200,000 times for each invocation. The computation is using java.Math.Random and conditions (if ...) to avoid compiler optimizations (such as “unlooping” the iterations). Each function has 1GB of memory and the size of the code is 1.7MB.

I want to enable Provisioned Concurrency only for one of the two functions, so that I can compare how they react to a similar workload. In the Lambda console, I select one of the functions. In the configuration tab, I see the new Provisioned Concurrency settings.

I select Add configuration. Provisioned Concurrency can be enabled for a specific Lambda function version or alias (you can’t use $LATEST). You can have different settings for each version of a function. Using an alias makes it easier to apply these settings to the correct version of your function. In my case, I select the alias live that I keep updated to the latest version using the AWS SAM AutoPublishAlias function preference. For the Provisioned Concurrency, I enter 500 and Save.

Now, the Provisioned Concurrency configuration is in progress. The execution environments are being prepared to serve concurrent incoming requests based on my input. During this time the function remains available and continues to serve traffic.

After a few minutes, the concurrency is ready. With these settings, up to 500 concurrent requests will find an execution environment ready to process them. If I go above that, the usual scaling of Lambda functions still applies.

To generate some load, I use an Amazon Elastic Compute Cloud (EC2) instance in the same region. To keep it simple, I use the ab tool bundled with the Apache HTTP Server to call the two API endpoints 10,000 times with a concurrency of 500. Since these are new functions, I expect that:

  • For the function with Provisioned Concurrency enabled and set to 500, my requests are managed by pre-initialized execution environments.
  • For the other function, that has Provisioned Concurrency disabled, about 500 execution environments need to be provisioned, adding some latency to the same number of invocations, about 5% of the total.

One cool feature of the ab tool is that it reports the percentage of the requests served within a certain time. That is a very good way to look at API latency, as described in this post on Serverless Latency by Tim Bray.

Here are the results for the function with Provisioned Concurrency disabled:

Percentage of the requests served within a certain time (ms)
50% 351
66% 359
75% 383
80% 396
90% 435
95% 1357
98% 1619
99% 1657
100% 1923 (longest request)

Looking at these numbers, I see that 50% of the requests are served within 351ms, 66% of the requests within 359ms, and so on. It’s clear that something happens when I look at 95% or more of the requests: the time suddenly increases by about a second.

These are the results for the function with Provisioned Concurrency enabled:

Percentage of the requests served within a certain time (ms)
50% 352
66% 368
75% 382
80% 387
90% 400
95% 415
98% 447
99% 513
100% 593 (longest request)

Let’s compare those numbers in a graph.
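The same comparison can be made numerically; the latency values below are copied from the two tables above:

```python
percentiles = [50, 66, 75, 80, 90, 95, 98, 99, 100]
without_pc = [351, 359, 383, 396, 435, 1357, 1619, 1657, 1923]  # PC disabled
with_pc    = [352, 368, 382, 387, 400, 415, 447, 513, 593]      # PC enabled

# Latency added at each percentile without Provisioned Concurrency.
for p, cold, warm in zip(percentiles, without_pc, with_pc):
    print(f"p{p}: {cold - warm:+} ms")

# The gap only opens up in the tail: +942 ms at p95, +1330 ms at p100.
```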

As expected for my test workload, I see a big difference in the response time of the slowest 5% of the requests (between 95% and 100%), where the function with Provisioned Concurrency disabled shows the latency added by the creation of new execution environments and the (slow) initialization in my function code.

In general, the amount of latency added depends on the runtime you use, the size of your code, and the initialization required by your code to be ready for a first invocation. As a result, the added latency can be more, or less, than what I experienced here.

The number of invocations affected by this additional latency depends on how often the Lambda service needs to create new execution environments. Usually that happens when the number of concurrent invocations increases beyond what already provisioned, or when you deploy a new version of a function.

A small percentage of slow response times (generally referred to as tail latency) really makes a difference in end user experience. Over an extended period of time, most users are affected during some of their interactions. With Provisioned Concurrency enabled, user experience is much more stable.

Provisioned Concurrency is a Lambda feature and works with any trigger. For example, you can use it with WebSockets APIs, GraphQL resolvers, or IoT Rules. This feature gives you more control when building serverless applications that require low latency, such as web and mobile apps, games, or any service that is part of a complex transaction.

Available Now
Provisioned Concurrency can be configured using the console, the AWS Command Line Interface (CLI), or AWS SDKs for new or existing Lambda functions, and is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), Middle East (Bahrain), and South America (São Paulo).

You can use the AWS Serverless Application Model (SAM) and SAM CLI to test, deploy and manage serverless applications that use Provisioned Concurrency.

With Application Auto Scaling you can automate configuring the required concurrency for your functions. As policies, Target Tracking and Scheduled Scaling are supported. Using these policies, you can automatically increase the amount of concurrency during times of high demand and decrease it when the demand decreases.

You can also use Provisioned Concurrency today with AWS Partner tools, including configuring Provisioned Concurrency settings with the Serverless Framework and Terraform, or viewing metrics with Datadog, Epsagon, Lumigo, New Relic, SignalFx, SumoLogic, and Thundra.

You only pay for the amount of concurrency that you configure and for the period of time that you configure it. Pricing in US East (N. Virginia) is $0.015 per GB-hour for Provisioned Concurrency and $0.035 per GB-hour for Duration. The number of requests is charged at the same rate as normal functions. You can find more information in the Lambda pricing page.
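Using the rates above, the Provisioned Concurrency charge for my test configuration (500 provisioned environments, 1 GB each) works out as follows; request and Duration charges are excluded for brevity:

```python
PC_RATE = 0.015    # USD per GB-hour for Provisioned Concurrency (us-east-1)
MEMORY_GB = 1.0    # each test function is configured with 1 GB
CONCURRENCY = 500  # provisioned execution environments

hourly = PC_RATE * MEMORY_GB * CONCURRENCY
print(f"${hourly:.2f}/hour")  # -> $7.50/hour
print(f"${hourly * 8:.2f} for an 8-hour business day")
```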

This new feature enables developers to use Lambda for a variety of workloads that require highly consistent latency. Let me know what you are going to use it for!

Danilo

Decoupled Serverless Scheduler To Run HPC Applications At Scale on EC2

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/decoupled-serverless-scheduler-to-run-hpc-applications-at-scale-on-ec2/

This post is written by Ludvig Nordstrom and Mark Duffield | on November 27, 2019

In this blog post, we dive in to a cloud native approach for running HPC applications at scale on EC2 Spot Instances, using a decoupled serverless scheduler. This architecture is ideal for many workloads in the HPC and EDA industries, and can be used for any batch job workload.

At the end of this blog post, you will have two takeaways.

  1. A highly scalable environment that can run on hundreds of thousands of cores across EC2 Spot Instances.
  2. A fully serverless architecture for job orchestration.

We discuss deploying and running a pre-built serverless job scheduler that can run both Windows and Linux applications using any executable file format for your application. This environment provides high performance, scalability, cost efficiency, and fault tolerance. We introduce best practices and benefits to creating this environment, and cover the architecture, running jobs, and integration in to existing environments.

A quick note about the term cloud native: we use the term loosely in this blog. Here, cloud native means we use AWS Services (including serverless and microservices) to build out our compute environment, instead of a traditional lift-and-shift method.

Let’s get started!

 

Solution overview

This blog goes over the deployment process, which leverages AWS CloudFormation. This allows you to use infrastructure as code to automatically build out your environment. There are two parts to the solution: the Serverless Scheduler and Resource Automation. Below are quick summaries of each part of the solutions.

Part 1 – The serverless scheduler

This first part of the blog builds out a serverless workflow to get jobs from SQS and run them across EC2 instances. The CloudFormation template being used for Part 1 is serverless-scheduler-app.template, and here is the Reference Architecture:

 


    Figure 1: Serverless Scheduler Reference Architecture (grayed-out area is covered in Part 2).

Read the GitHub Repo if you want to look at the Step Functions workflow contained in preceding images. The walkthrough explains how the serverless application retrieves and runs jobs on its worker, updates DynamoDB job monitoring table, and manages the worker for its lifetime.

 

Part 2 – Resource automation with serverless scheduler


This part of the solution relies on the serverless scheduler built in Part 1 to run jobs on EC2. Part 2 simplifies submitting and monitoring jobs, and retrieving results for users. Jobs are spread across our cost-optimized Spot Instances. AWS Auto Scaling automatically scales up the compute resources when jobs are submitted, then terminates them when jobs are finished. Both of these save you money.

The CloudFormation template used in Part 2 is resource-automation.template. Building on Figure 1, the additional resources launched with Part 2 are noted in the following image, they are an S3 Bucket, AWS Autoscaling Group, and two Lambda functions.


 

Figure 2: Resource Automation using Serverless Scheduler

                               

Introduction to decoupled serverless scheduling

HPC schedulers traditionally run in a classic master and worker node configuration. A scheduler on the master node orchestrates jobs on worker nodes. This design has been successful for decades; however, many powerful schedulers are evolving to meet the demands of HPC workloads. This scheduler design grew out of the necessity to run orchestration logic on one machine, but there are now options to decouple this logic.

What are the possible benefits that decoupling this logic could bring? First, we avoid a number of shortfalls in the environment such as the need for all worker nodes to communicate with a single master node. This single source of communication limits scalability and creates a single point of failure. When we split the scheduler into decoupled components both these issues disappear.

Second, to work around these pain points, traditional schedulers had to build extremely complex logic that manages all workers concurrently in a single application. This stifled the ability to customize and improve the code, restricting changes to the software provider's engineering teams.

Serverless services, such as AWS Step Functions and AWS Lambda, fix these major issues. They allow you to decouple the scheduling logic into a one-to-one mapping with each worker, while sharing a single Amazon Simple Queue Service (SQS) job queue. We define our scheduling workflow in AWS Step Functions. The workflow then scales out to potentially thousands of state machines. These state machines act as wrappers around each worker node and manage each worker individually. Our code is less complex because we only consider one worker and its job.

We illustrate the differences between a traditional shared scheduler and decoupled serverless scheduler in Figures 3 and 4.

 


Figure 3: Traditional Scheduler Model

 


Figure 4: Decoupled Serverless Scheduler on each instance

 

Each decoupled serverless scheduler will:

  • Retrieve and pass jobs to its worker
  • Monitor its worker's health and take action if needed
  • Confirm job success by checking output logs and retry jobs if needed
  • Terminate the worker when the job queue is empty, just before terminating itself
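The per-worker loop in the bullets above can be sketched as a small decision function. This is an illustrative Python sketch, not the actual state machine definition; the state names (`fetch_next_job`, `terminate_worker`, and so on) are hypothetical stand-ins for Step Functions states:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkerState:
    healthy: bool
    queue_empty: bool
    job_output: Optional[str]   # output log of the last job, if one just ran
    retries_left: int

def next_action(state: WorkerState, success_string: str = "") -> str:
    """Decide the scheduler's next step for its single worker."""
    if not state.healthy:
        return "replace_worker"
    if state.job_output is not None:
        # Confirm success by searching the output log for the success string
        if success_string in state.job_output:
            return "fetch_next_job"
        return "retry_job" if state.retries_left > 0 else "mark_failed"
    if state.queue_empty:
        # Terminate the worker; the state machine then ends itself
        return "terminate_worker"
    return "fetch_next_job"
```

Because each function only reasons about one worker, the logic stays this small no matter how many workers run in parallel.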

This new scheduler model brings many benefits. Decoupling the scheduler into many small schedulers increases fault tolerance, because any issue affects only one worker. Additionally, each scheduler consists of independent AWS Lambda functions, which maintain state on separate hardware with retry logic built into the service. Scalability also increases because jobs are not dependent on a master node, which enables geographic distribution of jobs. This geographic distribution allows you to optimize use of low-cost Spot Instances. Decoupling the scheduler also decreases workflow complexity and lets you customize scheduler logic. You can leverage lower-latency job monitoring and customize automated responses to job events as they happen.

 

Benefits

  • Fully managed – With Part 2 (Resource Automation) deployed, resources for a job are managed for you. When a job is submitted, resources launch and run the job. When the job is done, worker nodes automatically shut down. This prevents you from incurring continuous costs.

 

  • Performance – Your application runs on EC2, which means you can choose any of the high performance instance types. Input files are automatically copied from Amazon S3 into local Amazon EC2 Instance Store for high performance storage during execution. Result files are automatically moved to S3 after each job finishes.

 

  • Scalability – A worker node combined with a scheduler state machine becomes a stateless entity. You can spin up as many of these entities as you want and point them at an SQS queue. You can even distribute worker and state machine pairs across multiple AWS Regions. These two components, paired with fully managed services, optimize your architecture for scalability to meet your desired number of workers.

 

  • Fault Tolerance – The solution is completely decoupled, which means each worker has its own state machine that handles scheduling for that worker. Likewise, each state machine is decoupled into the Lambda functions that make it up. Additionally, the scheduler workflow includes a Lambda function that confirms each successful job or resubmits failed jobs.

 

  • Cost Efficiency – This fault-tolerant environment is perfect for EC2 Spot Instances, which means you can save up to 90% on your workloads compared to On-Demand Instance pricing. The scheduler workflow ensures little to no idle time for workers by closely monitoring them and sending new jobs as jobs finish. Because the scheduler is serverless, you only incur costs for the resources required to launch and run jobs. Once the jobs are complete, all resources terminate automatically.

 

  • Agility – You can use AWS fully managed Developer Tools to quickly release changes and customize workflows. The reduced complexity of a decoupled scheduling workflow means that you don’t have to spend time managing a scheduling environment, and can instead focus on your applications.

 

 

Part 1 – serverless scheduler as a standalone solution

 

If you use the serverless scheduler as a standalone solution, you can build clusters and leverage shared storage such as FSx for Lustre, EFS, or S3. Additionally, you can use AWS CloudFormation to deploy more complex compute architectures that suit your application. The EC2 instances that run the serverless scheduler can therefore be launched in any number of ways. The scheduler only requires the instance ID and the SQS job queue name.

 

Submitting Jobs Directly to the serverless scheduler

The serverless scheduler app is a fully built AWS Step Functions workflow that pulls jobs from an SQS queue and runs them on an EC2 instance. The jobs submitted to SQS consist of an AWS Systems Manager Run Command, and work with any SSM document and command that you choose for your jobs. Examples of SSM documents are AWS-RunShellScript and AWS-RunPowerShellScript. Feel free to read more about Running Commands Using Systems Manager Run Command.

The following code shows the format of a job submitted to SQS in JSON.

  {
      "job_id": "jobId_0",
      "retry": "3",
      "job_success_string": " ",
      "ssm_document": "AWS-RunPowerShellScript",
      "commands":
          [
              "cd C:\\ProgramData\\Amazon\\SSM; mkdir Result",
              "Copy-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 -LocalFolder .\\",
              "C:\\ProgramData\\Amazon\\SSM\\jobId_0.bat",
              "Write-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 -Folder .\\Result\\"
          ]
  }
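To build and submit such a message programmatically, a small helper can assemble the JSON shown above. The `build_job` helper below is an illustrative sketch, not part of the solution's code; the boto3 send is left as a comment because it requires AWS credentials and a real queue URL:

```python
import json

def build_job(job_id, commands, retry=3, success_string=" ",
              ssm_document="AWS-RunPowerShellScript"):
    """Assemble a scheduler job message matching the JSON format above."""
    return {
        "job_id": job_id,
        "retry": str(retry),
        "job_success_string": success_string,
        "ssm_document": ssm_document,
        "commands": list(commands),
    }

# Sending the job (requires AWS credentials and a real queue URL):
# import boto3
# boto3.client("sqs").send_message(
#     QueueUrl=queue_url,
#     MessageBody=json.dumps(build_job("jobId_0", ["echo Hello World"])))
```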

 

Any EC2 instance associated with a serverless scheduler receives jobs from its designated SQS queue until the queue is empty. Then, the EC2 resource automatically terminates. If a job fails, it retries up to the number of times specified in the job definition. You can also include a specific string value that the scheduler searches for in the job execution output to confirm the successful completion of jobs.

 

Tagging EC2 workers to get a serverless scheduler state machine

In Part 1 of the deployment, you must manage your EC2 Instance launch and termination. When launching an EC2 Instance, tag it with a specific tag key that triggers a state machine to manage that instance. The tag value is the name of the SQS queue that you want your state machine to poll jobs from.

In the following example, “my-scheduler-cloudformation-stack-name” is the tag key that the serverless scheduler app looks for on any new EC2 instance that starts. Next, “my-sqs-job-queue-name” is the default job queue created with the scheduler. But you can change this value to any queue name you want the instance to retrieve jobs from when it is launched.

{"my-scheduler-cloudformation-stack-name":"my-sqs-job-queue-name"}
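If you launch workers programmatically, the tag can be attached at launch time. The `scheduler_tag_spec` helper below is a hypothetical convenience that builds the EC2 TagSpecifications structure for this key/value pair; the `run_instances` call is commented out because it needs AWS credentials and a real AMI:

```python
def scheduler_tag_spec(stack_name, queue_name):
    """TagSpecifications entry that points a new worker at a job queue."""
    return [{
        "ResourceType": "instance",
        "Tags": [{"Key": stack_name, "Value": queue_name}],
    }]

# Example launch call (requires AWS credentials, an AMI ID, and so on):
# import boto3
# boto3.client("ec2").run_instances(
#     ImageId=ami_id, InstanceType="c5.xlarge", MinCount=1, MaxCount=1,
#     TagSpecifications=scheduler_tag_spec(
#         "my-scheduler-cloudformation-stack-name", "my-sqs-job-queue-name"))
```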

 

Monitor jobs in DynamoDB

You can monitor job status in the following DynamoDB table. In the table you can find the job_id, the commands sent to Amazon EC2, job status, job output logs from Amazon EC2, and retries, among other things.

Alternatively, you can query DynamoDB for a given job_id via the AWS Command Line Interface:

aws dynamodb get-item --table-name job-monitoring \

                      --key '{"job_id": {"S": "/my-jobs/my-job-id.bat"}}'
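The same lookup in Python maps directly onto the CLI call above. The `job_key` helper name is illustrative; the boto3 call is commented out because it needs AWS credentials and the real table:

```python
def job_key(job_id):
    """Low-level DynamoDB key for a job, matching the CLI --key argument."""
    return {"job_id": {"S": job_id}}

# import boto3
# item = boto3.client("dynamodb").get_item(
#     TableName="job-monitoring", Key=job_key("/my-jobs/my-job-id.bat"))
```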

 

Using the “job_success_string” parameter

For the prior DynamoDB table, we submitted two identical jobs using an example script that you can also use. The command sent to the instance is “echo Hello World,” so the output from each job should be “Hello World.” We also allowed three job retries. In the following image, there are two jobs in the SQS queue before they ran. Look closely at the different “job_success_string” values and the identical command sent to both:

Example DynamoDB output showing job information for both jobs.

From the image we see that Job2 was successful and Job1 retried three times before being permanently labeled as failed. We forced this outcome to demonstrate how the job success string works: we submitted Job1 with a “job_success_string” of “Hello EVERYONE,” which does not appear in the job output “Hello World.” For Job2 we set “job_success_string” to “Hello” because we knew this string would be in the output log.

Job outputs commonly have text that only appears if job succeeded. You can also add this text yourself in your executable file. With “job_success_string,” you can confirm a job’s successful output, and use it to identify a certain value that you are looking for across jobs.
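The success-string check and the retry behavior from the Job1/Job2 example can be sketched in a few lines of Python. The helper names are hypothetical; the real solution implements this inside its Step Functions workflow:

```python
def job_succeeded(output_log: str, success_string: str) -> bool:
    """A job counts as successful when its output contains the success string."""
    return success_string in output_log

def run_with_retries(run_job, success_string, retries=3):
    """Run a job, retrying until the output log contains the success string."""
    for _attempt in range(1 + retries):
        if job_succeeded(run_job(), success_string):
            return "success"
    return "failed"
```

With the success string “Hello EVERYONE,” the job from the example above fails after all retries; with “Hello,” it succeeds on the first attempt.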

 

Part 2 – Resource Automation with the serverless scheduler

The additional services we deploy in Part 2 integrate with existing architectures to launch resources for your serverless scheduler. These services allow you to submit jobs simply by uploading input files and executable files to an S3 bucket.

Likewise, these additional resources can use any executable file format you want, including proprietary application level scripts. The solution automates everything else. This includes creating and submitting jobs to SQS job queue, spinning up compute resources when new jobs come in, and taking them back down when there are no jobs to run. When jobs are done, result files are copied to S3 for the user to retrieve. Similar to Part 1, you can still view the DynamoDB table for job status.

This architecture makes it easy to scale out to different teams and departments, and you can submit potentially hundreds of thousands of jobs while you remain in control of resources and cost.

 

Deeper Look at the S3 Architecture

The following diagram shows how you can submit jobs, monitor progress, and retrieve results. To submit jobs, upload all the needed input files and an executable script to S3. The suffix of the executable file (uploaded last) triggers an S3 event to start the process, and this suffix is configurable.

The S3 key of the executable file acts as the job ID, and is kept as a reference to that job in DynamoDB. A Lambda function (#2 in the following diagram) uses the S3 key of the executable to create three SSM Run Commands.

  1. Synchronize all files in the same S3 folder to a working directory on the EC2 Instance.
  2. Run the executable file on EC2 Instances within a specified working directory.
  3. Synchronize the EC2 Instances working directory back to the S3 bucket where newly generated result files are included.

This Lambda function (#2) then places the job on the SQS queue, using the scheduler's JSON-formatted job definition shown earlier.
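A sketch of how the three commands might be derived from the executable's S3 key. This is illustrative only; the exact sync paths and working directory in the deployed solution may differ:

```python
def commands_for_job(bucket, executable_key, workdir="/tmp/job"):
    """Derive the three SSM commands from the executable's S3 key:
    sync inputs down, run the executable, sync results back up."""
    folder, _, executable = executable_key.rpartition("/")
    return [
        f"aws s3 sync s3://{bucket}/{folder} {workdir}",
        f"cd {workdir} && ./{executable}",
        f"aws s3 sync {workdir} s3://{bucket}/{folder}",
    ]
```

Because the last command syncs the whole working directory back, any result files the executable wrote appear alongside the inputs in the same S3 folder.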

IMPORTANT: Each set of job files should be given a unique job folder in S3 or more files than needed might be moved to the EC2 Instance.

 


Figure 5: Resource Automation using Serverless Scheduler – A deeper look

 

The Step Functions workflow uses a Lambda function (#3 in the prior diagram) and the Auto Scaling group to scale out based on the number of jobs in the queue, up to the maximum number of workers (plus state machines) defined in the Auto Scaling group. When the job queue is empty, the number of running instances scales down to zero as the workers finish their remaining jobs.
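The scale-out rule amounts to tracking queue depth up to a cap. This is an assumption about the scaling logic, not the deployed Lambda's exact code:

```python
def desired_workers(visible_messages: int, max_workers: int) -> int:
    """Desired Auto Scaling capacity: one worker per queued job, capped at
    the group's maximum; an empty queue scales the group down to zero."""
    return min(visible_messages, max_workers)
```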

 

Process for Submitting Jobs and Retrieving Results

  1. As seen in step 1, upload input file(s) and an executable file into a unique job folder in S3 (such as /year/month/day/jobid/~job-files). Upload the executable file last, because it automatically starts the job. You can also use a script to upload multiple files at a time, but each job needs a unique directory. There are many ways to make S3 buckets available to users, including AWS Storage Gateway, AWS Transfer for SFTP, AWS DataSync, the AWS Management Console, or any of the AWS SDKs leveraging S3 API calls.
  2. You can monitor job status by accessing the DynamoDB table directly via the AWS Management Console or use the AWS CLI to call DynamoDB via an API call.
  3. As seen in step 5, you can retrieve result files for jobs from the same S3 directory where you left the input files. The DynamoDB table confirms when jobs are done. Applications that must automatically poll and retrieve results can use the SQS output queue.

You no longer need to create or access compute nodes as compute resources. These automatically scale up from zero when jobs come in, and then back down to zero when jobs are finished.

 

Deployment

Read the GitHub repo for deployment instructions. The following CloudFormation launch stacks can help:

Launch Stack links are available for the following AWS Regions: eu-north-1, ap-south-1, eu-west-3, eu-west-2, eu-west-1, ap-northeast-3, ap-northeast-2, ap-northeast-1, sa-east-1, ca-central-1, ap-southeast-1, ap-southeast-2, eu-central-1, us-east-1, us-east-2, us-west-1, and us-west-2.

 

 

Additional Points on Usage Patterns

 

  • While the two solutions in this blog are aimed at HPC applications, they can be used to run any batch jobs. Many customers that run large data processing batch jobs in their data lakes could use the serverless scheduler.

 

  • You can build pipelines of different applications when the output of one job triggers another to do something else – an example being pre-processing, meshing, simulation, post-processing. You simply deploy the Resource Automation template several times, and tailor it so that the output bucket for one step is the input bucket for the next step.

 

  • You might use the “job_success_string” parameter for iteration/verification in cases where a shotgun approach is needed to run thousands of jobs, and only one has a chance of producing the right result. In this case, the “job_success_string” identifies the successful job among the potentially hundreds of thousands pushed to the SQS job queue.

 

Scale-out across teams and departments

Because all of the services used are serverless, you can deploy as many run environments as needed without increasing overall costs. Serverless workloads only accumulate cost when the services are used. So, you could deploy ten job environments and run one job in each, and your costs would be the same as running ten jobs in one environment.

 

All you need is an S3 bucket to upload jobs to and an associated AMI that has the right applications and license configuration. Because a job configuration is passed to the scheduler at each job start, you can add new teams by creating an S3 bucket and pointing S3 events to a default Lambda function that pulls configurations for each job start.

 

Setup CI/CD pipeline to start continuous improvement of scheduler

If you are an advanced user, we encourage you to clone the Git repo and customize this solution. The serverless scheduler is less complex than other schedulers, because you only think about one worker and the run of one job.

Ways you could tailor this solution:

  • Add intelligent job scheduling using Amazon SageMaker – It is hard to find data as ready for ML as log data, because every job you run has different run times and resource consumption. So, you could tailor this solution to use ML to predict the best instance for a workload when it is submitted.
  • Add Custom Licensing Checkout Logic – Simply add one Lambda function to your Step Functions workflow to make an API call to a license server before continuing with one or more jobs. You can start a new worker when a license is checked out; if a license is not available, the instance can terminate so you don't pay for compute that is waiting on licenses.
  • Add Custom Metrics to DynamoDB – You can easily add metrics to DynamoDB because the solution already has baseline logging and monitoring capabilities.
  • Run on other AWS Services – There is a Lambda function in the Step Functions workflow called “Start_Job”. You can tailor this Lambda function to run your jobs on Amazon SageMaker, Amazon EMR, Amazon EKS, or Amazon ECS instead of EC2.

 

Conclusion

 

Although HPC workloads and EDA flows may still be dependent on current scheduling technologies, we illustrated the possibilities of decoupling your workloads from your existing shared scheduling environments. This post went deep into decoupled serverless scheduling, and we understand that it is difficult to unwind decades of dependencies. However, leveraging numerous AWS Services encourages you to think completely differently about running workloads.

But more importantly, it encourages you to Think Big. With this solution you can get up and running quickly, fail fast, and iterate. You can do this while scaling to your required number of resources, when you want them, and only pay for what you use.

Serverless computing catalyzes change across all industries, but that change is not yet obvious in the HPC and EDA industries. This solution is an opportunity for customers to take advantage of the nearly limitless capacity that AWS offers.

Please reach out with questions about HPC and EDA on AWS. You now have the architecture and the instructions to build your Serverless Decoupled Scheduling environment.  Go build!


About the Authors and Contributors

Authors 

 

Ludvig Nordstrom is a Senior Solutions Architect at AWS

 

 

 

 

Mark Duffield is a Tech Lead in Semiconductors at AWS

 

 

 

Contributors

 

Steve Engledow is a Senior Solutions Builder at AWS

 

 

 

 

Arun Thomas is a Senior Solutions Builder at AWS

 

 

Running Cost-effective queue workers with Amazon SQS and Amazon EC2 Spot Instances

Post Syndicated from peven original https://aws.amazon.com/blogs/compute/running-cost-effective-queue-workers-with-amazon-sqs-and-amazon-ec2-spot-instances/

This post is contributed by Ran Sheinberg | Sr. Solutions Architect, EC2 Spot & Chad Schmutzer | Principal Developer Advocate, EC2 Spot | Twitter: @schmutze

Introduction

Amazon Simple Queue Service (SQS) is used by customers to run decoupled workloads in the AWS Cloud as a best practice, in order to increase their applications’ resilience. You can use a worker tier to do background processing of images, audio, documents and so on, as well as offload long-running processes from the web tier. This blog post covers the benefits of pairing Amazon SQS and Spot Instances to maximize cost savings in the worker tier, and a customer success story.

Solution Overview

Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a common best practice to use Amazon SQS with decoupled applications. Amazon SQS increases application resilience by decoupling the direct communication between the frontend application and the worker tier that does data processing. If a worker node fails, the jobs that were running on that node return to the Amazon SQS queue for a different node to pick up.

Both the frontend and worker tier can run on Spot Instances, which offer spare compute capacity at steep discounts compared to On-Demand Instances. Spot Instances optimize your costs on the AWS Cloud and scale your application’s throughput up to 10 times for the same budget. Spot Instances can be interrupted with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. These can include analytics, containerized workloads, high performance computing (HPC), stateless web servers, rendering, CI/CD, and queue worker nodes—which is the focus of this post.

The worker tier of a decoupled application is typically fault-tolerant, so it is a prime candidate for running on interruptible capacity. Pairing Amazon SQS with worker nodes on Spot Instances allows for more robust, cost-optimized applications.

By using EC2 Auto Scaling groups with multiple instance types that you configured as suitable for your application (for example, m4.xlarge, m5.xlarge, c5.xlarge, and c4.xlarge, in multiple Availability Zones), you can spread the worker tier’s compute capacity across many Spot capacity pools (a combination of instance type and Availability Zone). This increases the chance of achieving the scale that’s required for the worker tier to ingest messages from the queue, and of keeping that scale when Spot Instance interruptions occur, while selecting the lowest-priced Spot Instances in each availability zone.

You can also choose the capacity-optimized allocation strategy for the Spot Instances in your Auto Scaling group. This strategy automatically selects instances that have a lower chance of interruption, which decreases the chances of restarting jobs due to Spot interruptions. When Spot Instances are interrupted, your Auto Scaling group automatically replenishes the capacity from a different Spot capacity pool in order to achieve your desired capacity. Read the blog post “Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances” for more details on how to choose the suitable allocation strategy.

We focus on three main points in this blog:

  1. Best practices for using Spot Instances with Amazon SQS
  2. A customer example that uses these components
  3. Example solution that can help you get you started quickly

Application of Amazon SQS with Spot Instances

Amazon SQS eliminates the complexity of managing and operating message-oriented middleware. Using Amazon SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available. Amazon SQS is a fully managed service which allows you to set up a queue in seconds. It also allows you to use your preferred SDK to start writing and reading to and from the queue within minutes.

In the following example, we describe an AWS architecture that brings together the Amazon SQS queue and an EC2 Auto Scaling group running Spot Instances. The architecture is used for decoupling the worker tier from the web tier by using Amazon SQS. The example uses the Protect feature (which we will explain later in this post) to ensure that an instance currently processing a job does not get terminated by the Auto Scaling group when it detects that a scale-in activity is required due to a Dynamic Scaling Policy.

AWS reference architecture used for decoupling the worker tier from the web tier by using Amazon SQS

Customer Example: How Trax Retail uses Auto Scaling groups with Spot Instances in their Amazon SQS application

Trax decided to run its queue worker tier exclusively on Spot Instances, due to the fault-tolerant nature of its architecture and for cost-optimization purposes. The company digitizes the physical world of retail using computer vision. Its ‘Trax Factory’ transforms images of individual shelves into data and insights about retail store conditions.

Built using asynchronous event-driven architecture, Trax Factory is a cluster of microservices in which the completion of one service triggers the activation of another service. The worker tier uses Auto Scaling groups with dynamic scaling policies to increase and decrease the number of worker nodes in the worker tier.

You can create a Dynamic Scaling Policy by doing the following:

  1. Observe an Amazon CloudWatch metric. Watch the metric for the current number of messages in the Amazon SQS queue (ApproximateNumberOfMessagesVisible).
  2. Create a CloudWatch alarm. This alarm should be based on that metric you created in the prior step.
  3. Use your CloudWatch alarm in a Dynamic Scaling Policy. Use this policy to increase and decrease the number of EC2 instances in the Auto Scaling group.
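The effect of a pair of scale-out/scale-in alarms can be sketched as a single decision function. The thresholds here are placeholders, not values from any real configuration:

```python
def scaling_action(visible_messages, scale_out_threshold=100,
                   scale_in_threshold=10):
    """Map the ApproximateNumberOfMessagesVisible metric onto a scaling
    action, the way a pair of CloudWatch alarms would."""
    if visible_messages > scale_out_threshold:
        return "scale_out"
    if visible_messages < scale_in_threshold:
        return "scale_in"
    return "no_change"
```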

In Trax’s case, due to the high variability of the number of messages in the queue, they opted to enhance this approach to minimize the time it takes to scale: they built a service that calls the SQS API to check the current number of messages in the queue more frequently, instead of waiting for the 5-minute metric refresh interval in CloudWatch.

Trax ensures that its applications are always scaled to meet the demand by leveraging the inherent elasticity of Amazon EC2 instances. This elasticity ensures that end users are never affected and/or service-level agreements (SLA) are never violated.

With a Dynamic Scaling Policy, the Auto Scaling group can detect when the number of messages in the queue has decreased, so that it can initiate a scale-in activity. The Auto Scaling group uses its configured termination policy for selecting the instances to be terminated. However, this policy poses the risk that the Auto Scaling group might select an instance for termination while that instance is currently processing an image. That instance’s work would be lost (although the image would eventually be processed by reappearing in the queue and getting picked up by another worker node).

To decrease this risk, you can use Auto Scaling groups instance protection. This means that every time an instance fetches a job from the queue, it also sends an API call to EC2 to protect itself from scale-in. The Auto Scaling group does not select the protected, working instance for termination until the instance finishes processing the job and calls the API to remove the protection.
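The protect-while-processing pattern can be sketched as follows. The helper takes the fetch, process, and protection operations as plain callables so the AWS calls stay out of the sketch; in practice, `set_protection` would wrap the set-instance-protection API call shown in the logs later in this post:

```python
def process_with_protection(fetch_job, process, set_protection):
    """Fetch one job and process it while the instance is protected from
    scale-in; always remove protection afterward, even if processing fails."""
    job = fetch_job()
    if job is None:            # queue was empty: nothing to protect
        return None
    set_protection(True)       # e.g. aws autoscaling set-instance-protection
    try:
        return process(job)
    finally:
        set_protection(False)  # instance may be selected for scale-in again
```

The `finally` block matters: if processing raises, protection is still removed, so a stuck worker cannot pin the Auto Scaling group at its current size forever.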

Handling Spot Instance interruptions

This instance-protection solution ensures that no work is lost during scale-in activities. However, protecting from scale-in does not work when an instance is marked for termination due to Spot Instance interruptions. These interruptions occur when there’s increased demand for On-Demand Instances in the same capacity pool (a combination of an instance type in an Availability Zone).

Applications can minimize the impact of a Spot Instance interruption. To do so, an application catches the two-minute interruption notification (available in the instance’s metadata), and instructs itself to stop fetching jobs from the queue. If there’s an image still being processed when the two minutes expire and the instance is terminated, the application does not delete the message from the queue after finishing the process. Instead, the message simply becomes visible again for another instance to pick up and process after the Amazon SQS visibility timeout expires.

Alternatively, you can release any ongoing job back to the queue upon receiving a Spot Instance interruption notification by setting the visibility timeout of the specific message to 0. This timeout potentially decreases the total time it takes to process the message.
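Releasing in-flight messages on an interruption notice uses the real `change_message_visibility` SQS API; the queue URL and receipt handles below are placeholders:

```python
def handle_interruption(sqs_client, queue_url, in_flight_receipt_handles):
    """On a Spot interruption notice: stop fetching new jobs and make any
    in-flight messages immediately visible to other workers by setting
    their visibility timeout to 0."""
    for receipt_handle in in_flight_receipt_handles:
        sqs_client.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=0,
        )
```

In production, `sqs_client` would be `boto3.client("sqs")`, and this function would run from the handler that watches the instance metadata for the two-minute interruption notice.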

Testing the solution

If you’re not currently using Spot Instances in your queue worker tier, we suggest testing the approach described in this post.

For that purpose, we built a simple solution to demonstrate the capabilities mentioned in this post, using an AWS CloudFormation template. The stack includes an Amazon Simple Storage Service (S3) bucket with a CloudWatch trigger to push notifications to an SQS queue after an image is uploaded to the Amazon S3 bucket. Once the message is in the queue, it is picked up by the application running on the EC2 instances in the Auto Scaling group. Then, the image is converted to PDF, and the instance is protected from scale-in for as long as it has an active processing job.

To see the solution in action, deploy the CloudFormation template. Then upload an image to the Amazon S3 bucket. In the Auto Scaling Groups console, check the instance protection status on the Instances tab. The protection status is shown in the following screenshot.

instance protection status in console

You can also see the application logs using CloudWatch Logs:

/usr/local/bin/convert-worker.sh: Found 1 messages in https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB

/usr/local/bin/convert-worker.sh: Found work to convert. Details: INPUT=Capture1.PNG, FNAME=capture1, FEXT=png

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --protected-from-scale-in

/usr/local/bin/convert-worker.sh: Convert done. Copying to S3 and cleaning up

/usr/local/bin/convert-worker.sh: Running: aws s3 cp /tmp/capture1.pdf s3://qtest-s3bucket-18fdpm2j17wxx

/usr/local/bin/convert-worker.sh: Running: aws sqs --output=json delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB --receipt-handle

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --no-protected-from-scale-in

Conclusion

This post helps you architect fault tolerant worker tiers in a cost optimized way. If your queue worker tiers are fault tolerant and use the built-in Amazon SQS features, you can increase your application’s resilience and take advantage of Spot Instances to save up to 90% on compute costs.

In this post, we emphasized several best practices to help get you started saving money using Amazon SQS and Spot Instances. The main best practices are:

  • Diversifying your Spot Instances using Auto Scaling groups, and selecting the right Spot allocation strategy
  • Protecting instances from scale-in activities while they process jobs
  • Using the Spot interruption notification so that the application stops polling the queue for new jobs before the instance is terminated

We hope you found this post useful. If you’re not using Spot Instances in your queue worker tier, we suggest testing the approach described here. Finally, we would like to thank the Trax team for sharing its architecture and best practices. If you want to learn more, watch the “This is my architecture” video featuring Trax and their solution.

We’d love your feedback—please comment and let me know what you think.


About the authors

 

Ran Sheinberg is a specialist solutions architect for EC2 Spot Instances with Amazon Web Services. He works with AWS customers on cost optimizing their compute spend by utilizing Spot Instances across different types of workloads: stateless web applications, queue workers, containerized workloads, analytics, HPC and others.

 

 

 

 

As a Principal Developer Advocate for EC2 Spot at AWS, Chad’s job is to make sure our customers are saving at scale by using EC2 Spot Instances to take advantage of the most cost-effective way to purchase compute capacity. Follow him on Twitter here! @schmutze

 

Deploying a highly available WordPress site on Amazon Lightsail, Part 4: Increasing performance and scalability with a Lightsail load balancer

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/deploying-a-highly-available-wordpress-site-on-amazon-lightsail-part-4-increasing-performance-and-scalability-with-a-lightsail-load-balancer/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail | Twitter: @mikegcoleman

This is the final post in a series about getting a highly available WordPress site up and running on Amazon Lightsail. For reference, the other blog posts are:

  1. Implementing a highly available Lightsail database with WordPress
  2. Using Amazon S3 with WordPress to securely deliver media files
  3. Increasing security and performance using Amazon CloudFront

In this post, you’ll learn how to stand up a Lightsail load balancer, create a snapshot of your WordPress server, deploy new instances from those snapshots, and place the instances behind your load balancer.

A load balancer accepts incoming web traffic and routes it to one or (usually) more servers. Having multiple servers allows you to scale the number of incoming requests your site can handle, as well as allowing your site to remain responsive if a web server fails. The following diagram shows the solution architecture, which features multiple front-end WordPress servers behind a Lightsail load balancer, a highly available Lightsail database, and uses S3 alongside CloudFront to deliver your media content securely.

Graphic showing the final HA architecture including 3 servers behind a load balancer, S3 alongside cloudfront, and a highly-available database

Prerequisites

This post assumes you built your WordPress site by following the previous posts in this series.

Configuring SSL requires a registered domain name and sufficient permissions to create DNS records for that domain.

You don’t need AWS or Lightsail to manage your domain, but this post uses Lightsail’s DNS management. For more information, see Creating a DNS zone to manage your domain’s DNS records in Amazon Lightsail.

Deploying the load balancer and configuring SSL

To deploy a Lightsail load balancer and configure it to support SSL, complete the following steps:

  1. Open the Lightsail console.
  2. From the menu, choose Networking.
  3. Choose Create Load Balancer.
  4. For Identify your load balancer, enter a name for your load balancer.

This post names the load balancer wp-lb.

  1. Choose Create Load Balancer.

The details page for your new load balancer opens. From here, add your initial WordPress server to the load balancer and configure SSL.

  1. For Target instances, choose your WordPress server.

The following screenshot indicates that this post chooses the server WordPress-1.

screenshot showing an instance being selected from the drop down

 

  1. Choose Attach.

It can take a few seconds for your instance to attach to the load balancer and the Health Check to report as Passed. See the following screenshot of the Health Check status.

Picture of health check status
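For reference, the create-and-attach steps above map to two Lightsail API calls. The sketch below only writes the equivalent CLI commands to a script for review; running the script itself requires a configured AWS CLI, and wp-lb and WordPress-1 are this post's example names.

```shell
# Write the Lightsail CLI equivalents of the console steps to a reviewable
# script. wp-lb and WordPress-1 are this post's example names; run the script
# only with a configured AWS CLI and the appropriate permissions.
cat > create-lb.sh <<'EOF'
#!/bin/sh
aws lightsail create-load-balancer --load-balancer-name wp-lb --instance-port 80
aws lightsail attach-instances-to-load-balancer --load-balancer-name wp-lb \
    --instance-names WordPress-1
EOF
chmod +x create-lb.sh
grep -c '^aws lightsail' create-lb.sh
```

The load balancer listens on instance port 80 here; HTTPS is handled at the load balancer once you attach the certificate created in the following steps.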

  1. From the menu, choose Inbound traffic.
  2. Under Certificates, choose Create certificate.
  3. For PRIMARY DOMAIN, enter the domain name that you want to use to reach your WordPress site.

You can either accept the default certificate name Lightsail creates or change it to something you prefer. This post uses www.mikegcoleman.com.

  1. Choose Create.

The following screenshot shows the Create a certificate section.

Picture of the Create a Certificate section

Creating a CNAME record

As you did with CloudFront, you need to create a CNAME record as a way of validating that you have ownership of the domain for which you are creating a certificate.

  1. Copy the random characters and the subdomain from the name field.

The following screenshot shows the example record information.

screenshot showing the portion of the name value that needs to be copied

  1. Open a second copy of the Lightsail console in another tab or window.
  2. Choose Networking.
  3. Choose your domain name.
  4. Choose Add record.
  5. From the drop-down menu, choose CNAME record.

The following screenshot shows the drop-down menu options.

Screenshot showing CNAME record selected in the dropdown box

  1. For Subdomain, enter the random characters and subdomain you copied from the load balancer page.
  2. Return to the load balancer page.
  3. Copy the entire Value.
  4. Return to the DNS page.
  5. For Maps to, enter the value string.
  6. Choose the green check box.

The following screenshot shows the CNAME record details.

Screenshot showing the completed cname record and green checkbox highlighted

  1. Return to the load balancer page and wait a few minutes before refreshing the page.

You should see a notification that the certificate is verified and ready to use. This process can take several minutes; continue refreshing the page every few minutes until the status updates. The following screenshot shows the notification message.

Screenshot showing the verification complete message for the load balancer ssl certificate

  1. In the HTTPS box, select your certificate from the drop-down menu.

The following screenshot shows the HTTPS box.

screenshot showing the newly validated certificate in the drop down box

  1. Copy the DNS name for your load balancer.

The following screenshot shows the example DNS name.

Screenshot showing the DNS name of the load balancer

  1. Return to the Lightsail DNS console and, following the same CNAME-creation process you used to validate the certificate, create a CNAME record that maps your website address to the load balancer’s DNS name.

Use the subdomain you chose for your WordPress server (in this post, that’s www) and the DNS name you copied for the Maps to field.

The following screenshot shows the CNAME record details.

screenshot showing the completed cname record for the load balancer

Updating the wp-config file

The last step to configure SSL is updating the wp-config file to configure WordPress to deliver content over SSL.

  1. Start an SSH session with your WordPress server.
  2. Copy and paste the following code into the terminal window to create a temporary file that holds the configuration string that will be added to the WordPress configuration file.
cat <<EOT >> ssl_config.txt
if (\$_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https') \$_SERVER['HTTPS']='on';
EOT
  1. Copy and paste the following sed command into your terminal window to add the SSL line to the configuration file.
sed -i "/define( 'WP_DEBUG', false );/r ssl_config.txt" \
/home/bitnami/apps/wordpress/htdocs/wp-config.php
  1. The sed command rewrites the configuration file, which changes its ownership, so you need to reset it. See the following code:
sudo chown bitnami:daemon /home/bitnami/apps/wordpress/htdocs/wp-config.php
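If you want to verify what the sed command does before touching the live file, you can rehearse it against a minimal stand-in (the file names below are throwaway examples). Because the heredoc delimiters are quoted here, the $ characters don’t need the backslash escaping used above.

```shell
# Rehearse the sed insertion against a minimal stand-in for wp-config.php.
cat > sample-wp-config.php <<'EOF'
<?php
define( 'WP_DEBUG', false );
EOF
cat > ssl_config.txt <<'EOF'
if ($_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https') $_SERVER['HTTPS']='on';
EOF
# sed's r command appends the file's contents after each matching line.
sed -i "/define( 'WP_DEBUG', false );/r ssl_config.txt" sample-wp-config.php
grep -c 'HTTP_X_FORWARDED_PROTO' sample-wp-config.php
```

The grep count of 1 confirms the HTTPS line landed directly after the WP_DEBUG line, which is exactly what the live command does to your real wp-config.php.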

You also need to update two variables that WordPress uses to identify which address is used to access your site.

  1. Update the WP_SITEURL variable (be sure to specify https) by running the following command in the terminal window:
wp config set WP_SITEURL https://<your wordpress site domain name>

For example, this post uses the following code:

wp config set WP_SITEURL https://www.mikegcoleman.com

You should get a response that the variable updated.

  1. Update the WP_HOME variable (be sure to specify https) by issuing the following command in the terminal window:
wp config set WP_HOME https://<your wordpress site domain name>

For example, this post uses the following code:

wp config set WP_HOME https://www.mikegcoleman.com

You should get a response that the variable updated.

  1. Restart the WordPress server to read the new configuration with the following code:
sudo /opt/bitnami/ctlscript.sh restart

After the server has restarted, visit the DNS name for your WordPress site. The site should load and your browser should report the connection is secure.

You can now finish any customization of your WordPress site, such as adding additional plugins, setting the blog name, or changing the theme.

Scaling your WordPress servers

With your WordPress server fully configured, the last step is to create additional instances and place them behind the load balancer so that if one of your WordPress servers fails, your site is still reachable. An added benefit is that your site is more scalable because there are additional servers to handle incoming requests.

Complete the following steps:

  1. On the Lightsail console, choose the name of your WordPress server.
  2. Choose Snapshots.
  3. For Create instance snapshot, enter the name of your snapshot.

This post uses the name WordPress-version-1. See the following screenshot of your snapshot details.

Screenshot of the snapshot creation dialog

  1. Choose Create snapshot.

It can take a few minutes for the snapshot creation process to finish.

  1. Click the three-dot menu icon to the right of your snapshot name and choose Create new instance.

The following screenshot shows the Recent snapshots section.

Screenshot showing the location of the three dot menu

To provide the highest level of redundancy, deploy each of your WordPress servers into a different Availability Zone within the same region. By default, the first server was placed in zone A; place the subsequent servers in two different zones (B and C would be good choices). For more information, see Regions and Availability Zones.

  1. For Instance location, choose Change AWS Region and Availability Zone.
  2. Choose Change your Availability Zone.
  3. Choose an Availability Zone you have not used previously.

The following screenshot shows the Availability Zones to choose from.

screenshot showing availability zone b selected

  1. Give your instance a new name.

This post names the instance WordPress-2.

  1. Choose Create Instance.

You should have at least two WordPress server instances to provide a minimum degree of redundancy. To add more, repeat the snapshot-and-create process to create additional instances.

Return to the Lightsail console home page, and wait for your instances to report back Running.
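The snapshot-and-restore cycle can likewise be scripted. This sketch writes the equivalent Lightsail CLI calls to a file for review; the instance and snapshot names match this post’s examples, while the Availability Zone and bundle ID are placeholder assumptions you would set to match your own deployment.

```shell
# Lightsail CLI equivalents of the snapshot and restore steps, written to a
# script for review. The zone (us-east-1b) and bundle ID (medium_2_0) are
# placeholders; adjust them to your region and instance size before running
# with a configured AWS CLI.
cat > scale-wp.sh <<'EOF'
#!/bin/sh
aws lightsail create-instance-snapshot \
    --instance-name WordPress-1 \
    --instance-snapshot-name WordPress-version-1
aws lightsail create-instances-from-snapshot \
    --instance-snapshot-name WordPress-version-1 \
    --instance-names WordPress-2 \
    --availability-zone us-east-1b \
    --bundle-id medium_2_0
EOF
grep -c '^aws lightsail' scale-wp.sh
```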

Adding your instances to the load balancer

Now that you have your additional WordPress instances created, add them to the load balancer. This is the same process you followed previously to add the first instance:

  1. On the Lightsail console, choose Networking.
  2. Choose the load balancer you previously created.
  3. Choose Attach another.
  4. From the drop-down menu, select the name of your instance.

The following screenshot shows the available instances on the drop-down menu.

screenshot showing the WordPress instances in the load balancer drop down

  1. Choose Attach.
  2. Repeat the attach process for any additional instances.

When the instances report back Passed, your site is fully up and running.

Conclusion

You have configured your site to provide a good degree of redundancy and performance, while delivering your content over secure connections. S3 and CloudFront allow your site to deliver static content over a secured connection, while the Lightsail load balancer and database make sure your site can keep serving your customers in the event of a failure.

If you haven’t got one already, head on over and create a free AWS account, and start building your own WordPress site – or whatever else you might think of!

We love SQL Server running on AWS almost as much as our customers

Post Syndicated from Sandy Carter original https://aws.amazon.com/blogs/compute/we-love-sql-server-running-on-aws-almost-as-much-as-our-customers/

We love SQL Server running on AWS almost as much as our customers. Microsoft SQL Server 2019 became generally available on November 8, 2019, and is now available on AWS. More customers run SQL Server on AWS than on any other cloud, and they trust AWS for a number of reasons.

The first is performance. Recent benchmarks show that AWS delivers excellent price-performance for running SQL Server. ZK Research, drawing on independent testing results published by DBBest, points out that SQL Server on AWS consistently delivers price-performance more than two times better than Azure using HammerDB, a TPC-C-like benchmark tool. Furthermore, we offer fast, high-throughput storage options with Amazon EBS and local NVMe instance storage. We know that better application performance is critical for your customers’ satisfaction. In fact, excellent application performance leads to 39% higher* customer satisfaction, while poor performance may lead to damaged reputations or, even worse, customer attrition. To make sure you can deliver the best possible experience for your customers, we have focused on pushing the boundaries of performance.

For example, the value of running SQL Server on AWS is shown in Pearson’s migration story. Pearson is a British education publishing and assessment company serving schools and corporations, as well as students directly. Pearson owns educational media brands including Addison-Wesley, Prentice Hall, eCollege, and others. Schoolnet, one of their offerings, tests tens of millions of students and is used by tens of thousands of educators. Pearson migrated Schoolnet, an on-premises application that used SQL Server, to the AWS Cloud. The goal of the migration was to ensure that the high volume of daily tests would run efficiently and effectively. As many customers do, Pearson had built for worst-case demand, over-provisioning a massive infrastructure for potential (real!) peaks. When they moved to AWS, they not only saw cost efficiencies, but SQL Server was also far easier to manage and much faster. In fact, at our re:Invent conference, Ian Wright told the audience that the level 2 support desk received a wave of positive feedback after the migration, in particular because Schoolnet was running faster than ever before!

Second, customers need high availability for mission-critical applications written using SQL Server. We have the best global infrastructure for running workloads that require high availability. The AWS Global Infrastructure underpins SQL Server on AWS and spans 69 Availability Zones (AZs) within 22 geographic regions around the world. These AZs are designed for physical redundancy and provide resilience, which enables uninterrupted performance. In 2018, the next-largest cloud provider had almost seven times more downtime hours than AWS.

Third, while CIOs tell us that cost is not the key factor in their decision to move to the cloud (agility and innovation are usually at the center of their motivation), they are often impressed with the cost savings they see when bringing their SQL Server workloads to AWS. We typically see at least 20% savings with just a lift-and-shift. Over the first few months, you can continue to optimize your EC2 instances for an additional 10–20% savings. By adopting higher-level services, you can optimize further; many customers see savings of 60% or more.

For example, Axinom ran its Windows-based applications in an on-premises environment that made it difficult to scale to meet increasing user traffic. The company also wanted to boost scalability and cut costs. Axinom moved its applications, including Axinom CMS and Axinom DRM, to the AWS Cloud. The company runs its Microsoft SQL Server–based platform on AWS and Spot Instances to optimize costs. As Johannes Jauch, the Chief Technology Officer of Axinom, said, “We have cut costs for supporting our digital media supply chain services by 70 percent using AWS products such as Amazon Spot Instances. As a result, we can provide more competitive pricing for our global customers.” And Axinom is not the only customer. Customers like Salesforce, Adobe, and Decisiv are benefitting from increased productivity and agility running SQL Server on AWS. You can read more about how customers are unlocking maximum business value by migrating to AWS.

And finally, not only does AWS offer more security, compliance, and governance services and key features than the next-largest cloud provider, we also have the most migration experience.

All these benefits mean that the new features of SQL Server 2019, such as big data clusters, Always Encrypted with secure enclaves, and improvements to SQL Server on Linux, run better on AWS. We are also happy to announce that you can now launch EC2 instances that run Windows Server 2019/2016 and four editions of SQL Server 2019 (Web, Express, Standard, and Enterprise). The Amazon Machine Images (AMIs) are available today in all AWS Regions and run on a wide variety of EC2 instance types. You can launch these instances from the AWS Management Console, AWS CLI, or through AWS Marketplace. To get started with SQL Server 2019 on AWS, you can either purchase a License Included EC2 instance, or, if you have Software Assurance, you have two Bring Your Own License (BYOL) options! You can BYOL SQL Server on an AWS instance with license-included Windows, or BYOL SQL Server on a dedicated host with BYOL Windows (provided the Windows license was purchased before October 1, 2019).

Join the customers running SQL Server on AWS today! To learn more about SQL Server 2019 and to explore your licensing options, visit Microsoft SQL Server on AWS. If you need advice and guidance as you plan your migration effort, check out the AWS Partners who have qualified for the AWS Microsoft Workloads Competency and focus on database solutions. Please join me and the AWS team at AWS re:Invent (December 2–6 in Las Vegas).

*Source: Netmagic, https://www.netmagicsolutions.com/data/images/WP_How-End-User-Experience-Affects-Your-Bottom-Line16-08-231471935227.pdf

Deploying a Highly available WordPress site on Amazon Lightsail, Part 3: Increasing security and performance using Amazon CloudFront

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/deploying-a-highly-available-wordpress-site-on-amazon-lightsail-part-3-increasing-security-and-performance-using-amazon-cloudfront/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail | Twitter: @mikegcoleman

The previous posts in this series (Implementing a highly available Lightsail database with WordPress and Using Amazon S3 with WordPress to securely deliver media files) showed how to build a WordPress site and configure it to use Amazon S3 to serve your media assets. The next step is to use an Amazon CloudFront distribution to increase performance and add another level of security.

CloudFront is a content delivery network (CDN). It essentially caches content from your server on endpoints across the globe. When someone visits your site, they hit one of the CloudFront endpoints first. CloudFront determines if the requested content is cached. If so, CloudFront responds to the client with the requested information. If the content isn’t cached on the endpoint, it is loaded from the actual server and cached on CloudFront so subsequent requests can be served from the endpoint.

This process speeds up response time, because the endpoint is usually closer to the client than the actual server, and it reduces load on the server, because any request that CloudFront can handle is one less request your server needs to deal with.

In this post’s configuration, CloudFront only caches the media files stored in your S3 bucket, but CloudFront can cache more. For more information, see How to Accelerate Your WordPress Site with Amazon CloudFront.

Another benefit of CloudFront is that it responds to requests over HTTPS. Because some requests are served by the WordPress server and others from CloudFront, it’s important to secure both connections with HTTPS. Otherwise, the customer’s web browser shows a warning that the site is not secure. The next post in this series shows how to configure HTTPS for your WordPress server.

Solution overview

To configure CloudFront to work with your Lightsail WordPress site, complete the following steps:

  1. Request a certificate from AWS Certificate Manager.
  2. Create a CloudFront distribution.
  3. Configure the WP Offload Media Lite plugin to use the CloudFront distribution.

The following diagram shows the architecture of this solution.

Image showing WordPress architecture with Lightsail, CloudFront, S3, and a Lightsail database

Prerequisites

This post assumes that you built your WordPress site by following the previous posts in this series.

Configuring SSL requires that you have a registered domain name and sufficient permissions to create DNS records for that domain.

You don’t need AWS or Lightsail to manage your domain, but this post uses Lightsail’s DNS management. For more information, see Creating a DNS zone to manage your domain’s DNS records in Amazon Lightsail.

Creating the SSL certificate

This post uses two different subdomains for your WordPress site. One points to your main site, and the other points to your S3 bucket. For example, a customer visits https://www.example.com to access your site. When they click on a post that contains a media file, the post body loads from https://www.example.com, but the media file loads from https://media.example.com, as depicted in the previous graphic.

Create the SSL certificate for CloudFront to use with the S3 bucket with the following steps. The next post in this series shows how to create the SSL certificate for your WordPress server.

  1. Open the ACM console.
  2. Under Provision certificates, choose Get started.
  3. Choose Request a public certificate.
  4. Choose Request a certificate.
  5. For Domain name*, enter the name of the domain to serve your media files.

This post uses media.mikegcoleman.com.

  1. Choose Next.

The following screenshot shows the example domain name.

Screen shot showing domain name dialog

  1. Leave DNS validation selected.
  2. Choose Review.
  3. Choose Confirm and request.

ACM needs you to validate that you have the necessary privileges to request a certificate for the given domain by creating a special DNS record.

Choose the arrow next to the domain name to show the values for the record you need to create. See the following screenshot.

Screen shot showing verification record

You now need to use Lightsail’s DNS management to create the CNAME record to validate your certificate.

  1. In a new tab or window, open the Lightsail console.
  2. Choose Networking.
  3. Choose the domain name for the appropriate domain.
  4. Under DNS records, choose Add record.
  5. From the drop-down menu, choose CNAME record.

The following screenshot shows the menu options.

Screen shot showing CNAME record selected in dropdown box

  1. Navigate back to ACM.

Under Name, the value is formatted as randomcharacters.subdomain.domain.com.

  1. Copy the random characters and the subdomain.

For example, if the value was _f836d9f10c45c6a6fbe6ba89a884d9c4.media.mikegcoleman.com, you would copy _f836d9f10c45c6a6fbe6ba89a884d9c4.media.

  1. Return to the Lightsail DNS console.
  2. Under Subdomain, enter the value you copied.
  3. Return to the ACM console.
  4. For Value, copy the entire string.
  5. Return to the Lightsail DNS console.
  6. For Maps to, enter the value you copied.
  7. Choose the green checkmark.

The following screenshot shows the completed record details in the Lightsail DNS console.

screen shot showing where the green checkmark is and completed domain record
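The copy-and-trim in the steps above is mechanical, so it can also be scripted. Here is a sketch using this post’s example record, with shell parameter expansion stripping the zone suffix:

```shell
# Split an ACM validation record name into the Subdomain value that Lightsail
# expects. The record name and zone are this post's example values.
record_name="_f836d9f10c45c6a6fbe6ba89a884d9c4.media.mikegcoleman.com."
zone="mikegcoleman.com"
subdomain="${record_name%%.${zone}*}"   # drop the zone and the trailing dot
echo "$subdomain"                       # prints _f836d9f10c45c6a6fbe6ba89a884d9c4.media
```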

ACM periodically checks DNS to see if this record exists, and validates your certificate when it finds the record. The validation usually takes approximately 15 minutes; however, it could take up to 48 hours.

To track the validation, return to the ACM console. You can periodically refresh this page until you see the status field change to Issued. See the following screenshot.

Screenshot highlighting the updated status of the domain validation

Building the CloudFront distribution

Now that you have the certificate, you’re ready to create your CloudFront distribution. Complete the following steps:

  1. Open the CloudFront console.
  2. Choose Create distribution.
  3. Under Web, choose Get Started.
  4. For Origin Domain Name, select the S3 bucket you created for your WordPress media files.

The following screenshot shows the options available in the Origin Domain Name drop-down menu. This post chooses mike-wp-bucket.s3.amazonaws.com.

Screenshot of the S3 bucket selected from the Origin Domain Name dialog

By default, WordPress does not pass information that indicates when to clear an item in the cache; configuring that functionality is outside the scope of this post.

Because WordPress doesn’t send this information, you need to set a default time to live (TTL) in CloudFront.

As a starting point, set the value to 900 seconds (15 minutes). This means that if you load a post, and that post includes media, CloudFront checks the cache for that media. If the media is in the cache but has been there longer than 15 minutes, CloudFront requests the media again from the origin (your S3 bucket) and updates the cache.

While 15 minutes is a reasonable starting value for media, the optimal value for the TTL depends on how you want to balance delivering your clients the latest content with performance and cost.

  1. For Object Caching, choose Customize.
  2. For Default TTL, enter 900.

The following screenshot shows the TTL options.

Image of TTL options
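For reference, if you ever script the distribution instead of using the console, the same TTL lives in the DefaultCacheBehavior section of a CloudFront DistributionConfig. The fragment below is a sketch: the TargetOriginId value is this post’s example bucket, and the MinTTL and MaxTTL values are illustrative choices, not recommendations.

```shell
# Write the cache-behavior fragment of a CloudFront DistributionConfig.
# TargetOriginId is this post's example bucket; MinTTL and MaxTTL here are
# illustrative values, not recommendations.
cat > cache-behavior.json <<'EOF'
{
  "DefaultCacheBehavior": {
    "TargetOriginId": "S3-mike-wp-bucket",
    "ViewerProtocolPolicy": "redirect-to-https",
    "MinTTL": 0,
    "DefaultTTL": 900,
    "MaxTTL": 3600
  }
}
EOF
grep '"DefaultTTL"' cache-behavior.json
```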

CloudFront has endpoints across the globe, and the price you pay depends on the number of endpoints you have configured. If your traffic is localized to a certain geographic region, you may want to restrict which endpoints CloudFront uses.

  1. Under Distribution Settings, for Price Class, choose the appropriate setting.

This post chooses Use All Edge Locations. See the following screenshot. You should choose a setting that makes sense for your site. Choosing only a subset of price classes reduces costs.

Screenshot showing the choices for price classes

  1. For Alternate Domain Names (CNAMEs), enter the name of the subdomain for your S3 bucket.

This name is the same as the subdomain you created the certificate for in the previous steps. This post uses media.mikegcoleman.com.

Assign the certificate you created earlier.

  1. Choose Custom SSL Certificate.

When entering in the text field, the certificate you created earlier shows in the drop-down menu.

  1. Choose your certificate.

The following screenshot shows the drop-down menu with custom SSL certificates.

Screenshot showing the previously created SSL certificate being selected

  1. Choose Create Distribution.

It can take 15–30 minutes to configure the distribution.

The next step is to create a DNS record that points your media subdomain to the CloudFront distribution.

  1. Return to the CloudFront console and choose the ID of your distribution.

The following screenshot shows available distribution IDs.

screen shot showing the distribution ID highlighted

The page displays the distribution ID details.

  1. Copy the Domain Name.

The following screenshot shows the distribution ID details.

Screenshot showing the domain name highlighted in the distribution details

  1. Return to the Lightsail DNS console.
  2. Choose Add record.
  3. Choose CNAME record.
  4. For Subdomain, enter the subdomain you are using for the S3 bucket.

This post is using the domain media.mikegcoleman.com, so the value is media. 

  1. For Maps to, enter the domain name of the CloudFront distribution that you previously copied.

The following screenshot shows the CNAME record details.

Screenshot of the completed DNS CNAME record

  1. Choose the green check box.
  2. Return to the CloudFront console.

It can take 15–30 minutes for the distribution status to change to Deployed. See the following screenshot.

Screenshot of the updated CloudFront distribution status

Configuring the plugin

The final step is to configure the WP Offload Media Lite plugin to use the newly created CloudFront distribution. Complete the following steps:

  1. Log in to the admin dashboard for your WordPress site.
  2. Under Plugins, choose WP Offload Media Lite.
  3. Choose Settings.
  4. For Custom Domain (CNAME), select On.
  5. Enter the domain name for your S3 bucket.

This post uses media.mikegcoleman.com as an example.

  1. For Force HTTPS, select On.

This makes sure that all media is served over HTTPS.

  1. Choose Save Changes.

The following screenshot shows the URL REWRITING and ADVANCED OPTIONS details.

Screenshot of the cloud distribution network settings in the S3 plugin

You can load an image from your media library to confirm that everything is working correctly.

  1. Under Media, choose Library.
  2. Choose an image in your library.

If you don’t have an image in your library, add a new one now.

The URL listed should start with the domain you configured for your S3 bucket. For this post, the URL is media.mikegcoleman.com.

The following screenshot shows the image details.

Screenshot highlighting the URL of the S3 asset

To confirm that the image loads correctly, and there are no SSL errors or warnings, copy the URL value and paste it into your web browser.

Conclusion

You now have your media content served securely from CloudFront. The final post in this series demonstrates how to create multiple instances of your WordPress server and place those behind a Lightsail load balancer to increase performance and enhance availability.

If you haven’t got one already, head on over and create a free AWS account, and start building your own WordPress site – or whatever else you might think of!

Now Available: New C5d Instance Sizes and Bare Metal Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-new-c5d-instance-sizes-and-bare-metal-instances/

Amazon EC2 C5 instances are very popular for running compute-heavy workloads like batch processing, distributed analytics, high-performance computing, machine/deep learning inference, ad serving, highly scalable multiplayer gaming, and video encoding.

In 2018, we added blazing fast local NVMe storage, and named these new instances C5d. They are a great fit for applications that need access to high-speed, low-latency local storage, such as video encoding, image manipulation, and other forms of media processing. They also benefit applications that need temporary storage of data, such as batch and log processing, and applications that need caches and scratch files.

Just a few weeks ago, we launched new instance sizes and a bare metal option for C5 instances. Today, we are happy to add the same capabilities to the C5d family: 12xlarge, 24xlarge, and a bare metal option.

The new C5d instance sizes run on Intel’s Second Generation Xeon Scalable processors (code-named Cascade Lake) with a sustained all-core turbo frequency of 3.6 GHz and a maximum single-core turbo frequency of 3.9 GHz.

The new processors also enable a new feature called Intel Deep Learning Boost, a capability based on the AVX-512 instruction set. Thanks to the new Vector Neural Network Instructions (AVX-512 VNNI), deep learning frameworks will speed up typical machine learning operations like convolution, and automatically improve inference performance over a wide range of workloads.

These instances are based on the AWS Nitro System, with dedicated hardware accelerators for EBS processing (including crypto operations), the software-defined network inside of each Virtual Private Cloud (VPC), and ENA networking.

New Instance Sizes for C5d: 12xlarge and 24xlarge
Here are the specs:

Instance Name | Logical Processors | Memory  | Local Storage       | EBS-Optimized Bandwidth | Network Bandwidth
c5d.12xlarge  | 48                 | 96 GiB  | 2 x 900 GB NVMe SSD | 7 Gbps                  | 12 Gbps
c5d.24xlarge  | 96                 | 192 GiB | 4 x 900 GB NVMe SSD | 14 Gbps                 | 25 Gbps

Previously, the largest C5d instance available was c5d.18xlarge, with 72 logical processors, 144 GiB of memory, and 1.8 TB of storage. As you can see, the new 24xlarge size increases available resources by 33%, in order to help you crunch those super heavy workloads. Last but not least, customers also get 50% more NVMe storage per logical processor on both 12xlarge and 24xlarge, with up to 3.6 TB of local storage!
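Those percentages can be checked with quick arithmetic; the figures below come straight from the specs quoted above.

```shell
# Verify the storage-per-processor claim: c5d.18xlarge (1.8 TB over 72 logical
# processors) versus the new c5d.24xlarge (3.6 TB over 96 logical processors).
awk 'BEGIN {
  old = 1800 / 72                       # GB per logical processor, c5d.18xlarge
  new = 3600 / 96                       # GB per logical processor, c5d.24xlarge
  printf "%.1f %.1f %.0f%%\n", old, new, (new - old) / old * 100
}'
```

This prints 25.0 37.5 50%: 25 GB per logical processor before, 37.5 GB now, a 50% increase.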

Bare Metal C5d
As is the case with the existing bare metal instances (M5, M5d, R5, R5d, z1d, and so forth), your operating system runs on the underlying hardware and has direct access to processor and other hardware.

Bare metal instances can be used to run software with specific requirements, e.g. applications that are exclusively licensed for use on physical, non-virtualized hardware. These instances can also be used to run tools and applications that require access to low-level processor features such as performance counters.

Here are the specs:

Instance Name | Logical Processors | Memory  | Local Storage       | EBS-Optimized Bandwidth | Network Bandwidth
c5d.metal     | 96                 | 192 GiB | 4 x 900 GB NVMe SSD | 14 Gbps                 | 25 Gbps

Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

Now Available!
You can start using these new instances today in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Europe (London), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), South America (São Paulo), and AWS GovCloud (US-West).

Please send us feedback, either on the AWS forum for Amazon EC2, or through your usual AWS support contacts.

Julien;

Processing batch jobs quickly, cost-efficiently, and reliably with Amazon EC2 On-Demand and Spot Instances

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

 

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:

  • A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
  • An Amazon SQS queue to manage the tasks.
  • A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they’re complete.
  • Amazon EC2 Auto Scaling groups to model scenarios.
  • Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.
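The consumer loop at the heart of this environment can be sketched in a few lines of Python. This is an illustrative sketch, not the actual script used in the test: the queue URL, the simulated work function, and the injected SQS client (for example, `boto3.client("sqs")`) are all assumptions.

```python
def drain_queue(sqs, queue_url, work_fn):
    """Take incomplete tasks from the queue, run the work, and delete
    each task only after it completes. Returns the number processed."""
    processed = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        messages = resp.get("Messages", [])
        if not messages:
            # Queue is empty: the CloudWatch alarm can now scale the group in.
            return processed
        for msg in messages:
            work_fn(msg["Body"])  # the simulated eight-minute task
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            processed += 1
```

Because the SQS client is passed in rather than created inside the function, the same loop runs unchanged on EC2 with a real boto3 client or locally against a stub.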

 

Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, using 20 c5.2xlarge instances for five hours results in a cost of $0.384 per hour for each instance.  The batch total is $38.40.
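The sizing and cost arithmetic above can be checked with a short calculation. This is a sketch; the $0.384 hourly rate is the Ireland Region On-Demand price quoted above.

```python
import math

tasks = 6000
task_minutes = 8
deadline_hours = 5
vcpus_per_instance = 8   # c5.2xlarge
hourly_rate = 0.384      # USD per instance-hour, eu-west-1 On-Demand

# Total vCPU-hours of work, spread evenly over the deadline.
vcpus_needed = math.ceil(tasks * task_minutes / 60 / deadline_hours)
instances = math.ceil(vcpus_needed / vcpus_per_instance)
batch_cost = round(instances * deadline_hours * hourly_rate, 2)

print(vcpus_needed, instances, batch_cost)  # 160 20 38.4
```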

 

Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient capacity for On-Demand Instances to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and also allowing the batch to complete much earlier than it would otherwise. When the queue is empty it automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity by diversification.

Spot Instances occasionally experience interruptions when AWS must reclaim the capacity with a two-minute warning. You can handle this occurrence gracefully by configuring your batch processor code to react to the interruption, such as checkpointing progress to some data store. This example has the SQS visibility timeout on the SQS queue set to nine minutes, so SQS re-queues any task that doesn’t complete in that time.
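A minimal way to honor the two-minute warning is to check the instance metadata service between tasks. The sketch below is illustrative: on a real instance, `fetch` would issue an HTTP GET to the URL shown (which returns 404 until an interruption is scheduled), and `checkpoint` would persist progress to your data store; both are injected here so the logic can be exercised without an instance.

```python
import json

SPOT_ACTION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/instance-action")

def interruption_notice(fetch):
    """fetch(url) -> (status_code, body). Returns the parsed notice when a
    two-minute interruption warning has been posted, else None."""
    status, body = fetch(SPOT_ACTION_URL)
    if status == 404:
        return None
    return json.loads(body)

def run_task(task, fetch, checkpoint, work):
    """Checkpoint and stop if an interruption is pending; SQS re-queues the
    task after the nine-minute visibility timeout expires."""
    if interruption_notice(fetch) is not None:
        checkpoint(task)
        return "checkpointed"
    work(task)
    return "completed"
```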

To test the impact of the new configuration, another 6,000 tasks are submitted to the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.

 

Instance Type | Provisioning | Host Count | Hourly Instance Cost | Total 30-Minute Batch Cost
c5.18xlarge | Spot | 15 | $1.2367 | $9.2753
c5.2xlarge | Spot | 22 | $0.1547 | $1.7017
c5.4xlarge | Spot | 12 | $0.2772 | $1.6632
c5.9xlarge | Spot | 11 | $0.6239 | $3.4315
c5.2xlarge | On-Demand | 13 | $0.3840 | $2.4960
c5.4xlarge | On-Demand | 3 | $0.7680 | $1.1520
c5.9xlarge | On-Demand | 4 | $1.7280 | $3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.
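The totals in the table can be reproduced from the per-instance figures. This is a sketch using the hourly Spot and On-Demand prices listed above; the vCPU counts are the published sizes for the c5 family.

```python
# (instance type, host count, hourly price); every host ran for 30 minutes
fleet = [
    ("c5.18xlarge", 15, 1.2367), ("c5.2xlarge", 22, 0.1547),
    ("c5.4xlarge", 12, 0.2772), ("c5.9xlarge", 11, 0.6239),   # Spot
    ("c5.2xlarge", 13, 0.3840), ("c5.4xlarge", 3, 0.7680),
    ("c5.9xlarge", 4, 1.7280),                                # On-Demand
]
vcpus = {"c5.2xlarge": 8, "c5.4xlarge": 16,
         "c5.9xlarge": 36, "c5.18xlarge": 72}

total_cost = round(sum(n * price * 0.5 for _, n, price in fleet), 2)
total_vcpus = sum(n * vcpus[t] for t, n, _ in fleet)

print(total_cost, total_vcpus)  # 23.18 2140
```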

 

Summary

This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Migrating Azure VM to AWS using AWS SMS Connector for Azure

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/migrating-azure-vm-to-aws-using-aws-sms-connector-for-azure/

AWS SMS is an agentless service that facilitates and expedites the migration of your existing workloads to AWS. The service enables you to automate, schedule, and monitor incremental replications of active server volumes, which simplifies coordinating large-scale server migrations. Previously, you could only migrate virtual machines (VMs) running in VMware vSphere and Microsoft Hyper-V environments. Now, you can use the simplicity and ease of AWS Server Migration Service (SMS) to migrate virtual machines running on Microsoft Azure. You can discover Azure VMs, group them into applications, and migrate a group of applications as a single unit without having to coordinate the replication of individual servers or decouple application dependencies. SMS significantly reduces application migration time and decreases the risk of errors in the migration process.

 

This post takes you step-by-step through how to provision the SMS virtual machine on Microsoft Azure, discover the virtual machines in a Microsoft Azure subscription, create a replication job, and finally launch the instance on AWS.

 

1- Provisioning the SMS virtual machine

To provision your SMS virtual machine on Microsoft Azure, complete the following steps.

  1. Download three PowerShell scripts listed under Step 1 of Installing the Server Migration Connection on Azure.
File | URL
Installation script | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1
MD5 hash | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1.md5
SHA256 hash | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1.sha256

 

  2. To validate the integrity of the files, compare their checksums. You can use PowerShell 5.1 or newer.

 

2.1 To validate the MD5 hash of the aws-sms-azure-setup.ps1 script, run the following command and wait for an output similar to the following result:

Command to validate the MD5 hash of the aws-sms-azure-setup.ps1 script

2.2 To validate the SHA256 hash of the aws-sms-azure-setup.ps1 file, run the following command and wait for an output similar to the following result:

Command to validate the SHA256 hash of the aws-sms-azure-setup.ps1 file

2.3 Compare the returned values ​​by opening the aws-sms-azure-setup.ps1.md5 and aws-sms-azure-setup.ps1.sha256 files in your preferred text editor.
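Steps 2.1 through 2.3 can also be automated. The sketch below uses Python's hashlib in place of the PowerShell cmdlets; the file names mirror the downloads listed above, and the comparison assumes the digest is the first token in the published hash file.

```python
import hashlib

def file_digest(path, algorithm):
    """Stream the file through the named hash (e.g. 'md5' or 'sha256')."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published_hash(script_path, hash_file_path, algorithm):
    """Compare the computed digest with the first token of the hash file."""
    with open(hash_file_path) as f:
        published = f.read().split()[0].lower()
    return file_digest(script_path, algorithm) == published

# e.g. matches_published_hash("aws-sms-azure-setup.ps1",
#                             "aws-sms-azure-setup.ps1.sha256", "sha256")
```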

2.4 To validate if the PowerShell script has a valid Amazon Web Services signature, run the following command and wait for an output similar to the following result:

Command to validate that the PowerShell script has a valid Amazon Web Services signature

 

  3. Before running the script for provisioning the SMS virtual machine, you must have an Azure Virtual Network and an Azure Storage Account in which SMS temporarily stores metadata for the tasks it performs against the Microsoft Azure subscription. A good practice is to use the same Azure Virtual Network as the Azure virtual machines being migrated, since the SMS virtual machine uses REST API calls to communicate with AWS endpoints as well as the Azure cloud service. The SMS virtual machine does not need a public IP or inbound internet rules.

 

  4. Run the installation script: .\aws-sms-azure-setup.ps1

Screenshot of running the installation script

  5. Enter the names of the existing Storage Account and Azure Virtual Network in the subscription:

Screenshot of where to enter Storage Account Name and Azure Virtual Network

  6. The Microsoft Azure modules import into the local PowerShell session, and you receive a prompt for credentials to access the subscription.

Azure login credentials

  7. A summary of the created features appears, similar to the following:

Screenshot of created features

  8. Wait for the process to complete. It may take a few minutes:

screenshot of processing jobs

  9. After provisioning completes, the script outputs the Object Id of the System Assigned Identity and the private IP. Save this information, as it is used to register the connector with the SMS service in step 23.

Screenshot of the information to save

  10. To check the provisioned resources, log in to the Microsoft Azure Portal and select the Resource Group option. The provided AWS script created a role in Microsoft Azure IAM that allows the virtual machine to use the necessary services through REST API calls over HTTPS and to be authenticated via the Azure built-in Instance Metadata Service (IMDS).

Screenshot of provisioned resources log in Microsoft Azure Portal

  11. As a requirement, you need to create an IAM user with the permissions the SMS service needs to perform the migration. To do this, log in to your AWS account at https://aws.amazon.com/console, select IAM under Services, then select Users and click Add user.

Screenshot of AWS console. add user

 

  12. On the Add user page, enter a user name and check the Programmatic access option. Click Next: Permissions.

Screenshot of adding a username

  13. Attach the existing policy named ServerMigrationConnector. This policy allows the AWS connector to connect to and execute API requests against AWS. Click Next: Tags.

Adding policy ServerMigrationConnector

  14. Optionally, add tags to the user. Click Next: Review.

Screenshot of option to add tags to the user

  15. Click Create user and save the Access Key and Secret Access Key. This information is used during the AWS SMS Connector setup.

Create User and save the access key and secret access key

 

  16. From a computer that has access to the Azure Virtual Network, access the SMS virtual machine configuration using a browser and the private IP recorded earlier from the script output. In this example, the URL is https://10.0.0.4.

Screenshot of accessing the SMS Virtual Machine configuration

  17. On the main page of the SMS virtual machine, click Get Started Now.

Screenshot of the SMS virtual machine start page

  18. Read and accept the terms of the contract, then click Next.

Screenshot of accepting terms of contract

  19. Create a password to log in later to the connector management console, then click Next.

Screenshot of creating a password

  20. Review the network info and click Next.

Screenshot of reviewing the network info

  21. Choose whether to opt in to sending anonymous log data to AWS, then click Next.

Screenshot of option to add log data to AWS

  22. Insert the Access Key and Secret Access Key of the IAM user with the ServerMigrationConnector policy attached (created in steps 11 through 15). Also, select the Region in which the SMS endpoint will be used, then click Next.

Selet AWS Region, and Insert Access Key and Secret Key

  23. Enter the Object Id of the System Assigned Identity copied in step 9 and click Next.

Enter Object Id of System Assigned Identify

  24. Congratulations, you have successfully configured the Azure connector. Click Go to connector dashboard.

Screenshot of the successful configuration of the Azure connector

  25. Verify that the connector status is HEALTHY by clicking Connectors on the menu.

Screenshot of verifying that the connector status is healthy

 

2 – Replicating Azure Virtual Machines to Amazon Web Services

  1. Access the SMS console and go to the Servers option. Click Import Server Catalog or Re-Import Server Catalog if it has been previously executed.

Screenshot of SMS console and servers option

  2. Select the Azure Virtual Machines to be migrated and click Create Replication Job.

Screenshot of Azure virtual machines migration

  3. Select the type of licensing that best suits your environment:

– Auto (current licensing autodetection)

– AWS (license included)

– BYOL (bring your own license)

See options: https://aws.amazon.com/windows/resources/licensing/

Screenshot of best type of licensing for your environment

  4. Select the appropriate replication frequency, when the replication should start, and the IAM service role. You can leave the role blank, and the SMS service uses the built-in service role “sms”.

Screenshot of replication jobs and IAM service role

  5. A summary of the settings is displayed. Click Create.
    Screenshot of the summary of settings displayed
  6. In the SMS console, go to the Replication Jobs option and follow the replication job status:

Overview of replication jobs

  7. After completion, access the EC2 console and go to AMIs; the AMIs generated by SMS appear in this list. In the example below, several AMIs were generated because the replication frequency is one hour.

List of AMIs generated by SMS

  8. Now navigate to the SMS console, click Launch Instance, and follow the on-screen process for creating a new Amazon EC2 instance.

SMS console and Launch Instance screenshot

 

3 – Conclusion

This solution provides a simple, agentless, non-intrusive approach to migration with the AWS Server Migration Service.

 

For more about Windows Workloads on AWS go to:  http://aws.amazon.com/windows

 

About the Author

Photo of the Author

 

 

Marcio Morales is a Senior Solution Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on running their Microsoft workloads on AWS.

Optimizing deep learning on P3 and P3dn with EFA

Post Syndicated from whiteemm original https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa/

This post is written by Rashika Kheria, Software Engineer, Purna Sanyal, Senior Solutions Architect, Strategic Account and James Jeun, Sr. Product Manager

The Amazon EC2 P3dn.24xlarge instance is the latest addition to the Amazon EC2 P3 instance family, with upgrades to several components. This high-end size of the P3 family allows users to scale out to multiple nodes for distributed workloads more efficiently.  With these improvements to the instance, you can complete training jobs in a shorter amount of time and iterate on your Machine Learning (ML) models faster.

 

This blog reviews the significant upgrades with p3dn.24xlarge, walks you through deployment, and shows an example ML use case for these upgrades.

 

Overview of P3dn instance upgrades

The most notable upgrades to the p3dn.24xlarge instance are the 100-Gbps network bandwidth and the new EFA network interface, which allows for highly scalable internode communication. This means you can scale applications to use thousands of GPUs, reducing the time to get results. EFA’s operating system bypass networking mechanism and the underlying Scalable Reliable Datagram (SRD) protocol are built into the Nitro controllers. The Nitro controllers enable a low-latency, low-jitter channel for inter-instance communication. EFA support has been adopted in the mainline Linux kernel and integrated with libfabric and various distributions. AWS worked with NVIDIA for EFA to support the NVIDIA Collective Communications Library (NCCL). NCCL optimizes multi-GPU and multi-node communication primitives and helps achieve high throughput over NVLink interconnects.

 

The following diagram shows the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.

the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.

 

The following table summarizes the full set of differences between p3.16xlarge and p3dn.24xlarge.

Feature | p3.16xl | p3dn.24xl
Processor | Intel Xeon E5-2686 v4 | Intel Skylake 8175 (w/ AVX-512)
vCPUs | 64 | 96
GPU | 8x 16 GB NVIDIA Tesla V100 | 8x 32 GB NVIDIA Tesla V100
RAM | 488 GB | 768 GB
Network | 25 Gbps ENA | 100 Gbps ENA + EFA
GPU Interconnect | NVLink – 300 GB/s | NVLink – 300 GB/s

 

P3dn.24xl offers more networking bandwidth than p3.16xl. Paired with EFA’s communication library, this feature increases scaling efficiencies drastically for large-scale, distributed training jobs. Other improvements include double the GPU memory for large datasets and batch sizes, increased system memory, and more vCPUs. This upgraded instance is the most performant GPU compute option on AWS.

 

The upgrades also improve your workload around distributed deep learning. The GPU memory improvement enables higher intranode batch sizes. The newer Layer-wise Adaptive Rate Scaling (LARS) has been tested with ResNet50 and other deep neural networks (DNNs) to allow for larger batch sizes. The increased batch sizes reduce wall-clock time per epoch with minimal loss of accuracy. Additionally, using 100-Gbps networking with EFA heightens performance with scale. Greater networking performance is beneficial when updating weights for a large number of parameters. You can see high scaling efficiency when running distributed training on GPUs for ResNet50 type models that primarily use images for object recognition. For more information, see Scalable multi-node deep learning training using GPUs in the AWS Cloud.

 

Natural language processing (NLP) also presents large compute requirements for model training. This large compute requirement is especially present with the arrival of large Transformer-based models like BERT and GPT-2, which have up to a billion parameters. The following describes how to set up distributed model trainings with scalability for both image and language-based models, and also notes how the AWS P3 and P3dn instances perform.

 

Optimizing your P3 family

First, optimize your P3 instances with an important environment update. This update applies to traditional TCP-based networking and is included in NCCL 2.4.8, the latest release as of this writing.

 

Two new environmental variables are available, which allow you to take advantage of multiple TCP sockets per thread: NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD.

 

These environmental variables allow the NCCL backend to exceed the 10-Gbps TCP single stream bandwidth limitation in EC2.

 

Enter the following command:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_NSOCKS_PERTHREAD=4 -x NCCL_SOCKET_NTHREADS=4 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

 

The following graph shows the synthetic NCCL tests and their increased performance with the additional directives.

synthetic NCCL tests and their increased performance with the additional directives

You can achieve a two-fold increase in throughput after a threshold in the synthetic payload size (around 1 MB).

 

 

Deploying P3dn

 

The following steps walk you through spinning up a cluster of p3dn.24xlarge instances in a cluster placement group. This allows you to take advantage of all the new performance features within the P3 instance family. For more information, see Cluster Placement Groups in the Amazon EC2 User Guide.

This post deploys the following stack:

 

  1. On the Amazon EC2 console, create a security group.

 Make sure that both inbound and outbound traffic are open on all ports and protocols within the security group.

 

  2. Modify the user variables in the packer build script so that they are compatible with your environment.

The following is the modification code for your variables:

 

{
  "variables": {
    "Region": "us-west-2",
    "flag": "compute",
    "subnet_id": "<subnet-id>",
    "sg_id": "<security_group>",
    "build_ami": "ami-0e434a58221275ed4",
    "iam_role": "<iam_role>",
    "ssh_key_name": "<keyname>",
    "key_path": "/path/to/key.pem"
  },

3. Build and launch the AMI by running the following packer command:

packer build nvidia-efa-fsx-al2.yml

This entire workflow takes care of setting up EFA, compiling NCCL, and installing the toolchain. After building, you have an AMI ID that you can launch in the EC2 console. Make sure to enable EFA when launching.

  4. Launch a second instance in the cluster placement group so you can run two-node tests.
  5. Enter the following code to make sure that all components are built correctly:

/opt/nccl-tests/build/all_reduce_perf 

  6. The following output of the command confirms that the build is using EFA:

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

INFO: Function: main Line: 49: NET/OFI Process rank 8 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.

INFO: Function: main Line: 53: NET/OFI Received 1 network devices

INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

 

Synthetic two-node performance

This blog includes the NCCL-tests GitHub as part of the deployment stack. This shows synthetic benchmarking of the communication layer over NCCL and the EFA network.

When launching the two-node cluster, complete the following steps:

  1. Place the instances in the cluster placement group.
  2. SSH into one of the nodes.
  3. Fill out the hosts file.
  4. Run the two-node test with the following code:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x FI_PROVIDER="efa" -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

This test makes sure that the node performance works the way it is supposed to.

The following graph compares the NCCL bandwidth performance using -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp". There is a three-fold increase in bus bandwidth when using EFA.

 


Now that you have run the two node tests, you can move on to a deep learning use case.

FAIRSEQ ML training on a P3dn cluster

Fairseq(-py) is a sequence modeling toolkit that allows you to train custom models for translation, summarization, language modeling, and other text-generation tasks. FAIRSEQ MACHINE TRANSLATION distributed training requires a fast network to support the Allreduce algorithm. Fairseq provides reference implementations of various sequence-to-sequence models, including convolutional neural networks (CNN), long short-term memory (LSTM) networks, and transformer (self-attention) networks.

 

After you receive consistent 10 GB/s bus-bandwidth on the new P3dn instance, you are ready for FAIRSEQ distributed training.

To install fairseq from source and develop locally, complete the following steps:

  1. Copy the FAIRSEQ source code to one of the P3dn instances.
  2. Copy the FAIRSEQ training data into the data folder.
  3. Copy the FAIRSEQ test data into the data folder.

 

git clone https://github.com/pytorch/fairseq

cd fairseq

pip install --editable .

Now that you have FAIRSEQ installed, you can run the training model. Complete the following steps:

  1. Run FAIRSEQ training on a single 1-node/8-GPU p3dn instance to check the performance and accuracy of FAIRSEQ operations.
  2. Create a custom AMI.
  3. Build the other 31 instances from the custom AMI.

 

Use the following script for distributed All-Reduce FAIRSEQ training:

 

export RANK=$1 # the rank of this process, from 0 to 127 in case of 128 GPUs
export LOCAL_RANK=$2 # the local rank of this process, from 0 to 7 in case of 8 GPUs per mac
export NCCL_DEBUG=INFO
export NCCL_TREE_THRESHOLD=0;
export FI_PROVIDER="efa";

export FI_EFA_TX_MIN_CREDITS=64;
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64/:/home/ec2-user/aws-ofi-nccl/install/lib/:/home/ec2-user/nccl/build/lib:$LD_LIBRARY_PATH;
echo $FI_PROVIDER
echo $LD_LIBRARY_PATH
python train.py data-bin/wmt18_en_de_bpej32k \
   --clip-norm 0.0 -a transformer_vaswani_wmt_en_de_big \
   --lr 0.0005 --source-lang en --target-lang de \
   --label-smoothing 0.1 --upsample-primary 16 \
   --attention-dropout 0.1 --dropout 0.3 --max-tokens 3584 \
   --log-interval 100  --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy --fp16 \
   --max-update 500000 --seed 3 --save-interval-updates 16000 \
   --share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' \
   --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
   --warmup-updates 4000 --min-lr 1e-09 \
   --distributed-port 12597 --distributed-world-size 32 \
   --distributed-init-method 'tcp://172.31.43.34:9218' --distributed-rank $RANK \
   --device-id $LOCAL_RANK \
   --max-epoch 3 \
   --no-progress-bar  --no-save

Now that you have completed and validated your base infrastructure layer, you can add additional components to the stack for various workflows. The following charts show time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training.

time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training

 

Conclusion

EFA on p3dn.24xlarge allows you to take advantage of additional performance at scale with no change in code. With this updated infrastructure, you can decrease cost and time to results by using more GPUs to scale out and get more done on complex workloads like natural language processing. This blog provides much of the undifferentiated heavy lifting with the DLAMI integrated with EFA. Go power up your ML workloads with EFA!

 

Optimizing for cost, availability and throughput by selecting your AWS Batch allocation strategy

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/optimizing-for-cost-availability-and-throughput-by-selecting-your-aws-batch-allocation-strategy/

This post is contributed by Steve Kendrex, Senior Technical Product Manager, AWS Batch

 

Introduction

 

AWS offers a broad range of instances that are advantageous for batch workloads. The scale and provisioning speed of AWS compute instances allow you to get up and running at peak capacity in minutes without paying for downtime. Today, I’m pleased to introduce allocation strategies: a significant new capability in AWS Batch that makes provisioning compute resources flexible and simple. In this blog post, I explain how the AWS Batch allocation strategies work and when you should use them for your workload, and I provide an example CloudFormation script. This blog helps you get started building the Compute Environment (CE) most appropriate to your workloads.

Overview

AWS Batch is a fully managed, cloud-native batch scheduler. It manages the queuing and scheduling of your batch jobs, and the resources required to run your jobs. One of AWS Batch’s great strengths is the ability to manage instance provisioning as your workload requirements and budget needs change. AWS Batch takes advantage of AWS’s broad base of compute types. For example, you can launch compute-optimized or memory-optimized instances to handle different workload types, without having to build a cluster sized for peak demand.

Previously, AWS Batch had a cost-controlling approach to manage compute instances for your workloads. The service chose an instance that was the best fit for your jobs based on vCPU, memory, and GPU requirements, at the lowest cost. Now, the newly added allocation strategies provide flexibility. They allow AWS Batch to consider capacity and throughput in addition to cost when provisioning your instances. This allows you to set different priorities when launching instances depending on your workloads’ needs, such as controlling cost, maximizing throughput, or minimizing Amazon EC2 Spot Instance interruption rates.

There are now three instance allocation strategies from which to choose when creating an AWS Batch Compute Environment (CE). They are:

1. Spot Capacity Optimized

2. Best Fit Progressive

3. Best Fit

 

Spot Capacity Optimized

As the name implies, the Spot capacity optimized strategy is only available when launching Spot CEs in AWS Batch. In fact, I recommend the Spot capacity optimized strategy for most of your interruptible workloads running on Spot today. This strategy takes advantage of the recently released EC2 Auto Scaling and EC2 Fleet capacity optimized strategy. Next, I examine how this strategy behaves in AWS Batch.

Let’s say you’re running a simulation workload in AWS Batch. Your workload is Spot-appropriate (see this whitepaper to determine whether yours is), so you want to take advantage of the savings Spot offers. However, you also want to minimize your Spot interruption rate, so you follow the Spot best practices: your instances can run across multiple instance types and multiple Availability Zones. When creating your Spot CE in AWS Batch, input all the instance types with which your workload is compatible in the instance field, or select ‘optimal’, which allows AWS Batch to choose from among the M, C, and R instance families. The image below shows how this appears in the console:

AWS Batch console with SPOT_CAPACITY_OPTIMIZED selected

When evaluating your workload, AWS Batch selects from the allowed instance types specified in the compute resources parameter of your Spot CE that are capable of running your jobs. From those capable instance types, AWS Batch calculates the assortment of instance types that has the most Spot capacity. AWS Batch then launches those instances on your behalf and runs your jobs when those instances are available. This strategy gives you access to AWS compute resources at a fraction of the On-Demand cost. The Spot capacity optimized strategy works whether you’re trying to launch hundreds of thousands (or a million!) of vCPUs in Spot, or simply trying to lower your chance of interruption. Additionally, AWS Batch manages your instance pool to meet the capacity needed to run your workload as time passes.

For example, as your workloads run, demand in an Availability Zone may shift. This might lead to several of your instances being reclaimed. In that event, AWS Batch automatically attempts to scale a different instance type based on the deepest capacity pools. Assuming you set a retry attempt count, your jobs then automatically retry. Then, AWS Batch scales new instances until either it meets the desired capacity, or it runs out of instance types to launch based on those provided.  That is why I recommend that you give AWS Batch as many instance types and families as possible to choose from when running Spot capacity optimized. Additional detail on behavior can be found in the capacity optimized documentation.

To launch a Spot capacity optimized CE, follow these steps:

1. Navigate to the AWS Batch console.

2. Create a new compute environment.

3. Select “Spot Capacity Optimized” in the Allocation Strategy field.

Alternatively, you can use the CreateComputeEnvironment API; in the AllocationStrategy field, pass in “SPOT_CAPACITY_OPTIMIZED”. The request should look like the following:

…
"TestAllocationStrategyCE": {
  "Type": "AWS::Batch::ComputeEnvironment",
  "Properties": {
    "State": "ENABLED",
    "Type": "MANAGED",
    "ComputeResources": {
      "Subnets": [
        {"Ref": "TestSubnet"}
      ],
      "InstanceRole": {
        "Ref": "TestIamInstanceProfile"
      },
      "MinvCpus": 0,
      "InstanceTypes": [
        "optimal"
      ],
      "SecurityGroupIds": [
        {"Ref": "TestSecurityGroup"}
      ],
      "DesiredvCpus": 0,
      "MaxvCpus": 12,
      "AllocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
      "Type": "EC2"
    },
    "ServiceRole": {
      "Ref": "TestAWSBatchServiceRole"
    }
  }
},
…

Once you follow these steps, your Spot capacity optimized CE should be up and running.
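If you script your infrastructure, the same request can be expressed through the CreateComputeEnvironment API. Below is a minimal Python sketch that builds the request body using boto3-style field names; the CE name, subnet, security group, and role ARNs are placeholders, not real resources:

```python
# Sketch only: mirrors the CloudFormation template above using the
# CreateComputeEnvironment (boto3) field names. All identifiers below
# are placeholders.
spot_ce_request = {
    "computeEnvironmentName": "test-spot-capacity-optimized-ce",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "desiredvCpus": 0,
        "maxvCpus": 12,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-EXAMPLE"],
        "securityGroupIds": ["sg-EXAMPLE"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    },
    "serviceRole": "arn:aws:iam::123456789012:role/AWSBatchServiceRole",
}

# With real resources and AWS credentials, you would submit it with:
# import boto3
# boto3.client("batch").create_compute_environment(**spot_ce_request)
```

With valid resources in place, the commented boto3 call at the end submits the request.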

 

Best Fit Progressive

Imagine you have a time-sensitive machine learning workload that is very compute intensive. You want to run this workload on C5 instances because you know those have a preferable vCPU-to-memory ratio for your jobs. In a pinch, however, M5 instances can run your workload perfectly well. You’re happy to take advantage of Spot prices, but you also need a base level of throughput, so you run part of the workload on On-Demand instances. In this case, I recommend the best fit progressive strategy. This strategy is available in both On-Demand and Spot CEs, and I recommend it for most On-Demand workloads. It lets AWS Batch choose the best fit instance for your workload (based on your jobs’ vCPU and memory requirements). In this context, “best fit” means AWS Batch provisions the fewest instances capable of running your jobs at the lowest cost.

Sometimes, AWS Batch cannot procure enough of the best fit instances to meet your capacity. When this is the case, it progressively looks for the next best fit instance type from those you specified in the ‘compute resources’ parameter. Generally, AWS Batch attempts to spin up different instance sizes within the same family first, because it has already determined that the family’s vCPU-to-memory ratio fits your workload. If it still cannot find enough instances that can run your jobs, AWS Batch launches instances from a different family. These attempts continue until capacity is met, or until it runs out of available instance types from which to select.
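To make the fallback order concrete, here is a toy Python sketch of that search order — my own illustration with hypothetical instance pools, not AWS Batch’s actual implementation:

```python
def progressive_order(best_fit_family, families):
    """Illustrative only: sizes in the best-fit family come first,
    then instance types from the remaining families."""
    order = list(families.get(best_fit_family, []))
    for family, sizes in families.items():
        if family != best_fit_family:
            order.extend(sizes)
    return order

# Hypothetical pools: C5 has the preferred vCPU-to-memory ratio,
# M5 is the acceptable fallback family.
pools = {
    "c5": ["c5.2xlarge", "c5.4xlarge"],
    "m5": ["m5.2xlarge", "m5.4xlarge"],
}
print(progressive_order("c5", pools))
# ['c5.2xlarge', 'c5.4xlarge', 'm5.2xlarge', 'm5.4xlarge']
```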

To create a best fit progressive CE, follow the steps detailed in the Spot capacity optimized strategy section. However, specify the strategy BEST_FIT_PROGRESSIVE when creating a CE, for example:


…
      "InstanceRole": {
        "Ref": "TestIamInstanceProfile"
      },
      "MinvCpus": 0,
      "InstanceTypes": [
        "optimal"
      ],
      "SecurityGroupIds": [
        {"Ref": "TestSecurityGroup"}
      ],
      "DesiredvCpus": 0,
      "MaxvCpus": 12,
      "AllocationStrategy": "BEST_FIT_PROGRESSIVE",
      "Type": "EC2"
    },
    "ServiceRole": {
      "Ref": "TestAWSBatchServiceRole"
    }
…

Important note: you can always restrict AWS Batch’s ability to launch instances by using the Max vCPU setting in your CE. With the best fit progressive and Spot capacity optimized strategies, AWS Batch may go above Max vCPU to meet your capacity requirements. Even then, it will never exceed Max vCPU by more than a single instance (from among those specified in your CE’s compute resources parameter).
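As a back-of-the-envelope illustration of that cap (my own sketch, not AWS Batch internals), the number of instances launched can be modeled like this:

```python
import math

def instances_launched(required_vcpus, max_vcpus, instance_vcpus):
    """Illustration of the documented cap: scaling stops once Max vCPU
    is reached, overshooting by less than one full instance at most."""
    needed = math.ceil(required_vcpus / instance_vcpus)
    # First instance count at or just above the Max vCPU cap.
    allowed = math.ceil(max_vcpus / instance_vcpus)
    return min(needed, allowed)

# Max vCPU of 12 with 16-vCPU instances: a single instance already
# satisfies the cap, exceeding it by less than one instance.
print(instances_launched(required_vcpus=40, max_vcpus=12, instance_vcpus=16))  # 1
```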

 

How to Combine Strategies

You can combine strategies using separate AWS Batch Compute Environments. Let’s take the case I mentioned earlier: you’re happy to take advantage of Spot prices, but you want a base level of throughput for your time-sensitive workloads.

The diagram below shows an On-Demand CE with a secondary Spot CE, attached to the same queue.


 

In this case, you can create two AWS Batch CEs:

1. Create an On-Demand CE that uses the best fit progressive strategy.

2. Set the Max vCPU at the level of throughput necessary for your workload.

3. Create a Spot CE using the Spot capacity optimized strategy.

4. Attach both CEs to your job queue, with the On-Demand CE higher in order.

Once you start submitting jobs to your queue, AWS Batch spins up your On-Demand CE first and starts placing jobs.

When AWS Batch reaches the first CE’s Max vCPU limit, it spins up instances in the next CE. In this case, the next CE is your Spot CE, and AWS Batch places any additional jobs on it. AWS Batch continues to place jobs on both CEs until the queue is empty.
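Expressed against the CreateJobQueue API, that ordering might look like the following Python sketch; the queue and CE names are placeholders:

```python
# Sketch: a job queue that fills the On-Demand CE first, then spills
# over to the Spot CE. Names below are placeholders.
job_queue_request = {
    "jobQueueName": "mixed-strategy-queue",
    "state": "ENABLED",
    "priority": 1,
    "computeEnvironmentOrder": [
        {"order": 1, "computeEnvironment": "ondemand-best-fit-progressive-ce"},
        {"order": 2, "computeEnvironment": "spot-capacity-optimized-ce"},
    ],
}

# Lower "order" values are tried first, so jobs land on On-Demand
# capacity until that CE hits its Max vCPU, then on Spot.
```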

Please see this repository for sample CloudFormation code to replicate this environment. Or, click here for more examples of leveraging Spot with AWS Batch.

 

Best Fit

Imagine you have a well-defined genomics sequencing workload. You know that this workload performs best on M5 instances, and you run it On-Demand because it is not interruptible. You’ve run this workload on AWS Batch before, and you’re happy with its current behavior. You’re willing to trade off occasional capacity constraints in return for strict cost control. In this case, the best fit strategy may be a good option. This strategy used to be AWS Batch’s only behavior: it examines the queue and picks the best fit instance type and size for the workload. As described earlier, best fit means the fewest instances capable of running the workload at the lowest cost. In general, we recommend the best fit strategy only when you want the lowest cost for your instances and are willing to trade throughput and availability for it.

Note: AWS Batch will not launch instances above Max vCPU while using the best fit strategy. To launch a best fit CE, specify BEST_FIT in the AllocationStrategy field, similar to the following:

…
      "InstanceRole": {
        "Ref": "TestIamInstanceProfile"
      },
      "MinvCpus": 0,
      "InstanceTypes": [
        "optimal"
      ],
      "SecurityGroupIds": [
        {"Ref": "TestSecurityGroup"}
      ],
      "DesiredvCpus": 0,
      "MaxvCpus": 12,
      "AllocationStrategy": "BEST_FIT",
      "Type": "EC2"
    },
    "ServiceRole": {
      "Ref": "TestAWSBatchServiceRole"
    }
…

Important Note for AWS Batch Allocation Strategies with Spot Instances:

You always have the option to set a percentage of the On-Demand price when creating a Spot CE. When you do, AWS Batch only launches instances whose current Spot price is below that percentage of the On-Demand price. In general, setting a percentage of the On-Demand price lowers your availability, and should only be used if you need strict cost controls. If you want to enjoy the cost savings of Spot with better availability, I recommend that you do not set a percentage of the On-Demand price.

Conclusion

With these new allocation strategies, you now have much greater flexibility to control how AWS Batch provisions your instances. This allows you to make better throughput and cost trade-offs depending on the sensitivity of your workload. To learn more about how these strategies behave, please visit the AWS Batch documentation. Feel free to experiment with AWS Batch on your own to get an idea of how they help you run your specific workload.

 

Thanks to Chad Schmutzer for his support on the CloudFormation template.

Deploying a highly available WordPress site on Amazon Lightsail, Part 1: Implementing a highly available Lightsail database with WordPress

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/deploying-a-highly-available-wordpress-site-on-amazon-lightsail-part-1-implementing-a-highly-available-lightsail-database-with-wordpress/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail

This post walks you through what to consider when architecting a scalable, redundant WordPress site. It discusses how WordPress stores such elements as user accounts, posts, settings, media, and themes, and how to configure WordPress to work with a standalone database.

This walkthrough deploys a WordPress site on Amazon Lightsail. Lightsail is the easiest way to get started on AWS, and it might be the easiest (and least expensive) way to get started on WordPress. You can launch a new WordPress site in a few clicks with one of Lightsail’s blueprints, for a few dollars a month. This gives you a single Lightsail instance to host your WordPress site that’s perfect for a small personal blog.

However, you may need a more resilient site capable of scaling to meet increased demand and architected to provide a degree of redundancy. If you’re a novice cloud user, the idea of setting up a highly available WordPress implementation might seem daunting. But with Lightsail and other AWS services, it doesn’t need to be.

Subsequent posts in this series cover managing media files in a multi-server environment, using CloudFront to increase site security and performance, and scaling the WordPress front-end with a Lightsail load balancer.

What’s under the hood?

Even though you’re a WordPress user, you may not have thought about how WordPress is built. However, if you’re moving into managing your WordPress site, it’s essential to understand what’s under the hood. As a content management system (CMS), WordPress provides a lot of functionality; this post focuses on some of the more essential features in the context of how they relate to architecting your highly available WordPress site.

WordPress manages a variety of different data. There are user accounts, posts, media (such as images and videos), themes (code that customizes the look and feel of a given WordPress site), plugins (code that adds additional functionality to your site), and configuration settings.

Where WordPress stores your data varies depending on the type of data. At the most basic level, WordPress is a PHP application running on a web server and database. The web server is the instance that you create in Lightsail, and includes WordPress software and the MySQL database. The following diagram shows the Lightsail VPC architecture.

Lightsail's VPC Architecture

The database stores a big chunk of the data that WordPress needs; for example, all of the user account information and blog posts. However, the web server’s file system stores another portion of data; for example, a new image you upload to your WordPress server. Finally, with themes and plugins, both the database and the file system store information. For example, the database holds information on what plugins and which theme is currently active, but the file system stores the actual code for the themes and plugins.

To provide a highly available WordPress implementation, you need to provide redundancy not only for the database, but also the content that may live on the file system.

Prerequisites

This solution has the following prerequisites:

This post and the subsequent posts deal with setting up a new WordPress site. If you have an existing site, the processes to follow are similar, but you should consult the documentation for both Lightsail and WordPress. Also, be sure to back up your existing database as well as snapshot your existing WordPress instance.

Configuring the database

The standalone MySQL database you created is not yet configured to work with WordPress. You need to create the actual database and define the tables that WordPress needs. The easiest method is to export the tables from the database on your WordPress instance and import them into your standalone MySQL database. To do so, complete the following steps:

  1. Connect to your WordPress instance by using your SSH client or the web-based SSH client in the Lightsail console. The screenshot below highlights the icon to click.

WordPress icon for connecting through SSH

  2. From the terminal prompt for your WordPress instance, set two environment variables (LSDB_USERNAME and LSDB_ENDPOINT) that contain the connection information for your standalone database.

You can find that information on the database’s management page from the Lightsail console. See the following screenshot of the Connection details page.

the Connection details page

  3. To set the environment variables, substitute the values for your instance into the following code example and input each line one at a time at the terminal prompt:

LSDB_USERNAME=UserName

LSDB_ENDPOINT=Endpoint

For example, your input should look similar to the following code:

LSDB_USERNAME=dbmasteruser

LSDB_ENDPOINT=ls.rds.amazonaws.com

  4. Retrieve the Bitnami application password for the database running on your WordPress instance.

This password is stored at /home/bitnami/bitnami_application_password.

Enter the following cat command in the terminal to display the value:

cat /home/bitnami/bitnami_application_password


You need this password in the following steps.

  5. Enter the following command into the terminal window:

mysqldump \

 -u root \

--databases bitnami_wordpress \

--single-transaction \

--order-by-primary \

-p > dump.sql

This command creates a file (dump.sql) that defines the database and all the needed tables.

  6. When prompted for a password, enter the Bitnami application password you recorded previously.

The terminal window doesn’t show the password as you enter it.

Now that you have the right database structure exported, import that into your standalone database. You’ll do this by entering the contents of your dump file into the mysql command line.

  7. Enter the following command at the terminal prompt:

cat dump.sql | mysql \

--user $LSDB_USERNAME \

--host $LSDB_ENDPOINT \

-p

  8. When prompted for a password, enter the password for your Lightsail database.

The terminal window doesn’t show the password as you enter it.

  9. Enter the following mysql command in the instance terminal:

echo 'use bitnami_wordpress; show tables' | \

mysql \

--user $LSDB_USERNAME \

--host $LSDB_ENDPOINT \

-p

This command shows the structure of the WordPress database, and verifies that you created the database on your Lightsail instance successfully.

  10. When prompted for a password, enter the password for your standalone database.

You should receive the following output:

Tables_in_bitnami_wordpress

wp_commentmeta

wp_comments

wp_links

wp_options

wp_postmeta

wp_posts

wp_term_relationships

wp_term_taxonomy

wp_termmeta

wp_terms

wp_usermeta

wp_users

This test confirms that your standalone database is ready for you to use with your WordPress instance.

Configuring WordPress

Now that you have the standalone database configured, modify the WordPress configuration file (wp-config.php) to direct the WordPress instance to use the standalone database instead of the database on the instance.

The first step is to back up your existing configuration file. If you run into trouble, copy wp-config.php.bak over to wp-config.php to roll back any changes.

  1. Enter the following code:

cp /home/bitnami/apps/wordpress/htdocs/wp-config.php /home/bitnami/apps/wordpress/htdocs/wp-config.php.bak

You are using wp-cli to modify the wp-config file.

  2. Swap out the following values with those for your Lightsail database:

wp config set DB_USER UserName

wp config set DB_PASSWORD Password

wp config set DB_HOST Endpoint

The following screenshot shows the example database values.

example database values

For example:

wp config set DB_USER dbmasteruser

wp config set DB_PASSWORD 'MySecurePassword!2019'

wp config set DB_HOST ls.rds.amazonaws.com

To avoid issues with any special characters the password may contain, make sure to wrap the password value in single quotes (‘).

  3. Enter the following command:

wp config list

The output should match the values on the database’s management page in the Lightsail console. This confirms that the changes went through.

  4. Restart WordPress by entering the following command:

sudo /opt/bitnami/ctlscript.sh restart

Conclusion

This post covered a lot of ground. Hopefully it educated and inspired you to sign up for a free AWS account and start building out a robust WordPress site. Subsequent posts in this series show how to deal with your uploaded media and scale the web front end with a Lightsail load balancer.

Now Available: Bare Metal Arm-Based EC2 Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-bare-metal-arm-based-ec2-instances/

At AWS re:Invent 2018, we announced a new line of Amazon Elastic Compute Cloud (EC2) instances: the A1 family, powered by Arm-based AWS Graviton processors. This family is a great fit for scale-out workloads, e.g. web front-ends, containerized microservices, or caching fleets. By expanding the choice of compute options, A1 instances help customers use the right instances for the right applications, and deliver up to 45% cost savings. In addition, A1 instances enable Arm developers to build and test natively on Arm-based infrastructure in the cloud: no more cross-compilation or emulation required.

Today, we are happy to expand the A1 family with a bare metal option.

Bare Metal for A1

Instance Name | Logical Processors | Memory | EBS-Optimized Bandwidth | Network Bandwidth
a1.metal      | 16                 | 32 GiB | 3.5 Gbps                | Up to 10 Gbps

Just like for existing bare metal instances (M5, M5d, R5, R5d, z1d, and so forth), your operating system runs directly on the underlying hardware with direct access to the processor.

As described in a previous blog post, you can leverage bare metal instances for applications that:

  • need access to physical resources and low-level hardware features, such as performance counters, that are not always available or fully supported in virtualized environments,
  • are intended to run directly on the hardware, or licensed and supported for use in non-virtualized environments.

Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

Working with A1 Instances
Bare metal or not, it’s never been easier to work with A1 instances. Initially launched in four AWS regions, they’re now available in four additional regions: Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney).

From a software perspective, you can run Amazon Machine Images for popular Linux distributions like Ubuntu, Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Debian, and of course Amazon Linux 2 on A1 instances. Applications such as the Apache HTTP Server and NGINX Plus are available too. So are all major programming languages and run-times including PHP, Python, Perl, Golang, Ruby, NodeJS, and multiple flavors of Java, including Amazon Corretto, a supported open source OpenJDK implementation.

What about containers? Good news here as well! Amazon ECS and Amazon EKS both support A1 instances. Docker has announced support for Arm-based architectures in Docker Enterprise Edition, and most Docker official images support Arm. In addition, millions of developers can now use Arm emulation to build, run and test containers on their desktop machine before moving them to production.

As you would expect, A1 instances are seamlessly integrated with many AWS services, such as Amazon EBS, Amazon CloudWatch, Amazon Inspector, AWS Systems Manager and AWS Batch.

Now Available!
You can start using a1.metal instances today in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney). As always, we appreciate your feedback, so please don’t hesitate to get in touch via the AWS Compute Forum, or through your usual AWS support contacts.

Julien;

New M5n and R5n EC2 Instances, with up to 100 Gbps Networking

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-m5n-and-r5n-instances-with-up-to-100-gbps-networking/

AWS customers build ever-demanding applications on Amazon EC2. To support them the best we can, we listen to their requirements, go to work, and come up with new capabilities. For instance, in 2018, we upgraded the networking capabilities of Amazon EC2 C5 instances, with up to 100 Gbps networking, and significant improvements in packet processing performance. These are made possible by our new virtualization technology, aka the AWS Nitro System, and by the Elastic Fabric Adapter which enables low latency on 100 Gbps networking platforms.

In order to extend these benefits to the widest range of workloads, we’re happy to announce that these same networking capabilities are available today for both Amazon EC2 M5 and R5 instances.

Introducing Amazon EC2 M5n and M5dn instances
Since the very early days of Amazon EC2, the M family has been a popular choice for general-purpose workloads. The new M5(d)n instances uphold this tradition, and are a great fit for databases, High Performance Computing, analytics, and caching fleets that can take advantage of improved network throughput and packet rate performance.

The chart below lists out the new instances and their specs: each M5(d) instance size now has an M5(d)n counterpart, which supports the upgraded networking capabilities discussed above. For example, whereas the regular m5(d).8xlarge instance has a respectable network bandwidth of 10 Gbps, its m5(d)n.8xlarge sibling goes to 25 Gbps. The top of the line m5(d)n.24xlarge instance even hits 100 Gbps.

Here are the specs:

Instance Name                | Logical Processors | Memory  | Local Storage (m5dn only) | EBS-Optimized Bandwidth | Network Bandwidth
m5n.large / m5dn.large       | 2                  | 8 GiB   | 1 x 75 GB NVMe SSD        | Up to 3.5 Gbps          | Up to 25 Gbps
m5n.xlarge / m5dn.xlarge     | 4                  | 16 GiB  | 1 x 150 GB NVMe SSD       | Up to 3.5 Gbps          | Up to 25 Gbps
m5n.2xlarge / m5dn.2xlarge   | 8                  | 32 GiB  | 1 x 300 GB NVMe SSD       | Up to 3.5 Gbps          | Up to 25 Gbps
m5n.4xlarge / m5dn.4xlarge   | 16                 | 64 GiB  | 2 x 300 GB NVMe SSD       | 3.5 Gbps                | Up to 25 Gbps
m5n.8xlarge / m5dn.8xlarge   | 32                 | 128 GiB | 2 x 600 GB NVMe SSD       | 5 Gbps                  | 25 Gbps
m5n.12xlarge / m5dn.12xlarge | 48                 | 192 GiB | 2 x 900 GB NVMe SSD       | 7 Gbps                  | 50 Gbps
m5n.16xlarge / m5dn.16xlarge | 64                 | 256 GiB | 4 x 600 GB NVMe SSD       | 10 Gbps                 | 75 Gbps
m5n.24xlarge / m5dn.24xlarge | 96                 | 384 GiB | 4 x 900 GB NVMe SSD       | 14 Gbps                 | 100 Gbps
m5n.metal / m5dn.metal       | 96                 | 384 GiB | 4 x 900 GB NVMe SSD       | 14 Gbps                 | 100 Gbps

Introducing Amazon EC2 R5n and R5dn instances
The R5 family is ideally suited for memory-hungry workloads, such as high performance databases, distributed web scale in-memory caches, mid-size in-memory databases, real time big data analytics, and other enterprise applications.

The logic here is exactly the same: each R5(d) instance size has an R5(d)n counterpart. Here are the specs:

Instance Name                | Logical Processors | Memory  | Local Storage (r5dn only) | EBS-Optimized Bandwidth | Network Bandwidth
r5n.large / r5dn.large       | 2                  | 16 GiB  | 1 x 75 GB NVMe SSD        | Up to 3.5 Gbps          | Up to 25 Gbps
r5n.xlarge / r5dn.xlarge     | 4                  | 32 GiB  | 1 x 150 GB NVMe SSD       | Up to 3.5 Gbps          | Up to 25 Gbps
r5n.2xlarge / r5dn.2xlarge   | 8                  | 64 GiB  | 1 x 300 GB NVMe SSD       | Up to 3.5 Gbps          | Up to 25 Gbps
r5n.4xlarge / r5dn.4xlarge   | 16                 | 128 GiB | 2 x 300 GB NVMe SSD       | 3.5 Gbps                | Up to 25 Gbps
r5n.8xlarge / r5dn.8xlarge   | 32                 | 256 GiB | 2 x 600 GB NVMe SSD       | 5 Gbps                  | 25 Gbps
r5n.12xlarge / r5dn.12xlarge | 48                 | 384 GiB | 2 x 900 GB NVMe SSD       | 7 Gbps                  | 50 Gbps
r5n.16xlarge / r5dn.16xlarge | 64                 | 512 GiB | 4 x 600 GB NVMe SSD       | 10 Gbps                 | 75 Gbps
r5n.24xlarge / r5dn.24xlarge | 96                 | 768 GiB | 4 x 900 GB NVMe SSD       | 14 Gbps                 | 100 Gbps
r5n.metal / r5dn.metal       | 96                 | 768 GiB | 4 x 900 GB NVMe SSD       | 14 Gbps                 | 100 Gbps

These new M5(d)n and R5(d)n instances are powered by custom second generation Intel Xeon Scalable Processors (based on the Cascade Lake architecture) with sustained all-core turbo frequency of 3.1 GHz and maximum single core turbo frequency of 3.5 GHz. Cascade Lake processors enable new Intel Vector Neural Network Instructions (AVX-512 VNNI) which will help speed up typical machine learning operations like convolution, and automatically improve inference performance over a wide range of deep learning workloads.

Now Available!
You can start using the M5(d)n and R5(d)n instances today in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Singapore).

We hope that these new instances will help you tame your network-hungry workloads! Please send us feedback, either on the AWS Forum for Amazon EC2, or through your usual support contacts.

Julien;

Building a pocket platform-as-a-service with Amazon Lightsail

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/building-a-pocket-platform-as-a-service-with-amazon-lightsail/

This post was written by Robert Zhu, a principal technical evangelist at AWS and a member of the GraphQL Working Group. 

When you start a new web-based project, you figure out what kind of infrastructure you need. For my projects, I prioritize simplicity, flexibility, value, and on-demand capacity, and find myself quickly needing the following features:

  • DNS configuration
  • SSL support
  • Subdomain to a service
  • SSL reverse proxy to localhost (similar to ngrok and serveo)
  • Automatic deployment after a commit to the source repo (nice to have)

new projects have different requirements compared to mature projects

Amazon Lightsail is perfect for building a simple “pocket platform” that provides all these features. It’s cheap and easy for beginners and provides a friendly interface for managing virtual machines and DNS. This post shows step-by-step how to assemble a pocket platform on Amazon Lightsail.

Walkthrough

The following steps describe the process. If you prefer to learn by watching videos instead, view the steps by watching the following: Part 1, Part 2, Part 3.

Prerequisites

You should be familiar with: Linux, SSH, SSL, Docker, Nginx, HTTP, and DNS.

Steps

Use the following steps to assemble a pocket platform.

Creating a domain name and static IP

First, you need a domain name for your project. You can register your domain with any domain name registration service, such as Amazon Route53.

  1. After your domain is registered, open the Lightsail console, choose the Networking tab, and choose Create static IP.

Lightsail console networking tab

  2. On the Create static IP page, give the static IP a name you can remember and don’t worry about attaching it to an instance just yet. Choose Create DNS zone.

 

  3. On the Create a DNS zone page, enter your domain name and then choose Create DNS zone. For this post, I use the domain “raccoon.news”.

DNS zone in Lightsail with two A records

  4. Choose Add Record and create two A records—“@.raccoon.news” and “raccoon.news”—both resolving to the static IP address you created earlier. Then, copy the values for the Lightsail name servers at the bottom of the page. Go back to your domain name provider, and edit the name servers to point to the Lightsail name servers. Since I registered my domain with Route53, here’s what it looks like:

Changing name servers in Route53

Note: If you registered your domain with Route53, make sure you change the name server values under “domain registration,” not “hosted zones,” and delete the hosted zone that Route53 automatically creates for your domain.

Setting up your Lightsail instance

While you wait for your DNS changes to propagate, set up your Lightsail instance.

  1. In the Lightsail console, create a new instance and select Ubuntu 18.04.

Choose OS Only and Ubuntu 18.04 LTS

For this post, you can use the cheapest instance. However, when you run anything in production, make sure you choose an instance with enough capacity for your workload.

  2. After the instance launches, select it, then click on the Networking tab and open two additional TCP ports: 443 and 2222. Then, attach the static IP allocated earlier.
  3. To connect to the Lightsail instance using SSH, download the SSH key, and save it to a friendly path, for example: ~/ls_ssh_key.pem.

Click the download link to download the SSH key

  • Restrict permissions for your SSH key:

chmod 400 ~/ls_ssh_key.pem

  • Connect to the instance using SSH:

ssh -i ~/ls_ssh_key.pem ubuntu@STATIC_IP

  4. After you connect to the instance, install Docker to help manage deployment and configuration:

sudo apt-get update && sudo apt-get install docker.io
sudo systemctl start docker
sudo systemctl enable docker
docker run hello-world

  5. After Docker is installed, set up a gateway using the nginx-proxy container. This container lets you route traffic to other containers by providing the “VIRTUAL_HOST” environment variable. Conveniently, nginx-proxy comes with an SSL companion, letsencrypt-nginx-proxy-companion, which uses Let’s Encrypt.

# start the reverse proxy container
sudo docker run --detach \
    --name nginx-proxy \
    --publish 80:80 \
    --publish 443:443 \
    --volume /etc/nginx/certs \
    --volume /etc/nginx/vhost.d \
    --volume /usr/share/nginx/html \
    --volume /var/run/docker.sock:/tmp/docker.sock:ro \
    jwilder/nginx-proxy

# start the letsencrypt companion
sudo docker run --detach \
    --name nginx-proxy-letsencrypt \
    --volumes-from nginx-proxy \
    --volume /var/run/docker.sock:/var/run/docker.sock:ro \
    --env "DEFAULT_EMAIL=YOUREMAILHERE" \
    jrcs/letsencrypt-nginx-proxy-companion

# start a demo web server under a subdomain
sudo docker run --detach \
    --name nginx \
    --env "VIRTUAL_HOST=test.EXAMPLE.COM" \
    --env "LETSENCRYPT_HOST=test.EXAMPLE.COM" \
    nginx

Pay special attention to setting a valid email for the DEFAULT_EMAIL environment variable on the proxy companion; otherwise, you’ll need to specify the email whenever you start a new container. If everything went well, you should be able to navigate to https://test.EXAMPLE.COM and see the nginx default content with a valid SSL certificate that has been auto-generated by Let’s Encrypt.

A publicly accessible URL served from our Lightsail instance with SSL support


Deploying a localhost proxy with SSL

Most developers prefer to code on a dev machine (laptop or desktop) because they can access the file system, use their favorite IDE, recompile, debug, and more. Unfortunately, developing on a dev machine can introduce bugs due to differences from the production environment. Also, certain services (for example, Alexa Skills or GitHub Webhooks) require SSL to work, which can be annoying to configure on your local machine.

For this post, you can use an SSL reverse proxy to make your local dev environment resemble production from the browser’s point of view. This technique also allows your test application to make API requests to production endpoints that enforce Cross-Origin Resource Sharing restrictions. While it’s not a perfect solution, it takes you one step closer toward a frictionless dev/test feedback loop. You may have used services like ngrok and serveo for this purpose. By running your own reverse proxy, you won’t need to spread your domain and SSL settings across multiple services.

To run a reverse proxy, create an SSH reverse tunnel. After the reverse tunnel SSH session is initiated, all network requests to the specified port on the host are proxied to your dev machine. However, since your Lightsail instance is already using port 22 for VPS management, you need a different SSH port (use 2222 from earlier). To keep everything organized, run the SSH server for port 2222 inside a special proxy container. The following diagram shows this solution.

Diagram of how an SSL reverse proxy works with SSH tunneling

Using Dockerize an SSH service as a starting point, I created a repository with a working Dockerfile and nginx config for reference. Here are the summary steps:

git clone https://github.com/robzhu/nginx-local-tunnel
cd nginx-local-tunnel

docker build -t {DOCKERUSER}/dev-proxy . --build-arg ROOTPW={PASSWORD}

# start the proxy container
# Note, 2222 is the port we opened on the instance earlier.
docker run --detach -p 2222:22 \
    --name dev-proxy \
    --env "VIRTUAL_HOST=dev.EXAMPLE.com" \
    --env "LETSENCRYPT_HOST=dev.EXAMPLE.com" \
    {DOCKERUSER}/dev-proxy

# Ports explained:
# 3000 refers to the port that your app is running on localhost.
# 2222 is the forwarded port on the host that we use to directly SSH into the container.
# 80 is the default HTTP port, forwarded from the host
ssh -R :80:localhost:3000 -p 2222 root@dev.EXAMPLE.com

# Start sample app on localhost
cd node-hello && npm i
nodemon main.js

# Point browser to https://dev.EXAMPLE.com
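The tunnel drops whenever the SSH session does. If you want it to survive a flaky connection, one option (my own addition, not part of the original setup) is an ~/.ssh/config alias driven by autossh; the “dev-tunnel” alias name is hypothetical:

```text
# ~/.ssh/config -- "dev-tunnel" is a hypothetical alias
Host dev-tunnel
    HostName dev.EXAMPLE.com
    User root
    Port 2222
    RemoteForward 80 localhost:3000
    ServerAliveInterval 30
    ServerAliveCountMax 3
```

With autossh installed, `autossh -M 0 -N dev-tunnel` re-establishes the session automatically whenever it drops.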

The reverse proxy subdomain works only as long as the reverse proxy SSH connection remains open. If there is no SSH connection, you should see an nginx gateway error:

Nginx will return 502 if you try to access the reverse proxy without running the SSH tunnel

While this solution is handy, be extremely careful, as it could expose your work-in-progress to the internet. Consider adding additional authorization logic and settings for allowing/denying specific IPs.
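For example, nginx can restrict the dev subdomain to known addresses. The nginx-proxy image supports per-virtual-host configuration files, so a sketch might look like this (the address range below is a placeholder from the documentation example block; substitute your own):

```nginx
# Hypothetical per-vhost include for dev.EXAMPLE.com
# (e.g. mounted into the proxy's vhost.d directory)
allow 203.0.113.0/24;   # your office or VPN range
deny  all;
```

Anything outside the allowed range then receives a 403 instead of reaching your work-in-progress.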

Setting up automatic deployment

Finally, build an automation workflow that watches for commits on a source repository, builds an updated container image, and re-deploys the container on your host. There are many ways to do this, but here’s the combination I’ve selected for simplicity:

  1. First, create a GitHub repository to host your application source code. For demo purposes, you can clone my express hello-world example. On the Docker Hub page, create a new repository, click the GitHub icon, and select your repository from the dropdown list.

Create GitHub repo to host your application source code
  2. Docker Hub now watches for commits to the repo and builds a new image tagged “latest” in response. After the image is available, start the container as follows:

docker run --detach \
    --name app \
    --env "VIRTUAL_HOST=app.raccoon.news" \
    --env "LETSENCRYPT_HOST=app.raccoon.news" \
    robzhu/express-hello

  3. Finally, use Watchtower to poll Docker Hub and update the “app” container whenever a new image is detected:

docker run -d \
    --name watchtower \
    -v /var/run/docker.sock:/var/run/docker.sock \
    containrrr/watchtower \
    --interval 10 \
    app  # name of the container started in the previous step
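If you prefer a declarative setup, the same app-plus-Watchtower pairing can be captured in a compose file so both containers restart with the host. This is a sketch assuming the image and domain from the example above:

```yaml
# docker-compose.yml (sketch)
services:
  app:
    image: robzhu/express-hello:latest
    container_name: app
    environment:
      - VIRTUAL_HOST=app.raccoon.news
      - LETSENCRYPT_HOST=app.raccoon.news
    restart: unless-stopped

  watchtower:
    image: containrrr/watchtower
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: --interval 10 app
    restart: unless-stopped
```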


Summary

Your Pocket PaaS is now complete! As long as you deploy new containers and add the VIRTUAL_HOST and LETSENCRYPT_HOST environment variables, you get automatic subdomain routing and SSL termination. With SSH reverse tunneling, you can develop on your local dev machine using your favorite IDE and test/share your app at https://dev.EXAMPLE.COM.

Because this is a public URL with SSL, you can test Alexa Skills, GitHub Webhooks, CORS settings, PWAs, and anything else that requires SSL. Once you’re happy with your changes, a git commit triggers an automated rebuild of your Docker image, which is automatically redeployed by Watchtower.

I hope this information was useful. Thoughts? Leave a comment or direct-message me on Twitter: @rbzhu.