Tag Archives: Amazon EC2

re:Invent 2019: Introducing the Amazon Builders’ Library (Part I)

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/reinvent-2019-introducing-the-amazon-builders-library-part-i/

Today, I’m going to tell you about a new site we launched at re:Invent, the Amazon Builders’ Library, a collection of living articles covering topics across architecture, software delivery, and operations. You get to peek under the hood of how Amazon architects, releases, and operates the software underpinning Amazon.com and AWS.

Want to know how Amazon.com does what it does? This is for you. In this two-part series (the next one coming December 23), I’ll highlight some of the best architecture articles written by Amazon’s senior technical leaders and engineers.

Avoiding insurmountable queue backlogs

In queueing theory, the behavior of queues when they are short is relatively uninteresting. After all, when a queue is short, everyone is happy. It’s only when the queue is backlogged, when the line to an event goes out the door and around the corner, that people start thinking about throughput and prioritization.

In this article, I discuss strategies we use at Amazon to deal with queue backlog scenarios – design approaches we take to drain queues quickly and to prioritize workloads. Most importantly, I describe how to prevent queue backlogs from building up in the first place. In the first half, I describe scenarios that lead to backlogs, and in the second half, I describe many approaches used at Amazon to avoid backlogs or deal with them gracefully.

Read the full article by David Yanacek – Principal Engineer

Timeouts, retries, and backoff with jitter

Whenever one service or system calls another, failures can happen. These failures can come from a variety of factors: servers, networks, load balancers, software, operating systems, or even mistakes from system operators. We design our systems to reduce the probability of failure, but it is impossible to build systems that never fail. So at Amazon, we design our systems to tolerate and reduce the probability of failure, and to avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff.

Read the full article by Marc Brooker, Senior Principal Engineer

Challenges with distributed systems

The moment we added our second server, distributed systems became the way of life at Amazon. When I started at Amazon in 1999, we had so few servers that we could give some of them recognizable names like “fishy” or “online-01”. However, even in 1999, distributed computing was not easy. Then as now, challenges with distributed systems involved latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos. As the systems quickly grew larger and more distributed, what had been theoretical edge cases turned into regular occurrences.

Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s not always possible to know whether something failed.

Read the full article by Jacob Gabrielson, Senior Principal Engineer

Static stability using Availability Zones

At Amazon, the services we build must meet extremely high availability targets. This means that we need to think carefully about the dependencies that our systems take. We design our systems to stay resilient even when those dependencies are impaired. In this article, we’ll define a pattern that we use called static stability to achieve this level of resilience. We’ll show you how we apply this concept to Availability Zones, a key infrastructure building block in AWS and therefore a bedrock dependency on which all of our services are built.

Read the full article by Becky Weiss, Senior Principal Engineer, and Mike Furr, Principal Engineer

Check back in two weeks to read about some other architecture-based expert articles that let you in on how Amazon does what it does.

Amazon EC2 Update – Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/

Our customers are taking to Machine Learning in a big way. They are running many different types of workloads, including object detection, speech recognition, natural language processing, personalization, and fraud detection. When running large-scale production workloads, it is essential that they can perform inferencing as quickly and as cost-effectively as possible. According to what they have told us, inferencing can account for up to 90% of the cost of their machine learning work.

New Inf1 Instances
Today we are launching Inf1 instances in four sizes. These instances are powered by AWS Inferentia chips, and are designed to provide you with fast, low-latency inferencing.

AWS Inferentia chips are designed to accelerate the inferencing process. Each chip can deliver the following performance:

  • 64 teraOPS on 16-bit floating point (FP16 and BF16) and mixed-precision data.
  • 128 teraOPS on 8-bit integer (INT8) data.

The chips also include a high-speed interconnect, and lots of memory. With 16 chips on the largest instance, your new and existing TensorFlow, PyTorch, and MXNet inferencing workloads can benefit from over 2 petaOPS of inferencing power. When compared to the G4 instances, the Inf1 instances offer up to 3x the inferencing throughput, and up to 40% lower cost per inference.

Here are the sizes and specs:

  • inf1.xlarge: 1 Inferentia chip, 4 vCPUs, 8 GiB RAM, up to 3.5 Gbps EBS bandwidth, up to 25 Gbps network bandwidth
  • inf1.2xlarge: 1 Inferentia chip, 8 vCPUs, 16 GiB RAM, up to 3.5 Gbps EBS bandwidth, up to 25 Gbps network bandwidth
  • inf1.6xlarge: 4 Inferentia chips, 24 vCPUs, 48 GiB RAM, 3.5 Gbps EBS bandwidth, 25 Gbps network bandwidth
  • inf1.24xlarge: 16 Inferentia chips, 96 vCPUs, 192 GiB RAM, 14 Gbps EBS bandwidth, 100 Gbps network bandwidth

The instances make use of custom Second Generation Intel® Xeon® Scalable (Cascade Lake) processors, and are available in On-Demand, Spot, and Reserved Instance form, or as part of a Savings Plan, in the US East (N. Virginia) and US West (Oregon) Regions. You can launch the instances directly, and they will also be available soon through Amazon SageMaker, Amazon ECS, and Amazon Elastic Kubernetes Service.

Using Inf1 Instances
AWS Deep Learning AMIs have been updated and contain versions of TensorFlow and MXNet that have been optimized for use on Inf1 instances, with PyTorch support coming very soon. The AMIs ship with the new AWS Neuron SDK, which provides commands to compile, optimize, and execute your ML models on Inferentia chips. You can also include the SDK in your own AMIs and images.
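If you want to experiment, a minimal sketch for launching an Inf1 instance from one of these AMIs might look like the following (the AMI ID, key pair, and subnet are placeholders):

# Launch an inf1.xlarge instance from a Deep Learning AMI (IDs below are placeholders)
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type inf1.xlarge \
    --key-name my-key-pair \
    --subnet-id subnet-0123456789abcdef0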

You can build and train your model on a GPU instance such as a P3 or P3dn, and then move it to an Inf1 instance for production use. You can use a model natively trained in FP16, or you can use models that have been trained to 32 bits of precision and have AWS Neuron automatically convert them to BF16 form. Large models, such as those for language translation or natural language processing, can be split across multiple Inferentia chips in order to reduce latency.

The AWS Neuron SDK also allows you to assign models to Neuron Compute Groups, and to run them in parallel. This allows you to maximize hardware utilization and to use multiple models as part of Neuron Core Pipeline mode, taking advantage of the large on-chip cache on each Inferentia chip. Be sure to read the AWS Neuron SDK Tutorials to learn more!

Jeff;

 

AWS Now Available from a Local Zone in Los Angeles

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-now-available-from-a-local-zone-in-los-angeles/

AWS customers are always asking for more features, more bandwidth, more compute power, and more memory, while also asking for lower latency and lower prices. We do our best to meet these competing demands: we launch new EC2 instance types, EBS volume types, and S3 storage classes at a rapid pace, and we also reduce prices regularly.

AWS in Los Angeles
Today we are launching a Local Zone in Los Angeles, California. The Local Zone is a new type of AWS infrastructure deployment that brings select AWS services very close to a particular geographic area. This Local Zone is designed to provide very low latency (single-digit milliseconds) to applications that are accessed from Los Angeles and other locations in Southern California. It will be of particular interest to demanding applications that are especially sensitive to latency. This includes:

Media & Entertainment – Gaming, 3D modeling & rendering, video processing (including real-time color correction), video streaming, and media production pipelines.

Electronic Design Automation – Interactive design & layout, simulation, and verification.

Ad-Tech – Rapid decision making & ad serving.

Machine Learning – Fast, continuous model training; high-performance low-latency inferencing.

All About Local Zones
The new Local Zone in Los Angeles is a logical part of the US West (Oregon) Region (which I will refer to as the parent region), and has some unique and interesting characteristics:

Naming – The Local Zone can be accessed programmatically as us-west-2-lax-1a. All API, CLI, and Console access takes place through the us-west-2 API endpoint and the US West (Oregon) Console.

Opt-In – You will need to opt in to the Local Zone in order to use it. After opting in, you can create a new VPC subnet in the Local Zone, taking advantage of all relevant VPC features including Security Groups, Network ACLs, and Route Tables. You can target the Local Zone when you launch EC2 instances and other resources, or you can create a default subnet in the VPC and have it happen automatically.

Networking – The Local Zone in Los Angeles is connected to US West (Oregon) over Amazon’s private backbone network. Connections to the public internet take place across an Internet Gateway, giving you local ingress and egress to reduce latency. Elastic IP Addresses can be shared by a group of Local Zones in a particular geographic location, but they do not move between a Local Zone and the parent region. The Local Zone also supports AWS Direct Connect, giving you the opportunity to route your traffic over a private network connection.

Services – We are launching with support for seven EC2 instance types (T3, C5, M5, R5, R5d, I3en, and G4), two EBS volume types (io1 and gp2), Amazon FSx for Windows File Server, Amazon FSx for Lustre, Application Load Balancer, and Amazon Virtual Private Cloud. Single-Zone RDS is on the near-term roadmap, and other services will come later based on customer demand. Applications running in a Local Zone can also make use of services in the parent region.

Parent Region – As I mentioned earlier, the new Local Zone is a logical extension of the US West (Oregon) region, and is managed by the “control plane” in the region. API calls, CLI commands, and the AWS Management Console should use “us-west-2” or US West (Oregon).

AWS – Other parts of AWS will continue to work as expected after you start to use this Local Zone. Your IAM resources, CloudFormation templates, and Organizations are still relevant and applicable, as are your tools and (perhaps most important) your investment in AWS training.

Pricing & Billing – Instances and other AWS resources in Local Zones will have different prices than in the parent region. Billing reports will include a prefix that is specific to a group of Local Zones that share a physical location. EC2 instances are available in On Demand & Spot form, and you can also purchase Savings Plans.

Using a Local Zone
The first Local Zone is available today, and you can request access here:

In early 2020, you will be able to opt in using the console, the CLI, or an API call.

After opting in, I can list my AZs and see that the Local Zone is included:

Then I create a new VPC subnet for the Local Zone. This gives me transparent, seamless connectivity between the parent region in Oregon and the Local Zone in Los Angeles, all within the VPC:
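Once CLI opt-in is available, a hedged sketch of that flow might look like this (the VPC ID and CIDR block are placeholders, and the exact flags may differ):

# Opt in to the Los Angeles Local Zone group
aws ec2 modify-availability-zone-group --region us-west-2 \
    --group-name us-west-2-lax-1 --opt-in-status opted-in

# Confirm that the Local Zone shows up alongside the regular Availability Zones
aws ec2 describe-availability-zones --region us-west-2 --all-availability-zones

# Create a subnet for the Local Zone inside an existing VPC
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
    --cidr-block 10.0.16.0/20 --availability-zone us-west-2-lax-1a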

I can create EBS volumes:

They are, as usual, ready within seconds:

I can also see and use the Local Zone from within the AWS Management Console:

I can also use the AWS APIs, CloudFormation templates, and so forth.

Thinking Ahead
Local Zones give you even more architectural flexibility. You can think big, and you can think different! You now have the components, tools, and services at your fingertips to build applications that make use of any conceivable combination of legacy on-premises resources, modern on-premises cloud resources via AWS Outposts, resources in a Local Zone, and resources in one or more AWS regions.

In the fullness of time (as Andy Jassy often says), there could very well be more than one Local Zone in any given geographic area. In 2020, we will open a second one in Los Angeles (us-west-2-lax-1b), and are giving consideration to other locations. We would love to get your advice on locations, so feel free to leave me a comment or two!

Now Available
The Local Zone in Los Angeles is available now and you can start using it today. Learn more about Local Zones.

Jeff;

 

Coming Soon – Graviton2-Powered General Purpose, Compute-Optimized, & Memory-Optimized EC2 Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/coming-soon-graviton2-powered-general-purpose-compute-optimized-memory-optimized-ec2-instances/

We launched the first generation (A1) of Arm-based, Graviton-powered EC2 instances at re:Invent 2018. Since that launch, thousands of our customers have used them to run many different types of scale-out workloads including containerized microservices, web servers, and data/log processing.

The Operating System Vendors (OSV) and Independent Software Vendor (ISV) communities have been quick to embrace the Arm architecture and the A1 instances. You have your pick of multiple Linux & Unix distributions including Amazon Linux 2, Ubuntu, Red Hat, SUSE, Fedora, Debian, and FreeBSD:

You can also choose between three container services (Docker, Amazon ECS, and Amazon Elastic Kubernetes Service), multiple system agents, and lots of developer tools (AWS Developer Tools, Jenkins, and more).

The feedback on these instances has been strong and positive, and our customers have told us that they are ready to use Arm-based servers on their more demanding compute-heavy and memory-intensive workloads.

Graviton2
Today I would like to give you a sneak peek at the next generation of Arm-based EC2 instances. These instances are built on the AWS Nitro System and will be powered by the new Graviton2 processor. This is a custom AWS design that is built using a 7 nm (nanometer) manufacturing process. It is based on 64-bit Arm Neoverse cores, and can deliver up to 7x the performance of the A1 instances, including twice the floating point performance. Additional memory channels and double-sized per-core caches speed memory access by up to 5x.

All of these performance enhancements come together to give these new instances a significant performance benefit over the 5th generation (M5, C5, R5) of EC2 instances. Our initial benchmarks show the following per-vCPU performance improvements over the M5 instances:

  • SPECjvm® 2008: +43% (estimated)
  • SPEC CPU® 2017 integer: +44% (estimated)
  • SPEC CPU 2017 floating point: +24% (estimated)
  • HTTPS load balancing with Nginx: +24%
  • Memcached: +43% performance, at lower latency
  • x264 video encoding: +26%
  • EDA simulation with Cadence Xcelium: +54%

Based on these results, we are planning to use these instances to power Amazon EMR, Elastic Load Balancing, Amazon ElastiCache, and other AWS services.

The new instances raise the already-high bar on AWS security. Building on the existing capabilities of the AWS Nitro System, memory on the instances is encrypted with 256-bit keys that are generated at boot time, and which never leave the server.

We are working on three types of Graviton2-powered EC2 instances (the d suffix indicates NVMe local storage):

General Purpose (M6g and M6gd) – 1-64 vCPUs and up to 256 GiB of memory.

Compute-Optimized (C6g and C6gd) – 1-64 vCPUs and up to 128 GiB of memory.

Memory-Optimized (R6g and R6gd) – 1-64 vCPUs and up to 512 GiB of memory.

The instances will have up to 25 Gbps of network bandwidth, 18 Gbps of EBS-Optimized bandwidth, and will also be available in bare metal form. I will have more information to share with you in 2020.

M6g Preview
We are now running a preview of the M6g instances for testing on non-production workloads; if you are interested, please contact us.

Jeff;

AWS Load Balancer Update – Lots of New Features for You!

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-load-balancer-update-lots-of-new-features-for-you/

The AWS Application Load Balancer (ALB) and Network Load Balancer (NLB) are important parts of any highly available and scalable system. Today I am happy to share a healthy list of new features for ALB and NLB, all driven by customer requests.

Here’s what I have:

  • Weighted Target Groups for ALB
  • Least Outstanding Requests for ALB
  • Subnet Expansion for NLB
  • Private IP Address Selection for Internal NLB
  • Shared VPC Support for NLB

All of these features are available now and you can start using them today!

It’s time for a closer look…

Weighted Target Groups for ALB
You can now use traffic weights for your ALB target groups; this will be very helpful for blue/green deployments, canary deployments, and hybrid migration/burst scenarios. You can register multiple target groups with any of the forward actions in your ALB routing rules, and associate a weight (0-999) with each one. Here’s a simple last-chance rule that sends 99% of my traffic to tg1 and the remaining 1% to tg2:
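As a hedged CLI sketch of that rule (the listener and target group ARNs are placeholders):

# Forward 99% of requests to tg1 and 1% to tg2
aws elbv2 modify-listener \
    --listener-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:listener/app/my-alb/1234567890abcdef/1234567890abcdef \
    --default-actions '[{
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/tg1/1111111111111111", "Weight": 99},
          {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/tg2/2222222222222222", "Weight": 1}
        ]
      }
    }]'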

You can use this feature in conjunction with group-level target stickiness in order to maintain a consistent customer experience for a specified duration:

To learn more, read about Listeners for Your Load Balancers.

Least Outstanding Requests for ALB
You can now balance requests across targets based on the target with the lowest number of outstanding requests. This is especially useful for workloads with varied request sizes, target groups with containers & other targets that change frequently, and targets with varied levels of processing power, including those with a mix of instance types in a single auto scaling group. You can enable this new load balancing option by editing the attributes of an existing target group:
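Here is a hedged CLI sketch of the same change (the target group ARN is a placeholder):

# Switch the target group from round robin to least outstanding requests
aws elbv2 modify-target-group-attributes \
    --target-group-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/tg1/1111111111111111 \
    --attributes Key=load_balancing.algorithm.type,Value=least_outstanding_requests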

Enabling this option will disable any slow start; to learn more, read about ALB Routing Algorithms.

Subnet Expansion Support for NLB
You now have the flexibility to add additional subnets to an existing Network Load Balancer. This gives you more scaling options, and allows you to expand into newly opened Availability Zones while maintaining high availability. Select the NLB, and click Edit subnets in the Actions menu:

Then choose one or more subnets to add:
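From the CLI, a hedged sketch of the same operation (the load balancer ARN and subnet IDs are placeholders):

# Add subnets to an existing Network Load Balancer
aws elbv2 set-subnets \
    --load-balancer-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/net/my-nlb/1234567890abcdef \
    --subnets subnet-0aaaaaaaaaaaaaaaa subnet-0bbbbbbbbbbbbbbbb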

This is a good time to talk about multiple availability zones and redundancy. Since you are adding a new subnet, you want to make sure that you either have targets in it, or have cross-zone load balancing enabled.

Private IP Address Selection for Internal NLB
You can now select the private IPv4 address that is used for your internal-facing Network Load Balancer, on a per-subnet basis. This gives you additional control over network addressing, and removes the need to manually ascertain addresses and configure them into clients that do not support DNS-based routing:
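For example, here is a hedged sketch of creating an internal NLB with specific private addresses (the subnet IDs and addresses are illustrative):

# Pin the private IPv4 address used by the NLB in each subnet
aws elbv2 create-load-balancer \
    --name my-internal-nlb --type network --scheme internal \
    --subnet-mappings SubnetId=subnet-0aaaaaaaaaaaaaaaa,PrivateIPv4Address=10.0.1.25 \
                      SubnetId=subnet-0bbbbbbbbbbbbbbbb,PrivateIPv4Address=10.0.2.25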

You can also choose your own private IP addresses when you add additional subnets to an existing NLB.

Shared VPC Support for NLB
You can now create NLBs in shared VPCs. Using NLBs with VPC sharing, you can route traffic across subnets in VPCs owned by a centrally managed account in the same AWS Organization. You can also use NLBs to create an AWS PrivateLink service, which will enable users to privately access your services in the shared subnets from other VPCs or on-premises networks, without using public IPs or requiring the traffic to traverse the internet.
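If you want to expose the NLB as a PrivateLink endpoint service, a minimal sketch (with a placeholder load balancer ARN) looks roughly like this:

# Create a PrivateLink endpoint service backed by the NLB
aws ec2 create-vpc-endpoint-service-configuration \
    --network-load-balancer-arns arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/net/my-nlb/1234567890abcdef \
    --acceptance-required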

Jeff;

 

Running Cost-effective queue workers with Amazon SQS and Amazon EC2 Spot Instances

Post Syndicated from peven original https://aws.amazon.com/blogs/compute/running-cost-effective-queue-workers-with-amazon-sqs-and-amazon-ec2-spot-instances/

This post is contributed by Ran Sheinberg | Sr. Solutions Architect, EC2 Spot & Chad Schmutzer | Principal Developer Advocate, EC2 Spot | Twitter: @schmutze

Introduction

Amazon Simple Queue Service (SQS) is used by customers to run decoupled workloads in the AWS Cloud as a best practice, in order to increase their applications’ resilience. You can use a worker tier to do background processing of images, audio, documents and so on, as well as offload long-running processes from the web tier. This blog post covers the benefits of pairing Amazon SQS and Spot Instances to maximize cost savings in the worker tier, and a customer success story.

Solution Overview

Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a common best practice to use Amazon SQS with decoupled applications. Amazon SQS increases an application’s resilience by decoupling the direct communication between the frontend application and the worker tier that does data processing. If a worker node fails, the jobs that were running on that node return to the Amazon SQS queue for a different node to pick up.

Both the frontend and worker tier can run on Spot Instances, which offer spare compute capacity at steep discounts compared to On-Demand Instances. Spot Instances optimize your costs on the AWS Cloud and scale your application’s throughput up to 10 times for the same budget. Spot Instances can be interrupted with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. These can include analytics, containerized workloads, high performance computing (HPC), stateless web servers, rendering, CI/CD, and queue worker nodes—which is the focus of this post.

Worker tiers of a decoupled application are typically fault-tolerant, which makes them prime candidates for running on interruptible capacity. Running your Amazon SQS worker tier on Spot Instances allows for more robust, cost-optimized applications.

By using EC2 Auto Scaling groups with multiple instance types that you configured as suitable for your application (for example, m4.xlarge, m5.xlarge, c5.xlarge, and c4.xlarge, in multiple Availability Zones), you can spread the worker tier’s compute capacity across many Spot capacity pools (a combination of instance type and Availability Zone). This increases the chance of achieving the scale that’s required for the worker tier to ingest messages from the queue, and of keeping that scale when Spot Instance interruptions occur, while selecting the lowest-priced Spot Instances in each availability zone.

You can also choose the capacity-optimized allocation strategy for the Spot Instances in your Auto Scaling group. This strategy automatically selects instances that have a lower chance of interruption, which decreases the chances of restarting jobs due to Spot interruptions. When Spot Instances are interrupted, your Auto Scaling group automatically replenishes the capacity from a different Spot capacity pool in order to achieve your desired capacity. Read the blog post “Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances” for more details on how to choose the suitable allocation strategy.
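As a hedged sketch (the launch template, subnet IDs, and sizes are placeholders), an Auto Scaling group diversified across instance types with the capacity-optimized strategy might be created like this:

# Spread Spot capacity across several instance types and subnets/AZs
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name queue-workers \
    --min-size 0 --max-size 20 --desired-capacity 2 \
    --vpc-zone-identifier "subnet-0aaaaaaaaaaaaaaaa,subnet-0bbbbbbbbbbbbbbbb" \
    --mixed-instances-policy '{
      "LaunchTemplate": {
        "LaunchTemplateSpecification": {"LaunchTemplateName": "worker-template", "Version": "$Latest"},
        "Overrides": [
          {"InstanceType": "m4.xlarge"}, {"InstanceType": "m5.xlarge"},
          {"InstanceType": "c4.xlarge"}, {"InstanceType": "c5.xlarge"}
        ]
      },
      "InstancesDistribution": {
        "OnDemandPercentageAboveBaseCapacity": 0,
        "SpotAllocationStrategy": "capacity-optimized"
      }
    }'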

We focus on three main points in this blog:

  1. Best practices for using Spot Instances with Amazon SQS
  2. A customer example that uses these components
  3. An example solution that can help you get started quickly

Application of Amazon SQS with Spot Instances

Amazon SQS eliminates the complexity of managing and operating message-oriented middleware. Using Amazon SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available. Amazon SQS is a fully managed service which allows you to set up a queue in seconds. It also allows you to use your preferred SDK to start writing and reading to and from the queue within minutes.

In the following example, we describe an AWS architecture that brings together the Amazon SQS queue and an EC2 Auto Scaling group running Spot Instances. The architecture is used for decoupling the worker tier from the web tier by using Amazon SQS. The example uses the Protect feature (which we will explain later in this post) to ensure that an instance currently processing a job does not get terminated by the Auto Scaling group when it detects that a scale-in activity is required due to a Dynamic Scaling Policy.

AWS reference architecture used for decoupling the worker tier from the web tier by using Amazon SQS

Customer Example: How Trax Retail uses Auto Scaling groups with Spot Instances in their Amazon SQS application

Trax decided to run its queue worker tier exclusively on Spot Instances due to the fault-tolerant nature of its architecture and for cost-optimization purposes. The company digitizes the physical world of retail using Computer Vision. Their ‘Trax Factory’ transforms individual shelf images into data and insights about retail store conditions.

Built using asynchronous event-driven architecture, Trax Factory is a cluster of microservices in which the completion of one service triggers the activation of another service. The worker tier uses Auto Scaling groups with dynamic scaling policies to increase and decrease the number of worker nodes in the worker tier.

You can create a Dynamic Scaling Policy by doing the following:

  1. Observe an Amazon CloudWatch metric. Watch the metric for the current number of messages in the Amazon SQS queue (ApproximateNumberOfMessagesVisible).
  2. Create a CloudWatch alarm. This alarm should be based on the metric from the prior step.
  3. Use your CloudWatch alarm in a Dynamic Scaling Policy. Use this policy to increase and decrease the number of EC2 instances in the Auto Scaling group, as sketched below.
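Here is a hedged CLI sketch of those steps (the group name, queue name, and threshold are illustrative):

# Step 3 first, so the alarm can reference the policy ARN it returns
POLICY_ARN=$(aws autoscaling put-scaling-policy \
    --auto-scaling-group-name queue-workers \
    --policy-name scale-out \
    --adjustment-type ChangeInCapacity --scaling-adjustment 2 \
    --query PolicyARN --output text)

# Steps 1 and 2: alarm on the SQS queue depth metric, wired to the scaling policy
aws cloudwatch put-metric-alarm \
    --alarm-name queue-depth-high \
    --namespace AWS/SQS --metric-name ApproximateNumberOfMessagesVisible \
    --dimensions Name=QueueName,Value=my-worker-queue \
    --statistic Average --period 300 --evaluation-periods 1 \
    --threshold 100 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$POLICY_ARN"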

In Trax’s case, the number of messages in the queue is highly variable. To minimize the time it takes to scale, Trax built a service that calls the SQS API to check the current number of messages in the queue more frequently, instead of waiting for the five-minute metric refresh interval in CloudWatch.

Trax ensures that its applications are always scaled to meet the demand by leveraging the inherent elasticity of Amazon EC2 instances. This elasticity ensures that end users are never affected and/or service-level agreements (SLA) are never violated.

With a Dynamic Scaling Policy, the Auto Scaling group can detect when the number of messages in the queue has decreased, so that it can initiate a scale-in activity. The Auto Scaling group uses its configured termination policy for selecting the instances to be terminated. However, this policy poses the risk that the Auto Scaling group might select an instance for termination while that instance is currently processing an image. That instance’s work would be lost (although the image would eventually be processed by reappearing in the queue and getting picked up by another worker node).

To decrease this risk, you can use Auto Scaling group instance protection. Every time an instance fetches a job from the queue, it also makes an API call to EC2 Auto Scaling to protect itself from scale-in. The Auto Scaling group does not select the protected, working instance for termination until the instance finishes processing the job and calls the API again to remove the protection.
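A minimal shell sketch of this pattern, in the spirit of the worker script shown later in this post (the queue URL, group name, and process_job helper are placeholders, and error handling is omitted):

# Identify this instance so it can protect itself while it works
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
ASG_NAME=queue-workers
QUEUE_URL=https://sqs.us-east-1.amazonaws.com/123456789012/my-worker-queue

# Fetch one message and enable scale-in protection before processing it
MSG=$(aws sqs receive-message --queue-url "$QUEUE_URL" --max-number-of-messages 1)
RECEIPT=$(echo "$MSG" | jq -r '.Messages[0].ReceiptHandle')

aws autoscaling set-instance-protection --instance-ids "$INSTANCE_ID" \
    --auto-scaling-group-name "$ASG_NAME" --protected-from-scale-in

process_job "$MSG"   # placeholder for the actual work

# Delete the message and remove the protection once the job is done
aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
aws autoscaling set-instance-protection --instance-ids "$INSTANCE_ID" \
    --auto-scaling-group-name "$ASG_NAME" --no-protected-from-scale-in

This sketch uses jq to parse the receipt handle; a real worker would loop and handle empty responses.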

Handling Spot Instance interruptions

This instance-protection solution ensures that no work is lost during scale-in activities. However, protecting from scale-in does not work when an instance is marked for termination due to Spot Instance interruptions. These interruptions occur when there’s increased demand for On-Demand Instances in the same capacity pool (a combination of an instance type in an Availability Zone).

Applications can minimize the impact of a Spot Instance interruption. To do so, an application catches the two-minute interruption notification (available in the instance’s metadata), and instructs itself to stop fetching jobs from the queue. If there’s an image still being processed when the two minutes expire and the instance is terminated, the application does not delete the message from the queue after finishing the process. Instead, the message simply becomes visible again for another instance to pick up and process after the Amazon SQS visibility timeout expires.

Alternatively, you can release any ongoing job back to the queue upon receiving a Spot Instance interruption notification by setting the visibility timeout of the specific message to 0. Releasing the message immediately potentially decreases the total time it takes to process it.
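A hedged sketch of that check, run periodically by the worker (reusing the placeholder queue URL and receipt handle from the earlier sketch):

# If a Spot interruption notice is present, stop polling and release the in-flight message
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    http://169.254.169.254/latest/meta-data/spot/instance-action)
if [ "$STATUS" = "200" ]; then
    aws sqs change-message-visibility --queue-url "$QUEUE_URL" \
        --receipt-handle "$RECEIPT" --visibility-timeout 0
fi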

Testing the solution

If you’re not currently using Spot Instances in your queue worker tier, we suggest testing the approach described in this post.

For that purpose, we built a simple solution to demonstrate the capabilities mentioned in this post, using an AWS CloudFormation template. The stack includes an Amazon Simple Storage Service (S3) bucket with a CloudWatch trigger to push notifications to an SQS queue after an image is uploaded to the Amazon S3 bucket. Once the message is in the queue, it is picked up by the application running on the EC2 instances in the Auto Scaling group. Then, the image is converted to PDF, and the instance is protected from scale-in for as long as it has an active processing job.

To see the solution in action, deploy the CloudFormation template. Then upload an image to the Amazon S3 bucket. In the Auto Scaling Groups console, check the instance protection status on the Instances tab. The protection status is shown in the following screenshot.

instance protection status in console

You can also see the application logs using CloudWatch Logs:

/usr/local/bin/convert-worker.sh: Found 1 messages in https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB

/usr/local/bin/convert-worker.sh: Found work to convert. Details: INPUT=Capture1.PNG, FNAME=capture1, FEXT=png

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --protected-from-scale-in

/usr/local/bin/convert-worker.sh: Convert done. Copying to S3 and cleaning up

/usr/local/bin/convert-worker.sh: Running: aws s3 cp /tmp/capture1.pdf s3://qtest-s3bucket-18fdpm2j17wxx

/usr/local/bin/convert-worker.sh: Running: aws sqs --output=json delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB --receipt-handle

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --no-protected-from-scale-in

Conclusion

This post helps you architect fault tolerant worker tiers in a cost optimized way. If your queue worker tiers are fault tolerant and use the built-in Amazon SQS features, you can increase your application’s resilience and take advantage of Spot Instances to save up to 90% on compute costs.

In this post, we emphasized several best practices to help get you started saving money using Amazon SQS and Spot Instances. The main best practices are:

  • Diversifying your Spot Instances using Auto Scaling groups, and selecting the right Spot allocation strategy
  • Protecting instances from scale-in activities while they process jobs
  • Using the Spot interruption notification so that the application stops polling the queue for new jobs before the instance is terminated

We hope you found this post useful. If you’re not using Spot Instances in your queue worker tier, we suggest testing the approach described here. Finally, we would like to thank the Trax team for sharing its architecture and best practices. If you want to learn more, watch the “This is my architecture” video featuring Trax and their solution.

We’d love your feedback—please comment and let me know what you think.


About the authors

Ran Sheinberg is a specialist solutions architect for EC2 Spot Instances with Amazon Web Services. He works with AWS customers on cost optimizing their compute spend by utilizing Spot Instances across different types of workloads: stateless web applications, queue workers, containerized workloads, analytics, HPC and others.

As a Principal Developer Advocate for EC2 Spot at AWS, Chad’s job is to make sure our customers are saving at scale by using EC2 Spot Instances to take advantage of the most cost-effective way to purchase compute capacity. Follow him on Twitter here! @schmutze

 

New – Amazon EBS Fast Snapshot Restore (FSR)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-ebs-fast-snapshot-restore-fsr/

Amazon Elastic Block Store (EBS) has been around for more than a decade and is a fundamental AWS building block. You can use it to create persistent storage volumes that can store up to 16 TiB and supply up to 64,000 IOPS (Input/Output Operations per Second). You can choose between four types of volumes, making the choice that best addresses your data transfer throughput, IOPS, and pricing requirements. If your requirements change, you can modify the type of a volume, expand it, or change the performance while the volume remains online and active. EBS snapshots allow you to capture the state of a volume for backup, disaster recovery, and other purposes. Once created, a snapshot can be used to create a fresh EBS volume. Snapshots are stored in Amazon Simple Storage Service (S3) for high durability.

Our ever-creative customers are using EBS snapshots in many interesting ways. In addition to the backup and disaster recovery use cases that I just mentioned, they are using snapshots to quickly create analytical or test environments using data drawn from production, and to support Virtual Desktop Infrastructure (VDI) environments. As you probably know, the AMIs (Amazon Machine Images) that you use to launch EC2 instances are also stored as one or more snapshots.

Fast Snapshot Restore
Today we are launching Fast Snapshot Restore (FSR) for EBS. You can enable it for new and existing snapshots on a per-AZ (Availability Zone) basis, and then create new EBS volumes that deliver their maximum performance and do not need to be initialized.

This performance enhancement will allow you to build AWS-based systems that are even faster and more responsive than before. Faster boot times will speed up your VDI environments and allow your Auto Scaling Groups to come online and start processing traffic more quickly, even if you use large and/or custom AMIs. I am sure that you will soon dream up new applications that can take advantage of this new level of speed and predictability.

Fast Snapshot Restore can be enabled on a snapshot even while the snapshot is being created. If you create nightly backup snapshots, enabling them for FSR will allow you to do fast restores the following day regardless of the size of the volume or the snapshot.

Enabling & Using Fast Snapshot Restore
I can get started in minutes! I open the EC2 Console and find the first snapshot that I want to set up for fast restore:

I select the snapshot and choose Manage Fast Snapshot Restore from the Actions menu:

Then I select the Availability Zones where I plan to create EBS volumes, and click Save:

After the settings are saved, I receive a confirmation:

The console shows me that my snapshot is being enabled for Fast Snapshot Restore:

The status progresses from enabling to optimizing, and then to enabled. Behind the scenes and with no extra effort on my part, the optimization process provisions extra resources to deliver the fast restores, proceeding at a rate of one TiB per hour. By contrast, non-optimized volumes retrieve data directly from the S3-stored snapshot on an incremental, on-demand basis.

Once the optimization is complete, I can create volumes from the snapshot in the usual way, confident that they will be ready in seconds and pre-initialized for full performance! Each FSR-enabled snapshot supports creation of up to 10 initialized volumes per hour per Availability Zone; additional volume creations will be non-initialized. As my needs change, I can enable Fast Snapshot Restore in additional Availability Zones and I can disable it in Zones where I had previously enabled it.

When Fast Snapshot Restore is enabled for a snapshot in a particular Availability Zone, a bucket-based credit system governs the acceleration process. Creating a volume consumes a credit; the credits refill over time, and the maximum number of credits is a function of the FSR-enabled snapshot size. Here are some guidelines:

  • A 100 GiB FSR-enabled snapshot will have a maximum credit balance of 10, and a fill rate of 10 credits per hour.
  • A 4 TiB FSR-enabled snapshot will have a maximum credit balance of 1, and a fill rate of 1 credit every 4 hours.

In other words, you can do 1 TiB of restores per hour for a given FSR-enabled snapshot within an AZ.

Things to Know
Here are some things to know about Fast Snapshot Restore:

Regions & AZs – Fast Snapshot Restore is available in all Availability Zones of the US East (N. Virginia), US West (Oregon), US West (N. California), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Sydney), and Asia Pacific (Tokyo) Regions.

Pricing – You pay $0.75 for each hour that Fast Snapshot Restore is enabled for a snapshot in a particular Availability Zone, pro-rated and with a minimum of one hour.

Monitoring – You can use the following per-minute CloudWatch metrics to track the state of the credit bucket for each FSR-enabled snapshot:

  • FastSnapshotRestoreCreditsBalance – The number of volume creation credits that are available.
  • FastSnapshotRestoreCreditsBucketSize – The maximum number of volume creation credits that can be accumulated.

CLI & Programmatic Access – You can use the enable-fast-snapshot-restores, describe-fast-snapshot-restores, and disable-fast-snapshot-restores commands to create and manage your accelerated snapshots from the command line. You can also use the EnableFastSnapshotRestores, DescribeFastSnapshotRestores, and DisableFastSnapshotRestores API functions from your application code.
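For example (the snapshot ID is a placeholder):

# Enable Fast Snapshot Restore for a snapshot in two Availability Zones
aws ec2 enable-fast-snapshot-restores \
    --source-snapshot-ids snap-0123456789abcdef0 \
    --availability-zones us-east-1a us-east-1b

# Check the state (enabling, optimizing, enabled, ...)
aws ec2 describe-fast-snapshot-restores \
    --filters Name=snapshot-id,Values=snap-0123456789abcdef0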

CloudWatch Events – You can use the EBS Fast Snapshot Restore State-change Notification event type to invoke Lambda functions or other targets when the state of a snapshot/AZ pair changes. Events are emitted on successful and unsuccessful transitions to the enabling, optimizing, enabled, disabling, and disabled states.

Data Lifecycle Manager – You can enable FSR on snapshots created by your DLM lifecycle policies, specify AZs, and specify the number of snapshots to be FSR-enabled. You can use an existing CloudFormation template to integrate FSR into your DLM policies (read about the AWS::DLM::LifecyclePolicy to learn more).

In the Works
We are launching with support for snapshots that you own. Over time, we intend to expand coverage and allow you to enable Fast Snapshot Restore for snapshots that you have been granted access to.

Available Now
Fast Snapshot Restore is available now and you can start using it today!

Jeff;

 

Add defense in depth against open firewalls, reverse proxies, and SSRF vulnerabilities with enhancements to the EC2 Instance Metadata Service

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/

Since it first launched over 10 years ago, the Amazon EC2 Instance Metadata Service (IMDS) has helped customers build secure and scalable applications. The IMDS solved a big security headache for cloud users by providing access to temporary, frequently rotated credentials, removing the need to hardcode or distribute sensitive credentials to instances manually or programmatically. Attached locally to every EC2 instance, the IMDS runs on a special “link local” IP address of 169.254.169.254, which means that only software running on the instance can access it. For applications with access to the IMDS, it makes available metadata about the instance, its network, and its storage. The IMDS also makes the AWS credentials available for any IAM role that is attached to the instance.

When you run applications in the cloud, application security is as critical as instance security; if the applications running on an instance have vulnerabilities or misconfigurations, there can be serious consequences. While application security plays an important role in a layered defense, AWS also constantly evaluates where to add layers, even within the instance, to minimize the damage that can result when these situations arise.

Today, AWS is making v2 of the EC2 Instance Metadata Service (IMDSv2) available. The existing instance metadata service (IMDSv1) is fully secure, and AWS will continue to support it. But IMDSv2 adds new “belt and suspenders” protections for four types of vulnerabilities that could be used to try to access the IMDS. These new protections go well beyond other types of mitigations, while working seamlessly with existing mitigations such as restricting IAM roles and using local firewall rules to restrict access to the IMDS. AWS is also making new versions of the AWS SDKs and CLIs available that support IMDSv2.

What’s new in IMDSv2

With IMDSv2, every request is now protected by session authentication. A session begins and ends a series of requests that software running on an EC2 instance uses to access the locally-stored EC2 instance metadata and credentials. The software starts a session with a simple HTTP PUT request to IMDSv2. IMDSv2 returns a secret token to the software running on the EC2 instance, which will use the token as a password to make requests to IMDSv2 for metadata and credentials. Unlike traditional passwords, you don’t need to worry about getting the token to the software, because the software gets it for itself with the PUT request. The token is never stored by IMDSv2 and can never be retrieved by subsequent calls, so a session and its token are effectively destroyed when the process using the token terminates. There’s no limit on the number of requests within a single session, and there’s no limit on the number of IMDSv2 sessions. Sessions can last up to six hours and, for added security, a session token can only be used directly from the EC2 instance where that session began.

For example, this curl recipe retrieves a session token that’s valid for the full six hours (21600 seconds) and then uses that token to access the EC2 instance’s profile metadata:


TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`

curl http://169.254.169.254/latest/meta-data/profile -H "X-aws-ec2-metadata-token: $TOKEN"

If you need to write code against the IMDSv2 directly, you can get more detail on the new scheme in the EC2 User Guide.

How these changes add defense in depth

IMDSv2’s changes are easy to use, and you’ll start using it automatically if you’re using the updated AWS SDKs and CLIs. These changes go beyond other types of mitigations to protect against misconfigured open Web Application Firewalls, misconfigured open reverse proxies, unpatched SSRF vulnerabilities, and misconfigured open layer-3 firewalls and network address translation.

Protecting against open Web Application Firewalls

Some Web Application Firewall (WAF) services, such as AWS WAF, can’t be configured to act as open WAFs. However, some third-party WAFs can be misconfigured to allow attackers unauthorized access to the network behind the WAF, including the EC2 IMDS.

Many WAFs are designed to act invisibly, so that they can protect websites and applications without administrators having to change or reconfigure the applications that are behind the WAF. To be transparent, WAFs usually pass on all of the headers that come with a request, and do not add their own headers, such as the standard X-Forwarded-For header that other kinds of proxies add. In other words, applications behind a WAF get requests just as the requester sent them.

The AWS approach is to block open WAFs by using a type of request that open WAFs very rarely support, HTTP PUT requests. Although web services such as Amazon S3 use PUT requests for object storage, they’re an uncommon type of request for websites and browsers to use. Our analysis of third-party WAF products and open WAF misconfigurations found that the vast majority do not permit HTTP PUT requests. We’re using this PUT request to provide a new layer of defense that goes beyond any existing capabilities – we’ve architected the IMDSv2 service to require a PUT request at the beginning of a session, which will prevent open WAFs from being abused to access the IMDS in the vast majority of cases.

Protecting against open reverse proxies

As it happens, it’s also very rare for open reverse proxies to allow PUT requests, but IMDSv2 has another layer of defense against open reverse proxies. Reverse proxies, such as Apache httpd or Squid, can also be misconfigured to allow external requests to reach internal resources, but it’s still normal for these proxies to send an X-Forwarded-For HTTP header, which is used to pass on the IP address of the original caller. IMDSv2 will not issue session tokens to any caller with an X-Forwarded-For header, which is effective at blocking unauthorized access due to misconfigurations like an open reverse proxy.

Protecting against SSRF vulnerabilities

SSRF vulnerabilities allow attackers to make unauthorized requests from web applications. Since these requests come from the application itself, they can be used to access internal resources that the application has access to but that were not intended to be accessible to outsiders. SSRF vulnerabilities vary in their severity, and some are immune to other types of mitigations. For instance, blocking SSRFs through static headers in instance metadata requests is effective only when the vulnerability merely allows the attacker to control the URL that is being requested; however, AWS analysis found many SSRF vulnerabilities that allow attackers to set arbitrary headers because the SSRF vulnerability impacts the application’s own header processing.

IMDSv2’s combination of beginning a session with a PUT request, and then requiring the secret session token in other requests, is always strictly more effective than requiring only a static header. AWS analysis of real-world vulnerabilities found that this combination protects against the vast majority of SSRF vulnerabilities.

Protecting against open layer 3 firewalls and NATs

Last, there is a final layer of defense in IMDSv2 that is designed to protect EC2 instances that have been misconfigured as open routers, layer 3 firewalls, VPNs, tunnels, or NAT devices. With IMDSv2, the PUT response containing the secret token will, by default, not be able to travel outside the instance. This is accomplished by having the default Time To Live (TTL) on the low-level IP packets containing the secret token set to “1,” much lower than a typical value, such as “64.” Hardware and software that handle packets, including EC2 instances, subtract 1 from each packet’s TTL field whenever they pass it on. If the TTL gets to 0, the packet is discarded, and an error message is sent back to the sender. A packet with a TTL of “64” can therefore make sixty-four “hops” in a network before giving up, while a packet with a TTL of “1” can exist in just one. This feature allows legitimate traffic to get to an intended destination, but is designed to stop packets from endlessly running around in circles if there’s a loop in a network.

With IMDSv2, setting the TTL value to “1” means that requests from the EC2 instance itself will work because they’re returned to the caller (on the instance) before the subtraction occurs. But if the EC2 instance has been misconfigured as an open router, layer 3 firewall, VPN, tunnel, or NAT device, the response containing the token will have its TTL reduced to zero before leaving the instance, and the packet containing the response will be discarded on its way out of the instance, preventing transport to the attacker. The information simply won’t make it further than the EC2 instance itself, which means that an attacker won’t get the response back with the token, and with it the ability to access instance metadata, even if they’ve been successful at getting past all other defenses.

Making the transition

Both IMDSv1 and IMDSv2 will be available and enabled by default, and customers can choose which they will use. The IMDS can now be restricted to v2 only, or IMDS (v1 and v2) can also be disabled entirely. AWS recommends adopting v2 and restricting access to v2 only for added security. IMDSv1 remains available for customers who have tools and scripts using v1, and who are comfortable with the existing security posture of their instances.
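For example, here is a hedged sketch of requiring v2 on a single instance, or turning the IMDS off entirely (the instance ID is a placeholder):

# Require session-based (IMDSv2) requests on this instance
aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-tokens required --http-endpoint enabled

# Or disable the IMDS on the instance entirely
aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-endpoint disabled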

A number of tools are available to make transitioning to v2 and disabling v1 seamless. Starting today, a new CloudWatch metric is available that provides visibility into the number of v1 calls that are being made on any given instance. Customers can use this metric to monitor how often v1 is still being accessed as Amazon Machine Images, the AWS SDKs, CLIs, cloud-init, and other software that accesses the IMDS are updated, released, and upgraded. When an instance can be launched, activated, and used in service while the metric remains at zero, it is safe to require v2 of the IMDS and disable v1. For more information on transitioning to IMDSv2, see the user guide.

Security can also be further enhanced while this transition is happening. AWS credentials provided by the IMDS now include an ec2:RoleDelivery IAM context key. Credentials provided by the older IMDSv1 have an ec2:RoleDelivery value of “1.0,” and credentials using the new scheme will have an ec2:RoleDelivery value of “2.0.” This context key makes it easy to enforce use of the new scheme on a service-by-service or resource-by-resource basis by using those context keys as conditions in IAM policies, resource policies, or AWS Organizations service control policies. For example, if all of the software accessing an S3 bucket has been upgraded to use IMDSv2, then that S3 bucket can be safely restricted to only allow access to role-account credentials that have the “2.0” value (or greater) for the context key. The effect is that credentials retrieved using IMDSv1 will be prevented from accessing the bucket. AWS CloudTrail is also being updated to record the new ec2:RoleDelivery parameters.
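As a hedged sketch of the S3 example above (the bucket name is a placeholder), a bucket policy can deny role credentials that were delivered by IMDSv1:

# Deny access to the bucket for role credentials retrieved via IMDSv1
cat > require-imdsv2.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "RequireRoleDelivery2",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-example-bucket", "arn:aws:s3:::my-example-bucket/*"],
    "Condition": {"NumericLessThan": {"ec2:RoleDelivery": "2.0"}}
  }]
}
EOF

aws s3api put-bucket-policy --bucket my-example-bucket --policy file://require-imdsv2.json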

Hear about IMDSv2 at re:Invent

Mark Ryland will be talking in more detail about IMDSv2, and the transition to it, at AWS re:Invent in December. We’ll update this post soon with a link to the session in the re:Invent catalog.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

In The Works – New AMD-Powered, Compute-Optimized EC2 Instances (C5a/C5ad)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/in-the-works-new-amd-powered-compute-optimized-ec2-instances-c5a-c5ad/

We’re getting ready to give you even more power and even more choices when it comes to EC2 instances.

We will soon launch C5a and C5ad instances powered by custom second-generation AMD EPYC “Rome” processors running at frequencies as high as 3.3 GHz. You will be able to use these compute-optimized instances to run your batch processing, distributed analytics, web applications and other compute-intensive workloads. Like the existing AMD-powered instances in the M, R and T families, the C5a and C5ad instances are built on the AWS Nitro System and give you an opportunity to balance your instance mix based on cost and performance.

The instances will be available in eight sizes and also in bare metal form, with up to 192 vCPUs and 384 GiB of memory. The C5ad instances will include up to 7.6 TiB of fast, local NVMe storage, making them perfect for video encoding, image manipulation, and other media processing workloads.

The bare metal instances (c5an.metal and c5adn.metal) will offer twice as much memory and double the vCPU count of comparable instances, making them some of the largest and most powerful compute-optimized instances yet. The bare metal variants will have access to 100 Gbps of network bandwidth and will be compatible with Elastic Fabric Adapter — perfect for your most demanding HPC workloads!

I’ll have more information soon, so stay tuned!

Jeff;

New – Savings Plans for AWS Compute Services

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-savings-plans-for-aws-compute-services/

I first wrote about EC2 Reserved Instances a decade ago! Since I wrote that post, our customers have saved billions of dollars by using Reserved Instances to commit to usage of a specific instance type and operating system within an AWS region.

Over the years we have enhanced the Reserved Instance model to make it easier for you to take advantage of the RI discount. This includes:

Regional Benefit – This enhancement gave you the ability to apply RIs across all Availability Zones in a region.

Convertible RIs – This enhancement allowed you to change the operating system or instance type at any time.

Instance Size Flexibility – This enhancement allowed your Regional RIs to apply to any instance size within a particular instance family.

The model, as it stands today, gives you discounts of up to 72%, but it does require you to coordinate your RI purchases and exchanges in order to ensure that you have an optimal mix that covers usage that might change over time.

New Savings Plans
Today we are launching Savings Plans, a new and flexible discount model that provides you with the same discounts as Reserved Instances, in exchange for a commitment to use a specific amount (measured in dollars per hour) of compute power over a one or three year period.

Every type of compute usage has an On Demand price and a (lower) Savings Plan price. After you commit to a specific amount of compute usage per hour, all usage up to that amount will be covered by the Savings Plan, and anything past it will be billed at the On Demand rate.

If you own Reserved Instances, the Savings Plan applies to any On Demand usage that is not covered by the RIs. We will continue to sell RIs, but Savings Plans are more flexible and I think many of you will prefer them!

Savings Plans are available in two flavors:

Compute Savings Plans provide the most flexibility and help to reduce your costs by up to 66% (just like Convertible RIs). The plans automatically apply to any EC2 instance regardless of region, instance family, operating system, or tenancy, including those that are part of EMR, ECS, or EKS clusters, or launched by Fargate. For example, you can shift from C4 to C5 instances, move a workload from Dublin to London, or migrate from EC2 to Fargate, benefiting from Savings Plan prices along the way, without having to do anything.

EC2 Instance Savings Plans apply to a specific instance family within a region and provide the largest discount (up to 72%, just like Standard RIs). Just like with RIs, your savings plan covers usage of different sizes of the same instance type (such as a c5.4xlarge or c5.large) throughout a region. You can even switch from Windows to Linux while continuing to benefit, without having to make any changes to your savings plan.

Purchasing a Savings Plan
AWS Cost Explorer will help you to choose a Savings Plan, and will guide you through the purchase process. Since my own EC2 usage is fairly low, I used a test account that had more usage. I open AWS Cost Explorer, then click Recommendations within Savings Plans:

I choose my Recommendation options, and review the recommendations:

Cost Explorer recommends that I purchase $2.40 of hourly Savings Plan commitment, and projects that I will save 40% (nearly $1,200) per month in comparison to On-Demand. The recommendation takes variable usage and temporary usage spikes into account in order to identify the steady-state capacity for which a Savings Plan makes sense. In my case, the variable usage averages out to $0.04 per hour, which the recommendation leaves as On-Demand.

I can see the recommended Savings Plans at the bottom of the page, select those that I want to purchase, and Add them to my cart:

When I am ready to proceed, I click View cart, review my purchases, and click Submit order to finalize them:

My Savings Plans become active right away. I can use the Cost Explorer’s Performance & Coverage reports to review my actual savings, and to verify that I own sufficient Savings Plans to deliver the desired amount of coverage.

Available Now
As you can see, Savings Plans are easy to use! You can access compute power at discounts of up to 72%, while gaining the flexibility to change compute services, instance types, operating systems, regions, and so forth.

Savings Plans are available in all AWS regions outside of China, and you can start to purchase (and benefit) from them today!

Jeff;

 

Now Available: New C5d Instance Sizes and Bare Metal Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-new-c5d-instance-sizes-and-bare-metal-instances/

Amazon EC2 C5 instances are very popular for running compute-heavy workloads like batch processing, distributed analytics, high-performance computing, machine/deep learning inference, ad serving, highly scalable multiplayer gaming, and video encoding.

In 2018, we added blazing fast local NVMe storage, and named these new instances C5d. They are a great fit for applications that need access to high-speed, low-latency local storage, like video encoding, image manipulation, and other forms of media processing. They also benefit applications that need temporary storage of data, such as batch and log processing, and applications that need caches and scratch files.

Just a few weeks ago, we launched new instance sizes and a bare metal option for C5 instances. Today, we are happy to add the same capabilities to the C5d family: 12xlarge, 24xlarge, and a bare metal option.

The new C5d instance sizes run on Intel's Second Generation Xeon Scalable processors (code-named Cascade Lake) with a sustained all-core turbo frequency of 3.6 GHz and a maximum single-core turbo frequency of 3.9 GHz.

The new processors also enable a new feature called Intel Deep Learning Boost, a capability based on the AVX-512 instruction set. Thanks to the new Vector Neural Network Instructions (AVX-512 VNNI), deep learning frameworks will speed up typical machine learning operations like convolution, and automatically improve inference performance over a wide range of workloads.

These instances are based on the AWS Nitro System, with dedicated hardware accelerators for EBS processing (including crypto operations), the software-defined network inside of each Virtual Private Cloud (VPC), and ENA networking.

New Instance Sizes for C5d: 12xlarge and 24xlarge
Here are the specs:

Instance Name | Logical Processors | Memory  | Local Storage        | EBS-Optimized Bandwidth | Network Bandwidth
c5d.12xlarge  | 48                 | 96 GiB  | 2 x 900 GB NVMe SSD  | 7 Gbps                  | 12 Gbps
c5d.24xlarge  | 96                 | 192 GiB | 4 x 900 GB NVMe SSD  | 14 Gbps                 | 25 Gbps

Previously, the largest C5d instance available was c5d.18xlarge, with 72 logical processors, 144 GiB of memory, and 1.8 TB of storage. As you can see, the new 24xlarge size increases available resources by 33%, in order to help you crunch those super heavy workloads. Last but not least, customers also get 50% more NVMe storage per logical processor on both 12xlarge and 24xlarge, with up to 3.6 TB of local storage!
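A quick check of those ratios, using the 24xlarge figures above and the c5d.18xlarge specs just mentioned:

# c5d.24xlarge vs. the previous largest size, c5d.18xlarge (figures from this post).
prev = {"vcpus": 72, "memory_gib": 144, "storage_tb": 1.8}
new = {"vcpus": 96, "memory_gib": 192, "storage_tb": 3.6}

print(new["vcpus"] / prev["vcpus"] - 1)            # ~0.33 -> 33% more logical processors
print(new["memory_gib"] / prev["memory_gib"] - 1)  # ~0.33 -> 33% more memory
# NVMe storage per logical processor:
print((new["storage_tb"] / new["vcpus"]) / (prev["storage_tb"] / prev["vcpus"]) - 1)  # 0.5 -> 50% more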

Bare Metal C5d
As is the case with the existing bare metal instances (M5, M5d, R5, R5d, z1d, and so forth), your operating system runs on the underlying hardware and has direct access to processor and other hardware.

Bare metal instances can be used to run software with specific requirements, e.g. applications that are exclusively licensed for use on physical, non-virtualized hardware. These instances can also be used to run tools and applications that require access to low-level processor features such as performance counters.

Here are the specs:

Instance Name | Logical Processors | Memory  | Local Storage       | EBS-Optimized Bandwidth | Network Bandwidth
c5d.metal     | 96                 | 192 GiB | 4 x 900 GB NVMe SSD | 14 Gbps                 | 25 Gbps

Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

Now Available!
You can start using these new instances today in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Europe (London), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), South America (São Paulo), and AWS GovCloud (US-West).

Please send us feedback, either on the AWS forum for Amazon EC2, or through your usual AWS support contacts.

Julien;

Processing batch jobs quickly, cost-efficiently, and reliably with Amazon EC2 On-Demand and Spot Instances

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

 

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:

  • A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
  • An Amazon SQS queue to manage the tasks.
  • A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they're complete (a minimal sketch follows this list).
  • Amazon EC2 Auto Scaling groups to model scenarios.
  • Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.
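The consumer script might look something like the following sketch. The queue URL is a placeholder, the work is simulated with a sleep, and error handling is omitted; the actual scripts used for these tests are not shown in this post.

import time

import boto3

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/batch-tasks"  # placeholder
sqs = boto3.client("sqs")

def process(task_body):
    # Simulate eight minutes of single-vCPU work per task.
    time.sleep(8 * 60)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after the work completes; otherwise the visibility timeout
        # expires and SQS makes the task visible again for another consumer.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])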

 

Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, a c5.2xlarge On-Demand instance costs $0.384 per hour, so running 20 instances for five hours brings the batch total to $38.40.
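The sizing and cost figures above reduce to a few lines of arithmetic:

tasks = 6000
task_minutes = 8           # per task, on a single vCPU
vcpus_per_instance = 8     # c5.2xlarge
instances = 20
price_per_hour = 0.384     # c5.2xlarge On-Demand, Ireland

tasks_per_instance_per_hour = vcpus_per_instance * (60 / task_minutes)   # 60 tasks/hour
hours = tasks / (instances * tasks_per_instance_per_hour)                # 5.0 hours
cost = instances * hours * price_per_hour                                # 38.40
print(tasks_per_instance_per_hour, hours, cost)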

 

Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient capacity for On-Demand Instances to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and allowing the batch to complete much earlier than it otherwise would. When the queue is empty, the configuration automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity through diversification.

Spot Instances occasionally experience interruptions, with a two-minute warning, when AWS must reclaim the capacity. You can handle this gracefully by configuring your batch processor code to react to the interruption, such as by checkpointing progress to a data store. This example sets the visibility timeout on the SQS queue to nine minutes, so SQS re-queues any task that doesn't complete in that time.
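One common way to react to that warning from inside the instance is to poll the Spot instance-action metadata endpoint, as in the sketch below. The checkpoint step is a placeholder, and the example in this post relies on the SQS visibility timeout rather than explicit checkpointing.

import time
import urllib.error
import urllib.request

# Returns 404 until EC2 schedules this Spot Instance for interruption.
# (If IMDSv2 is enforced, a session token header is also required.)
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending():
    try:
        urllib.request.urlopen(METADATA_URL, timeout=1)
        return True        # 200 response: a stop/terminate action is scheduled
    except urllib.error.URLError:
        return False       # 404 or unreachable: no interruption notice yet

while not interruption_pending():
    time.sleep(5)
print("Interruption notice received")  # placeholder: checkpoint progress, stop taking new tasks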

To test the impact of the new configuration, another 6,000 tasks are submitted to the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.

 

Instance Type | Provisioning | Host Count | Hourly Instance Cost | Total 30-minute batch cost
c5.18xlarge   | Spot         | 15         | $1.2367              | $9.2753
c5.2xlarge    | Spot         | 22         | $0.1547              | $1.7017
c5.4xlarge    | Spot         | 12         | $0.2772              | $1.6632
c5.9xlarge    | Spot         | 11         | $0.6239              | $3.4315
c5.2xlarge    | On-Demand    | 13         | $0.3840              | $2.4960
c5.4xlarge    | On-Demand    | 3          | $0.7680              | $1.1520
c5.9xlarge    | On-Demand    | 4          | $1.7280              | $3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.
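Recomputing the totals from the table above confirms the numbers:

# (instance type, provisioning, host count, 30-minute batch cost in $)
rows = [
    ("c5.18xlarge", "Spot",      15, 9.2753),
    ("c5.2xlarge",  "Spot",      22, 1.7017),
    ("c5.4xlarge",  "Spot",      12, 1.6632),
    ("c5.9xlarge",  "Spot",      11, 3.4315),
    ("c5.2xlarge",  "On-Demand", 13, 2.4960),
    ("c5.4xlarge",  "On-Demand",  3, 1.1520),
    ("c5.9xlarge",  "On-Demand",  4, 3.4560),
]
total = sum(cost for *_, cost in rows)
print(round(total, 2))             # 23.18
print(round(total / 38.40 * 100))  # ~60 percent of the all-On-Demand batch cost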

 

Summary

This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Architecting a Low-Cost Web Content Publishing System

Post Syndicated from Craig Jordan original https://aws.amazon.com/blogs/architecture/architecting-a-low-cost-web-content-publishing-system/

Introduction

When an IT team first contemplates reducing the on-premises hardware they manage to support their workloads, they often feel a tension between using cloud-native services and taking a lift-and-shift approach. Cloud-native services based on serverless designs could reduce costs and enable a solution that is easier to operate, but they can appear disruptive to end user processes and tools. A lift-and-shift migration, though it can eliminate on-premises hardware and maintain existing workflows, doesn't eliminate the need to manage a server infrastructure, does nothing to improve a team's agility in releasing enhancements after migration to the cloud, and may not optimize the cost of the resulting solution. Rather than settling for an either/or option that sacrifices cost savings and ease of operation in order to be non-intrusive to their web authors' daily work, the University of Saint Thomas, Minnesota team implemented a creative hybrid approach that both avoids end user disruption and achieves the cost savings, agility, and simplified administration that a cloud-native solution can provide.

The Situation

University of St. Thomas wanted to reduce on-premises management of hardware for their university website. In addition, by migrating this functionality to the cloud, they intended to increase the website’s availability. The on-premises solution was deployed on an IIS server maintained by the IT team, but the content of the website was authored by staff members in departments across the university using two different Content Management Systems (CMS). The publishing process from these tools to the web server worked well, and there was no appetite for eliminating the distributed nature of the web site’s development nor the content management systems that the authors were comfortable with.

The IT team hoped to implement a serverless solution utilizing only Amazon Simple Storage Service (S3) to host the static website content. Not only would that reduce the cost of the solution, it would also eliminate having to manage web servers. One of the two content management systems could publish directly to S3, but unfortunately the other CMS could not.

A lift-and-shift migration approach would move the website onto an IIS server in Amazon Elastic Compute Cloud (EC2) and update the publishing process to write its outputs to this new server. This solution would avoid any impact to the authors because all the change would be accomplished behind the scenes by the IT team. However, this approach did not achieve the team's goals of creating a solution that cost less and was easier to manage than the current on-premises one.

Rather than giving up on creating a cloud-native solution, the team worked from the constraints on the edges of the solution toward the middle.

Solution

Achieving the cost savings, management ease, and high availability for the solution depended upon using S3 to store the website’s contents (#1 in the diagram). If the CMS tools could have published directly to S3, the solution would have been completed by simply adjusting the CMS tools to target their output to S3. However, only one of the two CMS tools could do this. The other one expects to publish its output to a file system that is accessible to the on-premises server where the CMS tool runs. The team solved this problem by launching a t3.small EC2 instance (#2) to sit between the CMS tools and the S3 bucket that would store the website’s production content. Initially, it seemed like using two simple file sync processes could keep the file system of the EC2 instance synchronized with the CMS files. However, when the team first attempted this approach to build a copy of the website on EC2’s file system, they discovered that one of the sync processes would delete the other tool’s output rather than ignoring it when synchronizing updates from its tool to EC2.

To overcome this issue, the team created separate website roots in the EC2 file system into which each CMS would synchronize. Using Unionfs, a Linux utility that combines multiple directories into a single logical directory, a unified root folder for the website (#3) was created that could be easily pushed to S3 using the S3 CLI.

With this much of the solution in place the team had successfully created a new architecture for their website that was nearly as inexpensive as a static website hosted on S3, but that also maintained the tools and processes that their website authors were familiar with.

There was just one more technical issue to address: the IIS site contained internal metadata that redirected its users from virtual directories to the physical content located elsewhere in the website. For example, https://..../law might be redirected to https://..../lawschool/. To achieve similar functionality, the IT team created one HTML file for each of these redirects and added them to a third website root directory in the EC2 instance (#4). These files contain the static HTML needed to redirect the user's browser to the desired endpoint. Blending this directory with the other two through Unionfs creates a single logical copy of the website's contents that can be synchronized out to S3 with an S3 sync CLI command.
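As an illustration of that step, a short script along these lines could generate the redirect pages from a mapping of virtual directories to their destinations. The directory names, file paths, and meta-refresh approach here are hypothetical stand-ins, not the team's actual tooling.

import pathlib

# Hypothetical mapping of virtual directories to their real locations.
REDIRECTS = {
    "law": "/lawschool/",
}
REDIRECT_ROOT = pathlib.Path("redirect-root")  # the third website root on the EC2 instance

TEMPLATE = """<!DOCTYPE html>
<html><head>
<meta http-equiv="refresh" content="0; url={target}">
<link rel="canonical" href="{target}">
</head><body></body></html>
"""

for name, target in REDIRECTS.items():
    out_dir = REDIRECT_ROOT / name
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "index.html").write_text(TEMPLATE.format(target=target))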

A final enhancement to the website was to use an Amazon CloudFront distribution (#5) to cache its contents, providing improved response time for website visitors. The distribution's object caching TTLs are set to the defaults. The publishing process runs every 15 minutes, so to ensure that website visitors receive the latest content, the team wrote an AWS Lambda function (#6), triggered by S3 event notifications, that invalidates the cache each time an object is created in or removed from the S3 bucket.
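A minimal sketch of such a handler is shown below. The distribution ID comes from an environment variable, and the invalidation paths are derived from the S3 object keys in the event; this is illustrative, not the university's actual function (which would also need to handle URL-encoded keys and batching).

import os
import time

import boto3

cloudfront = boto3.client("cloudfront")
DISTRIBUTION_ID = os.environ["DISTRIBUTION_ID"]  # assumed to be configured on the function

def handler(event, context):
    # Collect the object keys from the S3 event notification records.
    paths = ["/" + record["s3"]["object"]["key"] for record in event["Records"]]
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            "CallerReference": str(time.time()),  # must be unique per invalidation request
        },
    )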

Conclusion

The University of Saint Thomas IT team found a creative way to implement a new solution for their university website that reduces the time and effort required to manage servers, achieves operational simplicity and cost savings by using cloud-native services, and yet doesn’t interfere with the web authoring tools and processes their customers were happy with. The mix of server-based and serverless components in their design illustrates how flexible cloud architectures can be and highlights the ingenuity of the team that built it.

Acknowledgements

Thank you to the following people at the University of Saint Thomas:

This solution was architected by Julian Mino, Cloud Architect. The creative use of Unionfs was suggested by William Bear, AVP for Applications and Infrastructure and former Linux administrator. Vicky Vue, Systems Engineer, and Keith Ketchmark, Sr. Systems Engineer, implemented the solution using Terraform, Ansible and Python. Daniel Strojny (Associate Director, Networks & IT Operations) helped resolve some internal DNS issues the team encountered.

Optimizing deep learning on P3 and P3dn with EFA

Post Syndicated from whiteemm original https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa/

This post is written by Rashika Kheria, Software Engineer, Purna Sanyal, Senior Solutions Architect, Strategic Account and James Jeun, Sr. Product Manager

The Amazon EC2 P3dn.24xlarge instance is the latest addition to the Amazon EC2 P3 instance family, with upgrades to several components. This high-end size of the P3 family allows users to scale out to multiple nodes for distributed workloads more efficiently.  With these improvements to the instance, you can complete training jobs in a shorter amount of time and iterate on your Machine Learning (ML) models faster.

 

This blog reviews the significant upgrades with p3dn.24xlarge, walks you through deployment, and shows an example ML use case for these upgrades.

 

Overview of P3dn instance upgrades

The most notable upgrade to the p3dn.24xlarge instance is the 100-Gbps network bandwidth and the new EFA network interface that allows for highly scalable internode communication. This means you can scale applications to thousands of GPUs, which reduces time to results. EFA's operating system bypass networking mechanism and the underlying Scalable Reliable Protocol, built into the Nitro controllers, enable a low-latency, low-jitter channel for inter-instance communication. EFA support has been added to the mainline Linux kernel and integrated with libfabric and various distributions. AWS worked with NVIDIA for EFA to support the NVIDIA Collective Communications Library (NCCL). NCCL optimizes multi-GPU and multi-node communication primitives and helps achieve high throughput over NVLink interconnects.

 

The following diagram shows the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.


 

The following table summarizes the full set of differences between p3.16xlarge and p3dn.24xlarge.

Feature          | p3.16xl                    | p3dn.24xl
Processor        | Intel Xeon E5-2686 v4      | Intel Skylake 8175 (w/ AVX-512)
vCPUs            | 64                         | 96
GPU              | 8x 16 GB NVIDIA Tesla V100 | 8x 32 GB NVIDIA Tesla V100
RAM              | 488 GB                     | 768 GB
Network          | 25 Gbps ENA                | 100 Gbps ENA + EFA
GPU Interconnect | NVLink – 300 GB/s          | NVLink – 300 GB/s

 

P3dn.24xl offers more networking bandwidth than p3.16xl. Paired with EFA’s communication library, this feature increases scaling efficiencies drastically for large-scale, distributed training jobs. Other improvements include double the GPU memory for large datasets and batch sizes, increased system memory, and more vCPUs. This upgraded instance is the most performant GPU compute option on AWS.

 

The upgrades also improve your workload around distributed deep learning. The GPU memory improvement enables higher intranode batch sizes. The newer Layer-wise Adaptive Rate Scaling (LARS) has been tested with ResNet50 and other deep neural networks (DNNs) to allow for larger batch sizes. The increased batch sizes reduce wall-clock time per epoch with minimal loss of accuracy. Additionally, using 100-Gbps networking with EFA heightens performance with scale. Greater networking performance is beneficial when updating weights for a large number of parameters. You can see high scaling efficiency when running distributed training on GPUs for ResNet50 type models that primarily use images for object recognition. For more information, see Scalable multi-node deep learning training using GPUs in the AWS Cloud.

 

Natural language processing (NLP) also presents large compute requirements for model training. This large compute requirement is especially present with the arrival of large Transformer-based models like BERT and GPT-2, which have up to a billion parameters. The following describes how to set up distributed model trainings with scalability for both image and language-based models, and also notes how the AWS P3 and P3dn instances perform.

 

Optimizing your P3 family

First, optimize your P3 instances with an important environment update. This update applies to traditional TCP-based networking and is included in the latest release of NCCL, 2.4.8 as of this writing.

 

Two new environment variables are available, which allow you to take advantage of multiple TCP sockets per thread: NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD.

 

These environment variables allow the NCCL backend to exceed the 10-Gbps single-stream TCP bandwidth limitation in EC2.

 

Enter the following command:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_NSOCKS_PERTHREAD=4 -x NCCL_SOCKET_NTHREADS=4 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

 

The following graph shows the synthetic NCCL tests and their increased performance with the additional directives.


You can achieve a two-fold increase in throughput after a threshold in the synthetic payload size (around 1 MB).

 

 

Deploying P3dn

 

The following steps walk you through spinning up a cluster of p3dn.24xlarge instances in a cluster placement group. This allows you to take advantage of all the new performance features within the P3 instance family. For more information, see Cluster Placement Groups in the Amazon EC2 User Guide.

This post deploys the following stack:

 

  1. On the Amazon EC2 console, create a security group.

 Make sure that both inbound and outbound traffic are open on all ports and protocols within the security group.

 

  2. Modify the user variables in the packer build script so that they are compatible with your environment.

The following is the modification code for your variables:

 

{
  "variables": {
    "Region": "us-west-2",
    "flag": "compute",
    "subnet_id": "<subnet-id>",
    "sg_id": "<security_group>",
    "build_ami": "ami-0e434a58221275ed4",
    "iam_role": "<iam_role>",
    "ssh_key_name": "<keyname>",
    "key_path": "/path/to/key.pem"
  },

3. Build and launch the AMI by running the following packer command:

packer build nvidia-efa-fsx-al2.yml

This entire workflow takes care of setting up EFA, compiling NCCL, and installing the toolchain. After the build completes, you have an AMI ID that you can launch in the EC2 console. Make sure to enable EFA when launching.

  4. Launch a second instance in the cluster placement group so you can run two-node tests.
  5. Enter the following command to make sure that all components are built correctly:

/opt/nccl-tests/build/all_reduce_perf 

  6. The following output of the command confirms that the build is using EFA:

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

INFO: Function: main Line: 49: NET/OFI Process rank 8 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.

INFO: Function: main Line: 53: NET/OFI Received 1 network devices

INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

 

Synthetic two-node performance

This blog includes the NCCL-tests GitHub as part of the deployment stack. This shows synthetic benchmarking of the communication layer over NCCL and the EFA network.

When launching the two-node cluster, complete the following steps:

  1. Place the instances in the cluster placement group.
  2. SSH into one of the nodes.
  3. Fill out the hosts file.
  4. Run the two-node test with the following code:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x FI_PROVIDER="efa" -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

This test verifies that inter-node performance works the way it is supposed to.

The following graph compares the NCCL bandwidth performance using -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp". There is a three-fold increase in bus bandwidth when using EFA.

 


Now that you have run the two-node tests, you can move on to a deep learning use case.

FAIRSEQ ML training on a P3dn cluster

Fairseq(-py) is a sequence modeling toolkit that allows you to train custom models for translation, summarization, language modeling, and other text-generation tasks. FAIRSEQ MACHINE TRANSLATION distributed training requires a fast network to support the Allreduce algorithm. Fairseq provides reference implementations of various sequence-to-sequence models, including convolutional neural networks (CNN), long short-term memory (LSTM) networks, and transformer (self-attention) networks.

 

After you receive consistent 10 GB/s bus-bandwidth on the new P3dn instance, you are ready for FAIRSEQ distributed training.

To install fairseq from source and develop locally, complete the following steps:

  1. Copy the FAIRSEQ source code to one of the P3dn instances.
  2. Copy the FAIRSEQ training data into the data folder.
  3. Copy the FAIRSEQ test data into the data folder.

 

git clone https://github.com/pytorch/fairseq

cd fairseq

pip install --editable .

Now that you have FAIRSEQ installed, you can run the training model. Complete the following steps:

  1. Run FAIRSEQ training on a single-node (8 GPU) p3dn instance to check the performance and accuracy of the FAIRSEQ operations.
  2. Create a custom AMI.
  3. Build the other 31 instances from the custom AMI.

 

Use the following script for distributed all-reduce FAIRSEQ training:

 

export RANK=$1 # the rank of this process, from 0 to 127 in case of 128 GPUs
export LOCAL_RANK=$2 # the local rank of this process, from 0 to 7 in case of 8 GPUs per mac
export NCCL_DEBUG=INFO
export NCCL_TREE_THRESHOLD=0;
export FI_PROVIDER="efa";

export FI_EFA_TX_MIN_CREDITS=64;
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64/:/home/ec2-user/aws-ofi-nccl/install/lib/:/home/ec2-user/nccl/build/lib:$LD_LIBRARY_PATH;
echo $FI_PROVIDER
echo $LD_LIBRARY_PATH
python train.py data-bin/wmt18_en_de_bpej32k \
   --clip-norm 0.0 -a transformer_vaswani_wmt_en_de_big \
   --lr 0.0005 --source-lang en --target-lang de \
   --label-smoothing 0.1 --upsample-primary 16 \
   --attention-dropout 0.1 --dropout 0.3 --max-tokens 3584 \
   --log-interval 100  --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy --fp16 \
   --max-update 500000 --seed 3 --save-interval-updates 16000 \
   --share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' \
   --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
   --warmup-updates 4000 --min-lr 1e-09 \
   --distributed-port 12597 --distributed-world-size 32 \
   --distributed-init-method 'tcp://172.31.43.34:9218' --distributed-rank $RANK \
   --device-id $LOCAL_RANK \
   --max-epoch 3 \
   --no-progress-bar  --no-save

Now that you have completed and validated your base infrastructure layer, you can add additional components to the stack for various workflows. The following charts show time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training.


 

Conclusion

EFA on p3dn.24xlarge allows you to take advantage of additional performance at scale with no change in code. With this updated infrastructure, you can decrease cost and time to results by using more GPUs to scale out and get more done on complex workloads like natural language processing. This blog provides much of the undifferentiated heavy lifting with the DLAMI integrated with EFA. Go power up your ML workloads with EFA!

 

Optimizing for cost, availability and throughput by selecting your AWS Batch allocation strategy

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/optimizing-for-cost-availability-and-throughput-by-selecting-your-aws-batch-allocation-strategy/

This post is contributed by Steve Kendrex, Senior Technical Product Manager, AWS Batch

 

Introduction

 

AWS offers a broad range of instances that are advantageous for batch workloads. The scale and provisioning speed of AWS compute instances allow you to get up and running at peak capacity in minutes without paying for downtime. Today, I'm pleased to introduce allocation strategies: a significant new capability in AWS Batch that makes provisioning compute resources flexible and simple. In this blog post, I explain how the AWS Batch allocation strategies work and when you should use them for your workload, and provide an example CloudFormation script. This blog helps you get started building the Compute Environment (CE) most appropriate to your workloads.

Overview

AWS Batch is a fully managed, cloud-native batch scheduler. It manages the queuing and scheduling of your batch jobs, and the resources required to run your jobs. One of AWS Batch's great strengths is the ability to manage instance provisioning as your workload requirements and budget needs change. AWS Batch takes advantage of AWS's broad base of compute types. For example, you can launch compute-optimized and memory-optimized instances to handle different workload types, without having to worry about building a cluster to meet peak demand.

Previously, AWS Batch had a cost-controlling approach to manage compute instances for your workloads. The service chose an instance that was the best fit for your jobs based on vCPU, memory, and GPU requirements, at the lowest cost. Now, the newly added allocation strategies provide flexibility. They allow AWS Batch to consider capacity and throughput in addition to cost when provisioning your instances. This allows you to leverage different priorities when launching instances depending on your workloads’ needs, such as: controlling cost, maximizing throughput, or minimizing Amazon EC2 Spot instances interruption rates.

There are now three instance allocation strategies from which to choose when creating an AWS Batch Compute Environment (CE). They are:

1. Spot Capacity Optimized

2. Best Fit Progressive

3. Best Fit

 

Spot Capacity Optimized

As the name implies, the Spot capacity optimized strategy is only available when launching Spot CEs in AWS Batch. In fact, I recommend the Spot capacity optimized strategy for most of your interruptible workloads running on Spot today. This strategy takes advantage of the recently released EC2 Auto Scaling and EC2 Fleet capacity optimized strategy. Next, I examine how this strategy behaves in AWS Batch.

Let's say you're running a simulation workload in AWS Batch. Your workload is Spot-appropriate (see this whitepaper to determine whether yours is), so you want to take advantage of the savings you can glean from using Spot. However, you also want to minimize your Spot interruption rate, so you've followed the Spot best practices. Your instances can run across multiple instance types and multiple Availability Zones. When creating your Spot CE in AWS Batch, input all the instance types with which your workload is compatible in the instance field, or select 'optimal', which allows AWS Batch to choose from among the M, C, and R instance families. The image below shows how this appears in the console:

AWS Batch console with SPOT_CAPACITY_OPTIMIZED selected

When evaluating your workload, AWS Batch selects from the allowed instance types specified in the compute resources parameter of your Spot CE that are capable of running your jobs. From those capable instances, it calculates the assortment of instance types with the deepest Spot capacity pools. AWS Batch then launches those instances on your behalf and runs your jobs when the instances are available. This strategy gives you access to AWS compute resources at a fraction of the On-Demand cost. It works whether you're trying to launch hundreds of thousands (or a million!) of vCPUs on Spot or simply trying to lower your chance of interruption. Additionally, AWS Batch manages your instance pool over time to meet the capacity needed to run your workload.

For example, as your workloads run, demand in an Availability Zone may shift. This might lead to several of your instances being reclaimed. In that event, AWS Batch automatically attempts to scale a different instance type based on the deepest capacity pools. Assuming you set a retry attempt count, your jobs then automatically retry. Then, AWS Batch scales new instances until either it meets the desired capacity, or it runs out of instance types to launch based on those provided.  That is why I recommend that you give AWS Batch as many instance types and families as possible to choose from when running Spot capacity optimized. Additional detail on behavior can be found in the capacity optimized documentation.

To launch a Spot capacity optimized CE, follow these steps:

1. Navigate to the AWS Batch console.

2. Create a new Compute Environment.

3. Select "Spot Capacity Optimized" in the Allocation Strategy field.

Alternatively, you can use the CreateComputeEnvironment API and pass SPOT_CAPACITY_OPTIMIZED in the allocation strategy field. The CloudFormation resource should look like the following:

…
"TestAllocationStrategyCE": {
  "Type": "AWS::Batch::ComputeEnvironment",
  "Properties": {
    "State": "ENABLED",
    "Type": "MANAGED",
    "ComputeResources": {
      "Subnets": [
        {"Ref": "TestSubnet"}
      ],
      "InstanceRole": {
        "Ref": "TestIamInstanceProfile"
      },
      "MinvCpus": 0,
      "InstanceTypes": [
        "optimal"
      ],
      "SecurityGroupIds": [
        {"Ref": "TestSecurityGroup"}
      ],
      "DesiredvCpus": 0,
      "MaxvCpus": 12,
      "AllocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
      "Type": "SPOT"
    },
    "ServiceRole": {
      "Ref": "TestAWSBatchServiceRole"
    }
  }
},
…

Once you follow these steps your Spot capacity optimized CE should be up and running.

 

Best Fit Progressive

Imagine you have a time-sensitive machine learning workload that is very compute intensive. You want to run this workload on C5 instances because you know that those have a preferable vCPU/memory ratio for your jobs. In a pinch, however, you know that M5 instances can run your workload perfectly well. You’re happy to take advantage of Spot prices. However, you also have a base level of throughput you need so you have to run part of the workload on On-Demand instances.  In this case, I recommend the best fit progressive strategy. This strategy is available in both On-Demand and Spot CEs, and I recommend it for most On-Demand workloads. The best fit progressive strategy allows you to let AWS Batch choose the best fit instance for your workload (based on your jobs’ vCPU and memory requirements). In this context, “best fit” means AWS Batch provisions the least number of instances capable of running your jobs at the lowest cost.

Sometimes, AWS Batch cannot provision enough of the best fit instances to meet your capacity. When this is the case, AWS Batch progressively looks for the next best fit instance type from what you specified in the 'compute resources' parameter. Generally, AWS Batch attempts to spin up different instance sizes within the same family first, because it has already determined that the family's vCPU and memory ratio fits your workload. If it still cannot find enough instances that can run your jobs to meet your capacity, AWS Batch launches instances from a different family. These attempts continue until capacity is met, or until it runs out of available instances from which to select.

To create a best fit progressive CE, follow the steps detailed in the Spot capacity optimized strategy section. However, specify the strategy BEST_FIT_PROGRESSIVE when creating a CE, for example:


…{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT_PROGRESSIVE",
"Type": "EC2" },
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important note: you can always restrict AWS Batch’s ability to launch instances by using the max vCPU setting in your CE. AWS Batch may go above Max vCPU to meet your capacity requirements for best fit progressive and Spot capacity optimized strategies. In this event, AWS Batch will never go above Max vCPU by more than a single instance (for example, no more than a single instance from among those specified in your CE compute resources parameter).

 

How to Combine Strategies

You can combine strategies using separate AWS Batch Compute Environments. Let’s take the case I mentioned earlier: you’re happy to take advantage of Spot prices, but you want a base level of throughput for your time-sensitive workloads.

This diagram shows an On-Demand CE with a secondary Spot CE, attached to the same queue.

 

In this case, you can create two AWS Batch CEs:

1. Create an On-Demand CE that uses the best fit progressive strategy.

2. Set the max vCPU at the level of throughput that is necessary for your workload.

3. Create a Spot CE using the Spot capacity optimized strategy.

4. Attach both CEs to your job queue, with the On-Demand CE higher in order. Once you start submitting jobs to your queue, AWS Batch spins up your On-Demand CE first and starts placing jobs.

Once the On-Demand CE reaches its max vCPU limit, AWS Batch spins up instances in the next CE. In this case, the next CE is your Spot CE, and AWS Batch places any additional jobs on it. AWS Batch continues to place jobs on both CEs until the queue is empty.
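As a sketch of step 4, attaching the two compute environments to one queue could look like this with boto3 (the queue name and CE names here are placeholders):

import boto3

batch = boto3.client("batch")

# On-Demand CE first (order 1), Spot CE second (order 2): AWS Batch fills the
# On-Demand environment up to its max vCPUs before spilling over to the Spot CE.
batch.create_job_queue(
    jobQueueName="hpc-batch-queue",
    state="ENABLED",
    priority=10,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "OnDemandBestFitProgressiveCE"},
        {"order": 2, "computeEnvironment": "SpotCapacityOptimizedCE"},
    ],
)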

Please see this repository for sample CloudFormation code to replicate this environment. Or, click here for more examples of leveraging Spot with AWS Batch.

 

Best Fit

Imagine you have a well-defined genomics sequencing workload. You know that this workload performs best on M5 instances, and you run it On-Demand because it is not interruptible. You've run this workload on AWS Batch before and you're happy with its current behavior. You're willing to trade off occasional capacity constraints in return for strict cost control. In this case, the best fit strategy may be a good option. This strategy used to be AWS Batch's only behavior. It examines the queue and picks the best fit instance type and size for the workload. As described earlier, best fit to AWS Batch means the least number of instances capable of running the workload, at the lowest cost. In general, we recommend the best fit strategy only when you want the lowest cost for your instances and are willing to trade throughput and availability to get it.

Note: AWS Batch will not launch instances above Max vCPU while using the best fit strategy. To create a best fit CE, use a configuration similar to the following:

…{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT",
"Type": "EC2" },
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important Note for AWS Batch Allocation Strategies with Spot Instances:

You always have the option to set a percentage of the On-Demand price when creating a Spot CE. When you do, AWS Batch only launches instances whose Spot price is below that percentage of the On-Demand price. In general, setting a percentage of the On-Demand price lowers your availability, and should only be used if you want strict cost controls. If you want to enjoy the cost savings of Spot with better availability, I recommend that you do not set a percentage of the On-Demand price.

Conclusion

With these new allocation strategies, you now have much greater flexibility to control how AWS Batch provisions your instances. This allows you to make better throughput and cost trade-offs depending on the sensitivity of your workload. To learn more about how these strategies behave, please visit the AWS Batch documentation. Feel free to experiment with AWS Batch on your own to get an idea of how they help you run your specific workload.

 

Thanks to Chad Scmutzer for his support on the CloudFormation template

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/leveraging-efa-to-run-hpc-and-ml-workloads-on-aws-batch/

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

 This post is contributed by  Sean Smith, Software Development Engineer II, AWS ParallelCluster & Arya Hezarkhani, Software Development Engineer II, AWS Batch and HPC

 

On August 2, 2019, AWS Batch announced support for Elastic Fabric Adapter (EFA). This enables you to run highly performant, distributed high performance computing (HPC) and machine learning (ML) workloads by using AWS Batch’s managed resource provisioning and job scheduling.

EFA is a network interface for Amazon EC2 instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, HPC applications using the Message Passing Interface (MPI) and ML applications using the NVIDIA Collective Communications Library (NCCL) can scale to thousands of cores or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud.

AWS Batch is a cloud-native batch scheduler that manages instance provisioning and job scheduling. AWS Batch automatically provisions instances according to job specifications, with the appropriate placement group, networking configurations, and any user-specified file system. It also automatically sets up the EFA interconnect to the instances it launches, which you specify through a single launch template parameter.

In this post, we walk through the setup of EFA on AWS Batch and run the NAS Parallel Benchmark (NPB), a benchmark suite that evaluates the performance of parallel supercomputers, using the open source implementation of MPI, OpenMPI.

 

Prerequisites

This walk-through assumes:

 

Configuring your compute environment

First, configure your compute environment to launch instances with the EFA device.

Creating an EC2 placement group

The first step is to create a cluster placement group. This is a logical grouping of instances within a single Availability Zone. The chief benefit of a cluster placement group is non-blocking, non-oversubscribed, fully bi-sectional network connectivity. Use a Region that supports EFA—currently, that is us-east-1, us-east-2, us-west-2, and eu-west-1. Run the following command:

$ aws ec2 create-placement-group --group-name "efa" --strategy "cluster" --region [your-region]

Creating an EC2 launch template

Next, create a launch template that contains a user-data script to install EFA libraries onto the instance. Launch templates enable you to store launch parameters so that you do not have to specify them every time you launch an instance. This will be the launch template used by AWS Batch to scale the necessary compute resources in your AWS Batch Compute Environment.

First, base64-encode the user data. This example uses the base64 CLI utility to do so.

 

$ echo "MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=\"==MYBOUNDARY==\"

--==MYBOUNDARY==
Content-Type: text/cloud-boothook; charset=\"us-ascii\"

cloud-init-per once yum_wget yum install -y wget

cloud-init-per once wget_efa wget -q --timeout=20 https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-latest.tar.gz -O /tmp/aws-efa-installer-latest.tar.gz

cloud-init-per once tar_efa tar -xf /tmp/aws-efa-installer-latest.tar.gz -C /tmp

pushd /tmp/aws-efa-installer
cloud-init-per once install_efa ./efa_installer.sh -y
pop /tmp/aws-efa-installer

cloud-init-per once efa_info /opt/amazon/efa/bin/fi_info -p efa

--==MYBOUNDARY==--" | base64

 

Save the base64-encoded output, because you need it to create the launch template.
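If you prefer to do the encoding in Python rather than with the base64 CLI utility, the following produces the same value (assuming the MIME user data shown above has been saved to a file named user-data.txt):

import base64

with open("user-data.txt", "rb") as f:                 # the MIME multipart user data from the previous step
    encoded = base64.b64encode(f.read()).decode("ascii")
print(encoded)                                          # paste this into the launch template's UserData field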

 

Next, make sure that your default security group is configured correctly. On the EC2 console, select the default security group associated with your default VPC and edit the inbound rules to allow SSH and All traffic to itself. This must be set explicitly to the security group ID for EFA to work, as seen in the following screenshot.

SecurityGroupInboundRules

 

Then edit the outbound rules and add a rule that allows all outbound traffic to the security group itself, as seen in the following screenshot. This is a requirement for EFA to work.

SecurityGroupOutboundRules

 

Now, create an ecsInstanceRole, the Amazon ECS instance profile that will be applied to Amazon EC2 instances in a Compute Environment. To create a role, follow these steps.

  1. Choose Roles, then Create Role.
  2. Select AWS Service, then EC2.
  3. Choose Permissions.
  4. Attach the managed policy AmazonEC2ContainerServiceforEC2Role.
  5. Name the role ecsInstanceRole.

 

You will create the launch template using the ID of the security group, the ID of a subnet in your default VPC, and the ecsInstanceRole that you created.

Next, choose an instance type that supports EFA, that’s denoted by the n in the instance name. This example uses c5n.18xlarge instances.

You also need an Amazon Machine Image (AMI) ID. This example uses the latest ECS-optimized AMI based on Amazon Linux 2. Grab the AMI ID that corresponds to the Region that you are using.

This example uses UserData to install EFA. This adds 1.5 minutes of bootstrap time to the instance launch. In production workloads, bake the EFA installation into the AMI to avoid this additional bootstrap delay.

Now create a file called launch_template.json with the following content, making sure to substitute the account ID, security group, subnet ID, AMI ID, and key name.

{
  "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
  "LaunchTemplateData": {
    "InstanceType": "c5n.18xlarge",
    "IamInstanceProfile": {
      "Arn": "arn:aws:iam::<Account Id>:instance-profile/ecsInstanceRole"
    },
    "NetworkInterfaces": [
      {
        "DeviceIndex": 0,
        "Groups": [
          "<Security Group>"
        ],
        "SubnetId": "<Subnet Id>",
        "InterfaceType": "efa",
        "Description": "NetworkInterfaces Configuration For EFA and Batch"
      }
    ],
    "Placement": {
      "GroupName": "efa"
    },
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "from-lt",
            "Value": "networkInterfacesConfig-EFA-Batch"
          }
        ]
      }
    ],
    "UserData": "TUlNRS1WZXJzaW9uOiAxLjAKQ29udGVudC1UeXBlOiBtdWx0aXBhcnQvbWl4ZWQ7IGJvdW5kYXJ5PSI9PU1ZQk9VTkRBUlk9PSIKCi0tPT1NWUJPVU5EQVJZPT0KQ29udGVudC1UeXBlOiB0ZXh0L2Nsb3VkLWJvb3Rob29rOyBjaGFyc2V0PSJ1cy1hc2NpaSIKCmNsb3VkLWluaXQtcGVyIG9uY2UgeXVtX3dnZXQgeXVtIGluc3RhbGwgLXkgd2dldAoKY2xvdWQtaW5pdC1wZXIgb25jZSB3Z2V0X2VmYSB3Z2V0IC1xIC0tdGltZW91dD0yMCBodHRwczovL3MzLXVzLXdlc3QtMi5hbWF6b25hd3MuY29tL2F3cy1lZmEtaW5zdGFsbGVyL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLU8gL3RtcC9hd3MtZWZhLWluc3RhbGxlci1sYXRlc3QudGFyLmd6CgpjbG91ZC1pbml0LXBlciBvbmNlIHRhcl9lZmEgdGFyIC14ZiAvdG1wL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLUMgL3RtcAoKcHVzaGQgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgpjbG91ZC1pbml0LXBlciBvbmNlIGluc3RhbGxfZWZhIC4vZWZhX2luc3RhbGxlci5zaCAteQpwb3AgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgoKY2xvdWQtaW5pdC1wZXIgb25jZSBlZmFfaW5mbyAvb3B0L2FtYXpvbi9lZmEvYmluL2ZpX2luZm8gLXAgZWZhCgotLT09TVlCT1VOREFSWT09LS0K"
  }
}

Create a launch template from that file:

 

$ aws ec2 create-launch-template --cli-input-json file://launch_template.json

{
    "LaunchTemplate": {
        "LatestVersionNumber": 1,
        "LaunchTemplateId": "lt-*****************",
        "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
        "DefaultVersionNumber": 1,
        "CreatedBy": "arn:aws:iam::************:user/desktop-user",
        "CreateTime": "2019-09-23T13:00:21.000Z"
    }
}

 

Creating a compute environment

Next, create an AWS Batch Compute Environment. This uses the information from the launch template EFA-Batch-LaunchTemplate created earlier.

 

{
  "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 2088,
    "desiredvCpus": 0,
    "instanceTypes": [
      "c5n.18xlarge"
    ],
    "subnets": [
      "<same-subnet-as-in-LaunchTemplate>"
    ],
    "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
    "launchTemplate": {
      "launchTemplateName": "EFA-Batch-LaunchTemplate",
      "version": "$Latest"
    }
  },
  "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole"
}

 

Now, create the compute environment:

$ aws batch create-compute-environment --cli-input-json file://compute_environment.json

{
    "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
    "computeEnvironmentArn": "arn:aws:batch:us-east-1:<Account Id>:compute-environment"
}

 

Building the container image

To build the container, clone the repository that contains the Dockerfile used in this example.

First, clone the repository:

$ git clone https://github.com/aws-samples/aws-batch-efa.git

 

In that repository, there are several files, one of which is the following Dockerfile.

 

FROM amazonlinux:1
ENV USER efauser

RUN yum update -y
RUN yum install -y which util-linux make tar.x86_64 iproute2 gcc-gfortran openssh-server
RUN pip-2.7 install supervisor

RUN useradd -ms /bin/bash $USER
ENV HOME /home/$USER

#####################################################
## SSH SETUP
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
 && touch ${SSHDIR}/sshd_config \
 && ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
 && cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
 && cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
 && echo "  IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
 && echo "  StrictHostKeyChecking no" >> ${SSHDIR}/config \
 && echo "  UserKnownHostsFile /dev/null" >> ${SSHDIR}/config \
 && echo "  Port 2022" >> ${SSHDIR}/config \
 && echo 'Port 2022' >> ${SSHDIR}/sshd_config \
 && echo 'UsePrivilegeSeparation no' >> ${SSHDIR}/sshd_config \
 && echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
 && echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
 && chmod -R 600 ${SSHDIR}/* \
 && chown -R ${USER}:${USER} ${SSHDIR}/

# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

#################################################
## EFA and MPI SETUP
RUN curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-1.5.0.tar.gz \
 && tar -xf aws-efa-installer-1.5.0.tar.gz \
 && cd aws-efa-installer \
 && ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify

RUN wget https://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz \
 && tar xzf NPB3.3.1.tar.gz
COPY make.def_efa /NPB3.3.1/NPB3.3-MPI/config/make.def
COPY suite.def /NPB3.3.1/NPB3.3-MPI/config/suite.def

RUN cd /NPB3.3.1/NPB3.3-MPI \
 && make suite \
 && chmod -R 755 /NPB3.3.1/NPB3.3-MPI/

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 2022
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN chmod 755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

 

To build this Dockerfile, run the included Makefile with:

make

Now, push the created container image to Amazon Elastic Container Registry (ECR), so you can use it in your AWS Batch JobDefinition:

From the AWS CLI, create an ECR repository, we’ll call it aws-batch-efa:

$ aws ecr create-repository --repository-name aws-batch-efa

{
    "repository": {
        "registryId": "<Account-Id>",
        "repositoryName": "aws-batch-efa",
        "repositoryArn": "arn:aws:ecr:us-east-2:<Account-Id>:repository/aws-batch-efa",
        "createdAt": 1568154893.0,
        "repositoryUri": "<Account-Id>.dkr.ecr.us-east-2.amazonaws.com/aws-batch-efa"
    }
}

Edit the top of the makefile and add your AWS account number and AWS Region.

AWS_REGION=<REGION>
ACCOUNT_ID=<ACCOUNT-ID>

To push the image to the ECR repository, run:

make tag
make push

 

Run the application

To run the application using AWS Batch multi-node parallel jobs, follow these steps.

 

Setting up the AWS Batch multi-node job definition

Set up the AWS Batch multi-node job definition and expose the EFA device to the container by following these steps.

 

First, create a file called job_definition.json with the following contents. This file holds the configurations for the AWS Batch JobDefinition. Specifically, this JobDefinition uses the newly supported field LinuxParameters.Devices to expose a particular device—in this case, the EFA device path /dev/infiniband/uverbs0—to the container. Be sure to substitute the image URI with the one you pushed to ECR in the previous step. This is used to start the container.

 

{
  "jobDefinitionName": "EFA-MPI-JobDefinition",
  "type": "multinode",
  "nodeProperties": {
    "numNodes": 8,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes": "0:",
        "container": {
          "user": "efauser",
          "image": "<Docker Image From Previous Section>",
          "vcpus": 72,
          "memory": 184320,
          "linuxParameters": {
            "devices": [
              {
                "hostPath": "/dev/infiniband/uverbs0"
              }
            ]
          },
          "ulimits": [
            {
              "hardLimit": -1,
              "name": "memlock",
              "softLimit": -1
            }
          ]
        }
      }
    ]
  }
}

 

$ aws batch register-job-definition --cli-input-json file://job_definition.json

{
    "jobDefinitionArn": "arn:aws:batch:us-east-1:<account-id>:job-definition/EFA-MPI-JobDefinition",
    "jobDefinitionName": "EFA-MPI-JobDefinition",
    "revision": 1
}

 

Run the job

Next, create a job queue. This job queue points at the compute environment created before. When jobs are submitted to it, they queue until instances are available to run them.

 

{
  "jobQueueName": "EFA-Batch-JobQueue",
  "state": "ENABLED",
  "priority": 10,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "EFA-Batch-ComputeEnvironment"
    }
  ]
}

 

aws batch create-job-queue --cli-input-json file://job_queue.json

Now that you’ve created all the resources, submit the job. The numNodes=8 parameter tells the job definition to use eight nodes.

aws batch submit-job --job-name example-mpi-job --job-queue EFA-Batch-JobQueue --job-definition EFA-MPI-JobDefinition --node-overrides numNodes=8
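
After submission, you can follow the job (and the child jobs AWS Batch creates for each node) from the CLI; the job ID below is a placeholder for the one returned by submit-job:

# List jobs in the queue by status
aws batch list-jobs --job-queue EFA-Batch-JobQueue --job-status RUNNING

# Inspect a specific job in detail
aws batch describe-jobs --jobs <job-id>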

 

NPB overview

NPB is a small set of benchmarks derived from computational fluid dynamics (CFD) applications. They consist of five kernels and three pseudo-applications. This example runs the 3D Fast Fourier Transform (FFT) benchmark, as it tests all-to-all communication. For this run, use c5n.18xlarge instances, as configured in the AWS Batch compute environment earlier. This instance type is an excellent choice for the workload: it has an Intel Skylake processor (72 hyperthreaded cores) and 100 Gbps of network bandwidth, which you can take advantage of with EFA.

 

This test runs the FT "C" benchmark across eight nodes * 72 vCPUs = 576 vCPUs.
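
Inside the container, the supervised mpi-run.sh script is responsible for launching the benchmark across the nodes. The exact command depends on that script, but a minimal sketch using Open MPI with libfabric's EFA provider, a hostfile listing the participating nodes, and the NPB binary built earlier could look like this:

# Illustrative only: 512 ranks, since FT class C requires a power-of-two process count
mpirun -n 512 --hostfile /tmp/hostfile \
    -x FI_PROVIDER=efa \
    /NPB3.3.1/NPB3.3-MPI/bin/ft.C.512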

 

NAS Parallel Benchmarks 3.3 -- FT Benchmark

No input file inputft.data. Using compiled defaults
Size                : 512x 512x 512
Iterations          : 20
Number of processes : 512
Processor array     : 1x 512
Layout type         : 1D
Initialization time = 1.3533580760000063

T = 1 Checksum = 5.195078707457D+02 5.149019699238D+02

T = 2 Checksum = 5.155422171134D+02 5.127578201997D+02
T = 3 Checksum = 5.144678022222D+02 5.122251847514D+02
T = 4 Checksum = 5.140150594328D+02 5.121090289018D+02
T = 5 Checksum = 5.137550426810D+02 5.121143685824D+02
T = 6 Checksum = 5.135811056728D+02 5.121496764568D+02
T = 7 Checksum = 5.134569343165D+02 5.121870921893D+02
T = 8 Checksum = 5.133651975661D+02 5.122193250322D+02
T = 9 Checksum = 5.132955192805D+02 5.122454735794D+02
T = 10 Checksum = 5.132410471738D+02 5.122663649603D+02
T = 11 Checksum = 5.131971141679D+02 5.122830879827D+02
T = 12 Checksum = 5.131605205716D+02 5.122965869718D+02
T = 13 Checksum = 5.131290734194D+02 5.123075927445D+02
T = 14 Checksum = 5.131012720314D+02 5.123166486553D+02
T = 15 Checksum = 5.130760908195D+02 5.123241541685D+02
T = 16 Checksum = 5.130528295923D+02 5.123304037599D+02
T = 17 Checksum = 5.130310107773D+02 5.123356167976D+02
T = 18 Checksum = 5.130103090133D+02 5.123399592211D+02
T = 19 Checksum = 5.129905029333D+02 5.123435588985D+02
T = 20 Checksum = 5.129714421109D+02 5.123465164008D+02

 

Result verification successful, class = C

FT Benchmark Completed.
Class           = C
Size            = 512x 512x 512
Iterations      = 20
Time in seconds = 1.92
Total processes = 512
Compiled procs  = 512
Mop/s total     = 206949.17
Mop/s/process   = 404.20
Operation type  = floating point
Verification    = SUCCESSFUL

 

Summary

In this post, we covered how to run MPI Batch jobs with an EFA-enabled elastic network interface using AWS Batch multi-node parallel jobs and an EC2 launch template. We used a launch template to configure the AWS Batch compute environment to launch an instance with the EFA device installed. We showed you how to expose the EFA device to the container. You also learned how to package an MPI benchmarking application, the NPB, as a Docker container, and how to run the application as an AWS Batch multi-node parallel job.

We hope you found the information in this post helpful and that it encourages you to explore the possibilities for HPC on AWS.

Now Available: Bare Metal Arm-Based EC2 Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-bare-metal-arm-based-ec2-instances/

At AWS re:Invent 2018, we announced a new line of Amazon Elastic Compute Cloud (EC2) instances: the A1 family, powered by Arm-based AWS Graviton processors. This family is a great fit for scale-out workloads such as web front-ends, containerized microservices, or caching fleets. By expanding the choice of compute options, A1 instances help customers use the right instances for the right applications, and deliver up to 45% cost savings. In addition, A1 instances enable Arm developers to build and test natively on Arm-based infrastructure in the cloud: no more cross-compilation or emulation required.

Today, we are happy to expand the A1 family with a bare metal option.

Bare Metal for A1

Instance Name    Logical Processors    Memory    EBS-Optimized Bandwidth    Network Bandwidth
a1.metal         16                    32 GiB    3.5 Gbps                   Up to 10 Gbps

Just like for existing bare metal instances (M5, M5d, R5, R5d, z1d, and so forth), your operating system runs directly on the underlying hardware with direct access to the processor.

As described in a previous blog post, you can leverage bare metal instances for applications that:

  • need access to physical resources and low-level hardware features, such as performance counters, that are not always available or fully supported in virtualized environments (see the short example after this list),
  • are intended to run directly on the hardware, or are licensed and supported for use in non-virtualized environments.
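
For instance, on a bare metal instance you can read the CPU's hardware performance counters directly with standard Linux tooling; a quick, illustrative check (perf ships in the linux-tools packages on most distributions) looks like this:

# Count hardware events for a short workload using the performance monitoring unit
sudo perf stat -e cycles,instructions,cache-misses sleep 1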

Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

Working with A1 Instances
Bare metal or not, it’s never been easier to work with A1 instances. Initially launched in four AWS regions, they’re now available in four additional regions: Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney).

From a software perspective, you can run Amazon Machine Images for popular Linux distributions on A1 instances, including Ubuntu, Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Debian, and of course Amazon Linux 2. Applications such as the Apache HTTP Server and NGINX Plus are available too, as are all major programming languages and runtimes including PHP, Python, Perl, Golang, Ruby, NodeJS, and multiple flavors of Java including Amazon Corretto, a supported open source OpenJDK implementation.

What about containers? Good news here as well! Amazon ECS and Amazon EKS both support A1 instances. Docker has announced support for Arm-based architectures in Docker Enterprise Edition, and most Docker official images support Arm. In addition, millions of developers can now use Arm emulation to build, run, and test containers on their desktop machines before moving them to production; a quick check is shown below.
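
As a quick illustration (assuming a Docker engine with QEMU/binfmt emulation enabled for Arm, as Docker Desktop provides), you can run an arm64 image on an x86 laptop and confirm the reported architecture:

# Pulls and runs an arm64 Ubuntu image under emulation; prints aarch64
docker run --rm arm64v8/ubuntu uname -m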

As you would expect, A1 instances are seamlessly integrated with many AWS services, such as Amazon EBS, Amazon CloudWatch, Amazon Inspector, AWS Systems Manager and AWS Batch.

Now Available!
You can start using a1.metal instances today in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney). As always, we appreciate your feedback, so please don’t hesitate to get in touch via the AWS Compute Forum, or through your usual AWS support contacts.

Julien;

New M5n and R5n EC2 Instances, with up to 100 Gbps Networking

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-m5n-and-r5n-instances-with-up-to-100-gbps-networking/

AWS customers build ever-demanding applications on Amazon EC2. To support them the best we can, we listen to their requirements, go to work, and come up with new capabilities. For instance, in 2018, we upgraded the networking capabilities of Amazon EC2 C5 instances, with up to 100 Gbps networking, and significant improvements in packet processing performance. These are made possible by our new virtualization technology, aka the AWS Nitro System, and by the Elastic Fabric Adapter which enables low latency on 100 Gbps networking platforms.

In order to extend these benefits to the widest range of workloads, we’re happy to announce that these same networking capabilities are available today for both Amazon EC2 M5 and R5 instances.

Introducing Amazon EC2 M5n and M5dn instances
Since the very early days of Amazon EC2, the M family has been a popular choice for general-purpose workloads. The new M5(d)n instances uphold this tradition, and are a great fit for databases, High Performance Computing, analytics, and caching fleets that can take advantage of improved network throughput and packet rate performance.

The table below lists the new instances and their specs: each M5(d) instance size now has an M5(d)n counterpart, which supports the upgraded networking capabilities discussed above. For example, whereas the regular m5(d).8xlarge instance has a respectable network bandwidth of 10 Gbps, its m5(d)n.8xlarge sibling goes to 25 Gbps. The top of the line m5(d)n.24xlarge instance even hits 100 Gbps.

Here are the specs:

Instance Name                   Logical Processors    Memory     Local Storage (m5dn only)    EBS-Optimized Bandwidth    Network Bandwidth
m5n.large / m5dn.large          2                     8 GiB      1 x 75 GB NVMe SSD           Up to 3.5 Gbps             Up to 25 Gbps
m5n.xlarge / m5dn.xlarge        4                     16 GiB     1 x 150 GB NVMe SSD          Up to 3.5 Gbps             Up to 25 Gbps
m5n.2xlarge / m5dn.2xlarge      8                     32 GiB     1 x 300 GB NVMe SSD          Up to 3.5 Gbps             Up to 25 Gbps
m5n.4xlarge / m5dn.4xlarge      16                    64 GiB     2 x 300 GB NVMe SSD          3.5 Gbps                   Up to 25 Gbps
m5n.8xlarge / m5dn.8xlarge      32                    128 GiB    2 x 600 GB NVMe SSD          5 Gbps                     25 Gbps
m5n.12xlarge / m5dn.12xlarge    48                    192 GiB    2 x 900 GB NVMe SSD          7 Gbps                     50 Gbps
m5n.16xlarge / m5dn.16xlarge    64                    256 GiB    4 x 600 GB NVMe SSD          10 Gbps                    75 Gbps
m5n.24xlarge / m5dn.24xlarge    96                    384 GiB    4 x 900 GB NVMe SSD          14 Gbps                    100 Gbps
m5n.metal / m5dn.metal          96                    384 GiB    4 x 900 GB NVMe SSD          14 Gbps                    100 Gbps
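
If you'd like to pull these figures programmatically rather than from a table, the EC2 API exposes them; for example, this illustrative query compares the advertised network performance of a few sizes:

# Compare advertised network performance across instance types
aws ec2 describe-instance-types \
    --instance-types m5.8xlarge m5n.8xlarge m5n.24xlarge \
    --query "InstanceTypes[].[InstanceType,NetworkInfo.NetworkPerformance]" \
    --output table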

Introducing Amazon EC2 R5n and R5dn instances
The R5 family is ideally suited for memory-hungry workloads, such as high performance databases, distributed web scale in-memory caches, mid-size in-memory databases, real time big data analytics, and other enterprise applications.

The logic here is exactly the same: each R5(d) instance size has an R5(d)n counterpart. Here are the specs:

Instance Name                   Logical Processors    Memory     Local Storage (r5dn only)    EBS-Optimized Bandwidth    Network Bandwidth
r5n.large / r5dn.large          2                     16 GiB     1 x 75 GB NVMe SSD           Up to 3.5 Gbps             Up to 25 Gbps
r5n.xlarge / r5dn.xlarge        4                     32 GiB     1 x 150 GB NVMe SSD          Up to 3.5 Gbps             Up to 25 Gbps
r5n.2xlarge / r5dn.2xlarge      8                     64 GiB     1 x 300 GB NVMe SSD          Up to 3.5 Gbps             Up to 25 Gbps
r5n.4xlarge / r5dn.4xlarge      16                    128 GiB    2 x 300 GB NVMe SSD          3.5 Gbps                   Up to 25 Gbps
r5n.8xlarge / r5dn.8xlarge      32                    256 GiB    2 x 600 GB NVMe SSD          5 Gbps                     25 Gbps
r5n.12xlarge / r5dn.12xlarge    48                    384 GiB    2 x 900 GB NVMe SSD          7 Gbps                     50 Gbps
r5n.16xlarge / r5dn.16xlarge    64                    512 GiB    4 x 600 GB NVMe SSD          10 Gbps                    75 Gbps
r5n.24xlarge / r5dn.24xlarge    96                    768 GiB    4 x 900 GB NVMe SSD          14 Gbps                    100 Gbps
r5n.metal / r5dn.metal          96                    768 GiB    4 x 900 GB NVMe SSD          14 Gbps                    100 Gbps

These new M5(d)n and R5(d)n instances are powered by custom second generation Intel Xeon Scalable Processors (based on the Cascade Lake architecture) with sustained all-core turbo frequency of 3.1 GHz and maximum single core turbo frequency of 3.5 GHz. Cascade Lake processors enable new Intel Vector Neural Network Instructions (AVX-512 VNNI) which will help speed up typical machine learning operations like convolution, and automatically improve inference performance over a wide range of deep learning workloads.
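
On a running M5(d)n or R5(d)n instance, you can confirm that the VNNI instructions are exposed to the guest by checking the CPU flags Linux reports; the pattern below is the flag name used in /proc/cpuinfo:

# Prints avx512_vnni if the instruction set is available
grep -m1 -o avx512_vnni /proc/cpuinfo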

Now Available!
You can start using the M5(d)n and R5(d)n instances today in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Singapore).

We hope that these new instances will help you tame your network-hungry workloads! Please send us feedback, either on the AWS Forum for Amazon EC2, or through your usual support contacts.

Julien;

Running Java applications on Amazon EC2 A1 instances with Amazon Corretto

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/running-java-applications-on-amazon-ec2-a1-instances-with-amazon-corretto/

This post is contributed by Jeff Underhill | EC2 Principal Business Development Manager and Arthur Petitpierre | Arm Specialist Solutions Architect

 

Amazon EC2 A1 instances deliver up to 45% cost savings for scale-out applications and are powered by AWS Graviton Processors that feature 64-bit Arm Neoverse cores and custom silicon designed by AWS. Amazon Corretto is a no-cost, multiplatform, production-ready distribution of the Open Java Development Kit (OpenJDK).

Production-ready Arm 64-bit Linux builds of Amazon Corretto for JDK 8 and JDK 11 were released on Sep 17, 2019. This provided an additional Java runtime option when deploying your scale-out Java applications on Amazon EC2 A1 instances. We're fortunate to have James Gosling, the designer of Java, as a member of the Amazon team, and he recently took to Twitter to announce the General Availability (GA) of Amazon Corretto for the Arm architecture:

For those of you that like playing with Linux on ARM, the Corretto build for ARM64 is now GA.  Fully production ready. Both JDK8 and JDK11

If you're interested in experimenting with Amazon Corretto on Amazon EC2 A1 instances, read on for step-by-step instructions that will have you up and running in no time.

Launching an A1 EC2 instance

The first step is to create a running Amazon EC2 A1 instance. In this example, we demonstrate how to boot the instance using Amazon Linux 2. Starting from the AWS console, log in to your AWS account (or create a new account if you don't already have one). Once you have logged in, navigate to Amazon Elastic Compute Cloud (Amazon EC2) and click on Launch a virtual machine:


Next, select the operating system and compute architecture of the EC2 instance you want to launch. In this case, choose Amazon Linux 2 and, because we want an AWS Graviton-based A1 instance, select the 64-bit (Arm) architecture:

On the next page, select an A1 instance type. Here we pick an a1.xlarge, which offers 4 vCPUs and 8 GiB of memory (refer to the Amazon EC2 A1 page for more information). Then, select the "Review and Launch" button:


Next, you can review a summary of your instance details, as shown in the following screenshot. Note that the only network port exposed is SSH on TCP port 22, which allows you to connect to the instance remotely from an SSH terminal:


Before proceeding, be aware that you are about to start spending money (and don't forget to terminate the instance at the end to avoid ongoing charges). As the warning in the screenshot above states, the selected A1 instance is not eligible for the free tier, so you are charged based on the pricing of the instance selected (refer to the Amazon EC2 on-demand pricing page for details; the a1.xlarge instance selected here is $0.102 per hour as of this writing).

Once you're ready to proceed, select "Launch" to continue. At this point, you need to create a key pair, or supply an existing one, to use when connecting to the new instance via SSH. Details on establishing this connection can be found in the EC2 documentation.
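
If you prefer the command line over the console, a roughly equivalent launch can be scripted with the AWS CLI; the AMI ID, key pair, and security group below are placeholders you would replace with your own values:

# Launch one a1.xlarge instance from an arm64 Amazon Linux 2 AMI (placeholder values)
aws ec2 run-instances \
    --image-id <arm64-amazon-linux-2-ami-id> \
    --instance-type a1.xlarge \
    --key-name <your-key-pair> \
    --security-group-ids <sg-allowing-ssh> \
    --count 1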

In this example, I connect from a Windows laptop using PuTTY. The details of converting EC2 keys into the right format can be found here, and you can connect the same way. In the following screenshot, I use an existing key pair that I generated earlier. Select an existing key pair, or create one that best suits your workflow, and do the following:

While your instance launches, you can click on "View Instances" to see the status of the instances within your AWS account:


Once you click on “View Instances,” you can see that your newly launched instance is now in the Running state:


Now, you can connect to your new instance. Right click on the instance from within the console, then select “Connect” from the pop-up menu to get details and instructions on connecting to the instance. This is shown in the following screenshot:

The following screenshot provides you with instructions and specific details needed to connect to your running A1 instance:

You can now connect to the running a1.xlarge instance by following those instructions with your preferred SSH client.

Once connected, the Amazon Linux 2 command prompt appears as follows:

Note: running the 'uname -a' command shows that you are on an 'aarch64' architecture, which is the Linux architecture name for 64-bit Arm.


Once you complete this step, your A1 instance is up and running. From here, you can install and use Amazon Corretto 8.

 

Installing corretto8

You can now install Amazon Corretto 8 on Amazon Linux 2 by following the instructions in the documentation. Use option 1 to install the application from the Amazon Linux 2 repository:

$ sudo amazon-linux-extras enable corretto8

$ sudo yum clean metadata

$ sudo yum install -y java-1.8.0-amazon-corretto

This code initiates the installation. Once it completes, you can use the java -version command to confirm that you have the newest version of Amazon Corretto. The output looks as follows (your version may be more recent):

$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment Corretto-8.232.09.1 (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM Corretto-8.232.09.1 (build 25.232-b09, mixed mode)

This output confirms that you have Amazon Corretto 8 version 8.232.09.1 installed and ready to go. If you see a version string that doesn't mention Corretto, it means you have another version of Java already active. In this case, run the following command to change the default Java provider:

$ sudo alternatives --config java

Installing tomcat8.5 and a simple JSP application

Once the latest Amazon Corretto is installed, confirm that the Java installation works. You can do this by installing and running a simple Java application.

To run this test, install Apache Tomcat, a Java-based web application server. Then, open up a public port to make it accessible and connect to it from a browser to confirm it's running as expected.

Install tomcat8.5 from amazon-linux-extras using the following commands:

$ sudo amazon-linux-extras enable tomcat8.5
$ sudo yum clean metadata
$ sudo yum install -y tomcat 

Now configure Tomcat to use /dev/urandom as an entropy source. This is important because otherwise Tomcat might hang on a freshly booted instance if there is not enough entropy available. Note: there's a kernel patch in flight to provide an alternative entropy mechanism:

$ sudo bash -c 'echo JAVA_OPTS=\"-Djava.security.egd=file:/dev/urandom\" >> /etc/tomcat/tomcat.conf' 

Next, add a simple JavaServer Pages (JSP) application that will display details about your system.

First, create the default web application directory:

$ sudo install -d -o root -g tomcat /var/lib/tomcat/webapps/ROOT 

Then, add the small JSP application:

$ sudo bash -c 'cat <<EOF > /var/lib/tomcat/webapps/ROOT/index.jsp
<html>
<head>
<title>Corretto8 - Tomcat8.5 - Hello world</title>
</head>
<body>
  <table>
    <tr>
      <td>Operating System</td>
      <td><%= System.getProperty("os.name") %></td>
    </tr>
    <tr>
      <td>CPU Architecture</td>
      <td><%= System.getProperty("os.arch") %></td>
    </tr>
    <tr>
      <td>Java Vendor</td>
      <td><%= System.getProperty("java.vendor") %></td>
    </tr>
    <tr>
      <td>Java URL</td>
      <td><%= System.getProperty("java.vendor.url") %></td>
    </tr>
    <tr>
      <td>Java Version</td>
      <td><%= System.getProperty("java.version") %></td>
    </tr>
    <tr>
      <td>JVM Version</td>
      <td><%= System.getProperty("java.vm.version") %></td>
    </tr>
    <tr>
      <td>Tomcat Version</td>
      <td><%= application.getServerInfo() %></td>
    </tr>
</table>

</body>
</html>
EOF
'

Finally, start the Tomcat service:

$ sudo systemctl start tomcat 

Now that the Tomcat service is running, you need to configure your EC2 instance to open TCP port 8080 (the default port that Tomcat listens on). This configuration allows you to access the instance from a browser and confirm Tomcat is running and serving content.

To do this, return to the AWS console and select your EC2 a1.xlarge instance. Then, in the information panel below, select the associated security group and modify the inbound rules to allow TCP access on port 8080, as follows:

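Alternatively, you can open the port from the AWS CLI; the security group ID below is a placeholder for the group attached to your instance, and the wide-open CIDR is only appropriate for this quick test:

# Allow inbound TCP 8080 (test only; tighten the CIDR range for anything real)
aws ec2 authorize-security-group-ingress \
    --group-id <sg-id-of-your-instance> \
    --protocol tcp --port 8080 --cidr 0.0.0.0/0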

With these modifications, you should now be able to reach the page served by Tomcat by directing a browser to http://<your instance IPv4 Public IP>:8080, as follows:

 Don’t forget to terminate your EC2 instance(s) when you’re done to avoid ongoing charges!

 

To summarize, we spun up an Amazon EC2 A1 instance, installed and enabled Amazon Corretto and the Apache Tomcat server, configured the security group of the EC2 instance to accept connections on TCP port 8080, and then created and connected to a simple JSP web page. Being able to display the JSP page confirms that you're serving content and can see the underlying Java Virtual Machine and platform architecture specifications. These steps demonstrate setting up the Amazon Corretto + Apache Tomcat environment and running a demo JSP web application on AWS Graviton-based Amazon EC2 A1 instances using readily available open source software.

You can learn more at the Amazon Corretto website, and the downloads are available for Amazon Corretto 8 and Amazon Corretto 11; if you're using containers, there's also the Docker Official image. If you have any questions about your own workloads running on Amazon EC2 A1 instances, contact us at [email protected].

 

Now available in Amazon SageMaker: EC2 P3dn GPU Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-in-amazon-sagemaker-ec2-p3dn-gpu-instances/

In recent years, the meteoric rise of deep learning has made incredible applications possible, such as detecting skin cancer (SkinVision) and building autonomous vehicles (TuSimple). Thanks to neural networks, deep learning indeed has the uncanny ability to extract and model intricate patterns from vast amounts of unstructured data (e.g. images, video, and free-form text).

However, training these neural networks requires equally vast amounts of computing power. Graphics Processing Units (GPUs) have long proven that they are up to the task, and AWS customers quickly understood how they could use Amazon Elastic Compute Cloud (EC2) P2 and P3 instances to train their models, in particular on Amazon SageMaker, our fully managed, modular machine learning service.

Today, I’m very happy to announce that the largest P3 instance, named p3dn.24xlarge, is now available for model training on Amazon SageMaker. Launched last year, this instance is designed to accelerate large, complex, distributed training jobs: it has twice as much GPU memory as other P3 instances, 50% more vCPUs, blazing-fast local NVMe storage, and 100 Gbit networking.

How about we give it a try on Amazon SageMaker?

Introducing EC2 P3dn instances on Amazon SageMaker
Let’s start from this notebook, which uses the built-in image classification algorithm to train a model on the Caltech-256 dataset. All I have to do to use a p3dn.24xlarge instance on Amazon SageMaker is to set train_instance_type to 'ml.p3dn.24xlarge', and train!

ic = sagemaker.estimator.Estimator(training_image,
                                   role,
                                   train_instance_count=1,
                                   train_instance_type='ml.p3dn.24xlarge',
                                   input_mode='File',
                                   output_path=s3_output_location,
                                   sagemaker_session=sess)
...
ic.fit(...)

I ran some quick tests on this notebook, and I got a sweet 20% training speedup out of the box (your mileage may vary!). I’m using 'File' mode here, meaning that the full dataset is copied to the training instance: the faster network (100 Gbit, up from 25 Gbit) and storage (local NVMe instead of Amazon EBS) are certainly helping!

When working with large data sets, you could put 100 Gbit networking to good use either by streaming data from Amazon Simple Storage Service (S3) with Pipe Mode, or by storing it in Amazon Elastic File System or Amazon FSx for Lustre. It would also help with distributed training (using Horovod, maybe), as instances would be able to exchange parameter updates faster.

In short, the Amazon SageMaker and P3dn tag team packs quite a punch, and it should deliver a significant performance improvement for large-scale deep learning workloads.

Now available!
P3dn instances are available on Amazon SageMaker in the US East (N. Virginia) and US West (Oregon) regions. If you are ready to get started, please contact your AWS account team or use the Contact Us page to make a request.

As always, we’d love to hear your feedback, either on the AWS Forum for Amazon SageMaker, or through your usual AWS contacts.