Tag Archives: Compute

Deploying a highly available WordPress site on Amazon Lightsail, Part 4: Increasing performance and scalability with a Lightsail load balancer

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/deploying-a-highly-available-wordpress-site-on-amazon-lightsail-part-4-increasing-performance-and-scalability-with-a-lightsail-load-balancer/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail | Twitter: @mikegcoleman

This is the final post in a series about getting a highly available WordPress site up and running on Amazon Lightsail. For reference, the other blog posts are:

  1. Implementing a highly available Lightsail database with WordPress
  2. Using Amazon S3 with WordPress to securely deliver media files
  3. Increasing security and performance using Amazon CloudFront

In this post, you’ll learn how to stand up a Lightsail load balancer, create a snapshot of your WordPress server, deploy new instances from those snapshots, and place the instances behind your load balancer.

A load balancer accepts incoming web traffic and routes it to one or (usually) more servers. Having multiple servers allows you to scale the number of incoming requests your site can handle, as well as allowing your site to remain responsive if a web server fails. The following diagram shows the solution architecture, which features multiple front-end WordPress servers behind a Lightsail load balancer, a highly available Lightsail database, and uses S3 alongside CloudFront to deliver your media content securely.

Graphic showing the final HA architecture including 3 servers behind a load balancer, S3 alongside cloudfront, and a highly-available database

Prerequisites

This post assumes you built your WordPress site by following the previous posts in this series.

Configuring SSL requires a registered domain name and sufficient permissions to create DNS records for that domain.

You don’t need AWS or Lightsail to manage your domain, but this post uses Lightsail’s DNS management. For more information, see Creating a DNS zone to manage your domain’s DNS records in Amazon Lightsail.

Deploying the load balancer and configuring SSL

To deploy a Lightsail load balancer and configure it to support SSL, complete the following steps:

  1. Open the Lightsail console.
  2. From the menu, choose Networking.
  3. Choose Create Load Balancer.
  4. For Identify your load balancer, enter a name for your load balancer.

This post names the load balancer wp-lb.

  5. Choose Create Load Balancer.

The details page for your new load balancer opens. From here, add your initial WordPress server to the load balancer and configure SSL.

  6. For Target instances, choose your WordPress server.

The following screenshot indicates that this post chooses the server WordPress-1.

screenshot showing an instance being selected from the drop down

 

  7. Choose Attach.

It can take a few seconds for your instance to attach to the load balancer and the Health Check to report as Passed. See the following screenshot of the Health Check status.

Picture of health check status

  8. From the menu, choose Inbound traffic.
  9. Under Certificates, choose Create certificate.
  10. For PRIMARY DOMAIN, enter the domain name that you want to use to reach your WordPress site.

You can either accept the default certificate name Lightsail creates or change it to something you prefer. This post uses www.mikegcoleman.com.

  11. Choose Create.

The following screenshot shows the Create a certificate section.

Picture of the Create a Certificate section
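If you prefer to script this setup, the same operations are available through the Lightsail API. The following is a rough AWS CLI sketch of the console steps above, using the example names from this post (wp-lb, WordPress-1, and www.mikegcoleman.com); the certificate name wp-lb-cert is an arbitrary example, so substitute your own values.

# Create the load balancer and attach the first WordPress instance
aws lightsail create-load-balancer \
    --load-balancer-name wp-lb \
    --instance-port 80

aws lightsail attach-instances-to-load-balancer \
    --load-balancer-name wp-lb \
    --instance-names WordPress-1

# Request a TLS certificate for the site's domain; validation still requires
# the CNAME record described in the next section
aws lightsail create-load-balancer-tls-certificate \
    --load-balancer-name wp-lb \
    --certificate-name wp-lb-cert \
    --certificate-domain-name www.mikegcoleman.com

# After the certificate is validated, attach it to enable HTTPS
aws lightsail attach-load-balancer-tls-certificate \
    --load-balancer-name wp-lb \
    --certificate-name wp-lb-cert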

Creating a CNAME record

As you did with CloudFront, you need to create a CNAME record as a way of validating that you have ownership of the domain for which you are creating a certificate.

  12. Copy the random characters and the subdomain from the name field.

The following screenshot shows the example record information.

screenshot showing the portion of the name value that needs to be copied

  13. Open a second copy of the Lightsail console in another tab or window.
  14. Choose Networking.
  15. Choose your domain name.
  16. Choose Add record.
  17. From the drop-down menu, choose CNAME record.

The following screenshot shows the drop-down menu options.

Screenshot showing CNAME record selected in the dropdown box

  18. For Subdomain, enter the random characters and subdomain you copied from the load balancer page.
  19. Return to the load balancer page.
  20. Copy the entire Value.
  21. Return to the DNS page.
  22. For Maps to, enter the value string.
  23. Choose the green check box.

The following screenshot shows the CNAME record details.

Screenshot showing the completed cname record and green checkbox highlighted

  24. Return to the load balancer page and wait a few minutes before refreshing the page.

You should see a notification that the certificate is verified and ready to use. This process can take several minutes; continue refreshing the page every few minutes until the status updates. The following screenshot shows the notification message.

Screenshot showing the verification complete message for the load balancer ssl certificate

  25. In the HTTPS box, select your certificate from the drop-down menu.

The following screenshot shows the HTTPS box.

screenshot showing the newly validated certificate in the drop down box

  26. Copy the DNS name for your load balancer.

The following screenshot shows the example DNS name.

Screenshot showing the DNS name of the load balancer

  27. Return to the Lightsail DNS console and follow steps 13 through 23 as a guide in creating a CNAME record that maps your website address to the load balancer’s DNS name.

Use the subdomain you chose for your WordPress server (in this post, that’s www) and the DNS name you copied for the Maps to field.

The following screenshot shows the CNAME record details.

screenshot showing the completed cname record for the load balancer

Updating the wp-config file

The last step to configure SSL is updating the wp-config file to configure WordPress to deliver content over SSL.

  1. Start an SSH session with your WordPress server.
  2. Copy and paste the following code into the terminal window to create a temporary file that holds the configuration string that will be added to the WordPress configuration file.
cat <<EOT >> ssl_config.txt
if (\$_SERVER['HTTP_X_FORWARDED_PROTO'] == 'https') \$_SERVER['HTTPS']='on';
EOT
  3. Copy and paste the following sed command into your terminal window to add the SSL line to the configuration file.
sed -i "/define( 'WP_DEBUG', false );/r ssl_config.txt" \
/home/bitnami/apps/wordpress/htdocs/wp-config.php
  4. The sed command changes the permissions on the configuration file, so you’ll need to reset them. See the following code:
sudo chown bitnami:daemon /home/bitnami/apps/wordpress/htdocs/wp-config.php

You also need to update two variables that WordPress uses to identify which address is used to access your site.

  5. Update the WP_SITEURL variable (be sure to specify https) by running the following command in the terminal window:
wp config set WP_SITEURL https://<your wordpress site domain name>

For example, this post uses the following code:

wp config set WP_SITEURL https://www.mikegcoleman.com

You should get a response that the variable updated.

  6. Update the WP_HOME variable (be sure to specify https) by issuing the following command in the terminal window:
wp config set WP_HOME https://<your wordpress site domain name>

For example, this post uses the following code:

wp config set WP_HOME https://www.mikegcoleman.com

You should get a response that the variable updated.

  7. Restart the WordPress server to read the new configuration with the following code:
sudo /opt/bitnami/ctlscript.sh restart

After the server has restarted, visit the DNS name for your WordPress site. The site should load and your browser should report the connection is secure.
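As a quick check from the command line, you can confirm that the site responds over HTTPS and that WordPress picked up the new configuration. The domain is this post's example, and the wp command runs in the SSH session on the WordPress server.

# Expect an HTTP success (or redirect) status over HTTPS from the load balancer
curl -sI https://www.mikegcoleman.com | head -n 5

# Confirm the constant you set in wp-config.php
wp config get WP_SITEURL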

You can now finish any customization of your WordPress site, such as adding additional plugins, setting the blog name, or changing the theme.

Scaling your WordPress servers

With your WordPress server fully configured, the last step is to create additional instances and place them behind the load balancer so that if one of your WordPress servers fails, your site is still reachable. An added benefit is that your site is more scalable because there are additional servers to handle incoming requests.

Complete the following steps:

  1. On the Lightsail console, choose the name of your WordPress server.
  2. Choose Snapshots.
  3. For Create instance snapshot, enter the name of your snapshot.

This post uses the name WordPress-version-1. See the following screenshot of your snapshot details.

Screenshot of the snapshot creation dialog

  4. Choose Create snapshot.

It can take a few minutes for the snapshot creation process to finish.

  5. Click the three-dot menu icon to the right of your snapshot name and choose Create new instance.

The following screenshot shows the Recent snapshots section.

Screenshot showing the location of the three dot menu

To provide the highest level of redundancy, deploy each of your WordPress servers into a different Availability Zone within the same region. By default, the first server was placed in zone A; place the subsequent servers in two different zones (B and C would be good choices). For more information, see Regions and Availability Zones.

  6. For Instance location, choose Change AWS Region and Availability Zone.
  7. Choose Change your Availability Zone.
  8. Choose an Availability Zone you have not used previously.

The following screenshot shows the Availability Zones to choose from.

screenshot showing availability zone b selected

  9. Give your instance a new name.

This post names the instance WordPress-2.

  10. Choose Create Instance.

You should have at least two WordPress server instances to provide a minimum degree of redundancy. To add more, create additional instances by following steps 1–10.

Return to the Lightsail console home page, and wait for your instances to report back Running.
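If you want to script the snapshot-and-clone workflow, a minimal Lightsail CLI sketch looks like the following; it assumes the names used in this post (WordPress-1, WordPress-version-1, WordPress-2), and the Availability Zone and bundle ID are placeholders (you can list valid bundle IDs with aws lightsail get-bundles).

# Snapshot the fully configured server
aws lightsail create-instance-snapshot \
    --instance-name WordPress-1 \
    --instance-snapshot-name WordPress-version-1

# Launch a copy of that snapshot into a different Availability Zone
aws lightsail create-instances-from-snapshot \
    --instance-snapshot-name WordPress-version-1 \
    --instance-names WordPress-2 \
    --availability-zone us-east-1b \
    --bundle-id medium_2_0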

Adding your instances to the load balancer

Now that you have your additional WordPress instances created, add them to the load balancer. This is the same process you followed previously to add the first instance:

  1. On the Lightsail console, choose Networking.
  2. Choose the load balancer you previously created.
  3. Choose Attach another.
  4. From the drop-down menu, select the name of your instance.

The following screenshot shows the available instances on the drop-down menu.

screenshot showing the WordPress instances in the load balancer drop down

  5. Choose Attach.
  6. Repeat steps 3–5 for any additional instances.

When the instances report back Passed, your site is fully up and running.
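You can also check instance health from the CLI instead of refreshing the console; this sketch assumes the load balancer name wp-lb used earlier in the post.

# Each attached instance should eventually report a healthy state
aws lightsail get-load-balancer \
    --load-balancer-name wp-lb \
    --query 'loadBalancer.instanceHealthSummary'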

Conclusion

You have configured your site to provide a good degree of redundancy and performance, while delivering your content over secure connections. S3 and CloudFront allow your site to deliver static content over a secured connection, while the Lightsail load balancer and database make sure your site can keep serving your customers in the event of a failure.

If you haven’t got one already, head on over and create a free AWS account, and start building your own WordPress site – or whatever else you might think of!

We love SQL Server running on AWS almost as much as our customers

Post Syndicated from Sandy Carter original https://aws.amazon.com/blogs/compute/we-love-sql-server-running-on-aws-almost-as-much-as-our-customers/

We love SQL Server running on AWS almost as much as our customers. Microsoft SQL Server 2019 became generally available on November 8, 2019, and is now available on AWS. More customers run SQL Server on AWS than on any other cloud, and they trust AWS for a number of reasons.

The first is performance. Recent performance benchmarks show that AWS delivers great price-performance for running SQL Server. ZK Research points out that, using HammerDB (a TPC-C-like benchmark tool), SQL Server on AWS consistently shows price performance over two times better than Azure. These results come from analysis done by ZK Research based on independent testing results published by DBBest. Furthermore, we offer fast, high-throughput storage options with Amazon EBS and local NVMe instance storage. We know that getting better application performance is critical for your customers’ satisfaction. In fact, excellent application performance leads to 39% higher* customer satisfaction, while poor performance may lead to damaged reputations or, even worse, customer attrition. To make sure you have the best possible experience for your customers, we have focused on pushing the boundaries around performance.

For example, the value of running SQL Server on AWS is shown in Pearson’s migration story. Pearson is a British-owned education publishing and assessment service for schools and corporations, as well as for students directly. Pearson owns educational media brands including Addison–Wesley, Prentice Hall, eCollege, and others. Schoolnet, one of their offerings, tests tens of millions of students and is used by tens of thousands of educators. Pearson migrated Schoolnet, which was an on-premises application that used SQL Server, over to the AWS cloud. The goal of the migration was to ensure that the high volume of tests run daily would execute efficiently and effectively. As many customers do, Pearson had built for worst-case demand and over-provisioned, with a massive infrastructure sized for potential (real!) peaks. When they moved to AWS, not only did they see efficiencies in cost, but SQL Server was far easier to manage and was much faster. In fact, at our re:Invent conference, Ian Wright told the audience that the level 2 support desk received a wave of positive feedback after the migration, in particular that Schoolnet was running faster than ever before!

Second, customers need high availability for mission-critical applications written using SQL Server. We have the best global infrastructure for running workloads that require high availability. The AWS Global Infrastructure underpins SQL Server on AWS and spans 69 Availability Zones (AZs) within 22 geographic regions around the world. These AZs are designed for physical redundancy and provide resilience, which enables uninterrupted performance. In 2018, the next-largest cloud provider had almost seven times more downtime hours than AWS.

Third, while CIOs tell us that cost is not the key factor in their decision to move to the cloud (agility and innovation are usually at the center of their motivation), they are often impressed with the cost savings they see when bringing their SQL Server workloads to AWS. We typically see at least 20% savings with just a lift-and-shift. Over the first few months, you can continue to optimize your EC2 instances for an additional 10–20% savings. By adopting higher level services, you can further optimize; many customers saw 60% or more savings.

For example, Axinom ran its Windows-based applications in an on-premises environment that made it difficult to scale to meet increasing user traffic. The company also wanted to boost scalability and cut costs. Axinom moved its applications, including Axinom CMS and Axinom DRM, to the AWS Cloud. The company runs its Microsoft SQL Server–based platform on AWS and uses Spot Instances to optimize costs. As Johannes Jauch, the Chief Technology Officer of Axinom, said, “We have cut costs for supporting our digital media supply chain services by 70 percent using AWS products such as Amazon Spot Instances. As a result, we can provide more competitive pricing for our global customers.” And Axinom is not the only customer. Customers like Salesforce, Adobe, and Decisiv are benefitting from increased productivity and agility running SQL Server on AWS. You can read more about how customers are unlocking maximum business value by migrating to AWS.

And finally, not only does AWS offer more security, compliance, and governance services and key features than the next largest cloud provider, we also have the most migration experience.

All these benefits mean that the new features of SQL Server 2019, such as big data clusters, Always Encrypted with secure enclaves, and improvements to SQL Server on Linux, run better on AWS. We are also happy to announce that you can now launch EC2 instances that run Windows Server 2019/2016 and four editions of SQL Server 2019 (Web, Express, Standard, and Enterprise). The Amazon Machine Images (AMIs) are available today in all AWS Regions and run on a wide variety of EC2 instance types. You can launch these instances from the AWS Management Console, AWS CLI, or through AWS Marketplace. To get started with SQL Server 2019 on AWS, you can either purchase a License Included EC2 instance or, if you have Software Assurance, choose between two Bring Your Own License (BYOL) options! You can BYOL SQL Server on an AWS instance with license-included Windows, or you can BYOL SQL Server on a dedicated host with BYOL Windows (provided the Windows license was purchased before October 1, 2019).

Join the customers on SQL Server on AWS today! To learn more about SQL Server 2019 and to explore your licensing options, visit Microsoft SQL Server on AWS. If you need advice and guidance as you plan your migration effort, check out the AWS Partners who have qualified for the AWS Microsoft Workloads Competency and focus on database solutions. Please join me and the AWS team at AWS re:Invent (December 2–6 in Las Vegas).

*Source: Netmagic, https://www.netmagicsolutions.com/data/images/WP_How-End-User-Experience-Affects-Your-Bottom-Line16-08-231471935227.pdf

Deploying a Highly available WordPress site on Amazon Lightsail, Part 3: Increasing security and performance using Amazon CloudFront

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/deploying-a-highly-available-wordpress-site-on-amazon-lightsail-part-3-increasing-security-and-performance-using-amazon-cloudfront/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail | Twitter: @mikegcoleman

The previous posts in this series (Implementing a highly available Lightsail database with WordPress and Using Amazon S3 with WordPress to securely deliver media files), showed how to build a WordPress site and configure it to use Amazon S3 to serve your media assets. The next step is to use an Amazon CloudFront distribution to increase performance and add another level of security.

CloudFront is a content delivery network (CDN). It essentially caches content from your server on endpoints across the globe. When someone visits your site, they hit one of the CloudFront endpoints first. CloudFront determines if the requested content is cached. If so, CloudFront responds to the client with the requested information. If the content isn’t cached on the endpoint, it is loaded from the actual server and cached on CloudFront so subsequent requests can be served from the endpoint.

This process speeds up response time, because the endpoint is usually closer to the client than the actual server, and it reduces load on the server, because any request that CloudFront can handle is one less request your server needs to deal with.

In this post’s configuration, CloudFront only caches the media files stored in your S3 bucket, but CloudFront can cache more. For more information, see How to Accelerate Your WordPress Site with Amazon CloudFront.

Another benefit of CloudFront is that it responds to requests over HTTPS. Because some requests are served by the WordPress server and others from CloudFront, it’s important to secure both connections with HTTPS. Otherwise, the customer’s web browser shows a warning that the site is not secure. The next post in this series shows how to configure HTTPS for your WordPress server.

Solution overview

To configure CloudFront to work with your Lightsail WordPress site, complete the following steps:

  1. Request a certificate from AWS Certificate Manager.
  2. Create a CloudFront distribution.
  3. Configure the WP Offload Media Lite plugin to use the CloudFront distribution.

The following diagram shows the architecture of this solution.

Image showing WordPress architecture with Lightsail, CloudFront, S3, and a Lightsail database

Prerequisites

This post assumes that you built your WordPress site by following the previous posts in this series.

Configuring SSL requires that you have a registered domain name and sufficient permissions to create DNS records for that domain.

You don’t need AWS or Lightsail to manage your domain, but this post uses Lightsail’s DNS management. For more information, see Creating a DNS zone to manage your domain’s DNS records in Amazon Lightsail.

Creating the SSL certificate

This post uses two different subdomains for your WordPress site. One points to your main site, and the other points to your S3 bucket. For example, a customer visits https://www.example.com to access your site. When they click on a post that contains a media file, the post body loads off of https://www.example.com, but the media file loads from https://media.example.com, as depicted in the previous graphic.

Create the SSL certificate for CloudFront to use with the S3 bucket with the following steps. The next post in this series shows how to create the SSL certificate for your WordPress server.

  1. Open the ACM console.
  2. Under Provision certificates, choose Get started.
  3. Choose Request a public certificate.
  4. Choose Request a certificate.
  5. For Domain name*, enter the name of the domain to serve your media files.

This post uses media.mikegcoleman.com.

  6. Choose Next.

The following screenshot shows the example domain name.

Screen shot showing domain name dialog

  7. Select Leave DNS validation.
  8. Choose Review.
  9. Choose Confirm and request.

ACM needs you to validate that you have the necessary privileges to request a certificate for the given domain by creating a special DNS record.

Choose the arrow next to the domain name to show the values for the record you need to create. See the following screenshot.

Screen shot showing verification record

You now need to use Lightsail’s DNS management to create the CNAME record to validate your certificate.

  10. In a new tab or window, open the Lightsail console.
  11. Choose Networking.
  12. Choose the domain name for the appropriate domain.
  13. Under DNS records, choose Add record.
  14. From the drop-down menu, choose CNAME record.

The following screenshot shows the menu options.

Screen shot showing CNAME record selected in dropdown box

  15. Navigate back to ACM.

Under Name, the value is formatted as randomcharacters.subdomain.domain.com.

  16. Copy the random characters and the subdomain.

For example, if the value was _f836d9f10c45c6a6fbe6ba89a884d9c4.media.mikegcoleman.com, you would copy _f836d9f10c45c6a6fbe6ba89a884d9c4.media.

  17. Return to the Lightsail DNS console.
  18. Under Subdomain, enter the value you copied.
  19. Return to the ACM console.
  20. For Value, copy the entire string.
  21. Return to the Lightsail DNS console.
  22. For Maps to, enter the value you copied.
  23. Choose the green checkmark.

The following screenshot shows the completed record details in the Lightsail DNS console.

screen shot showing where the green checkmark is and completed domain record

ACM periodically checks DNS to see if this record exists, and validates your certificate when it finds the record. The validation usually takes approximately 15 minutes; however, it could take up to 48 hours.

To track the validation, return to the ACM console. You can periodically refresh this page until you see the status field change to Issued. See the following screenshot.

Screenshot highlighting the updated status of the domain validation
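The certificate request can also be made from the AWS CLI. Note that certificates used with CloudFront must be issued in the US East (N. Virginia) Region, us-east-1. The domain below is this post's example, and the certificate ARN in the second command is a placeholder for the value returned by the first.

# Request a DNS-validated certificate for the media subdomain
aws acm request-certificate \
    --domain-name media.mikegcoleman.com \
    --validation-method DNS \
    --region us-east-1

# Poll until the status changes from PENDING_VALIDATION to ISSUED
aws acm describe-certificate \
    --certificate-arn <certificate-arn> \
    --region us-east-1 \
    --query 'Certificate.Status'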

Building the CloudFront distribution

Now that you have the certificate, you’re ready to create your CloudFront distribution. Complete the following steps:

  1. Open the CloudFront console.
  2. Choose Create distribution.
  3. Under Web, choose Get Started.
  4. For Origin Domain Name, select the S3 bucket you created for your WordPress media files.

The following screenshot shows the options available in the Origin Domain Name drop-down menu. This post chooses mike-wp-bucket.s3.amazonaws.com.

Screenshot of the S3 bucket selected from the Origin Domain Name dialog

By default, WordPress does not pass information that indicates when to clear an item in the cache; specifying how to configure that functionality is out of the scope of this post.

Because WordPress doesn’t send this information, you need to set a default time to live (TTL) in CloudFront.

As a starting point, set the value to 900 seconds (15 minutes). This means that if you load a post, and that post includes media, CloudFront checks the cache for that media. If the media is in the cache but has been there longer than 15 minutes, CloudFront requests the media from the origin (your S3 bucket) and updates the cache.

While 15 minutes is a reasonable starting value for media, the optimal value for the TTL depends on how you want to balance delivering your clients the latest content with performance and cost.

  5. For Object Caching, choose Customize.
  6. For Default TTL, enter 900.

The following screenshot shows the TTL options.

Image of TTL options

CloudFront has endpoints across the globe, and the price you pay depends on the number of endpoints you have configured. If your traffic is localized to a certain geographic region, you may want to restrict which endpoints CloudFront uses.

  7. Under Distribution Settings, for Price Class, choose the appropriate setting.

This post chooses Use All Edge Locations. See the following screenshot. You should choose a setting that makes sense for your site; choosing only a subset of price classes reduces costs.

Screenshot showing the choices for price classes

  8. For Alternate Domain Names (CNAMEs), enter the name of the subdomain for your S3 bucket.

This name is the same as the subdomain you created the certificate for in the previous steps. This post uses media.mikegcoleman.com.

Assign the certificate you created earlier.

  9. Choose Custom SSL Certificate.

When you start typing in the text field, the certificate you created earlier appears in the drop-down menu.

  10. Choose your certificate.

The following screenshot shows the drop-down menu with custom SSL certificates.

Screenshot showing the previously created SSL certificate being selected

  11. Choose Create Distribution.

It can take 15–30 minutes to configure the distribution.

The next step is to create a DNS record that points your media subdomain to the CloudFront distribution.

  12. Return to the CloudFront console and choose the ID of the distribution.

The following screenshot shows available distribution IDs.

screen shot showing the distribution ID highlighted

The page displays the distribution ID details.

  13. Copy the Domain Name.

The following screenshot shows the distribution ID details.

Screenshot showing the domain name highlighted in the distribution details

  14. Return to the Lightsail DNS console.
  15. Choose Add record.
  16. Choose CNAME record.
  17. For Subdomain, enter the subdomain you are using for the S3 bucket.

This post uses the domain media.mikegcoleman.com, so the value is media.

  18. For Maps to, enter the domain name of the CloudFront distribution that you previously copied.

The following screenshot shows the CNAME record details.

Screenshot of the completed DNS CNAME record

  19. Choose the green check box.
  20. Return to the CloudFront console.

It can take 15–30 minutes for the distribution status to change to Deployed. See the following screenshot.

Screenshot of the updated CloudFront distribution status
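Instead of refreshing the console, you can also poll or wait on the distribution status with the CLI; the distribution ID below is a placeholder for the ID shown on your distribution's detail page.

# One-off status check; "InProgress" becomes "Deployed" when the rollout finishes
aws cloudfront get-distribution \
    --id <distribution-id> \
    --query 'Distribution.Status'

# Or block until the distribution finishes deploying
aws cloudfront wait distribution-deployed --id <distribution-id>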

Configuring the plugin

The final step is to configure the WP Offload Media Lite plugin to use the newly created CloudFront distribution. Complete the following steps:

  1. Log in to the admin dashboard for your WordPress site.
  2. Under Plugins, choose WP Offload Media Lite.
  3. Choose Settings.
  4. For Custom Domain (CNAME), select On.
  5. Enter the domain name for your S3 bucket.

This post uses media.mikegcoleman.com as an example.

  6. For Force HTTPS, select On.

This makes sure that all media is served over HTTPS.

  7. Choose Save Changes.

The following screenshot shows the URL REWRITING and ADVANCED OPTIONS details.

Screenshot of the cloud distribution network settings in the S3 plugin

You can load an image from your media library to confirm that everything is working correctly.

  8. Under Media, choose Library.
  9. Choose an image in your library.

If you don’t have an image in your library, add a new one now.

The URL listed should start with the domain you configured for your S3 bucket. For this post, the URL starts with media.mikegcoleman.com.

The following screenshot shows the image details.

Screenshot highlighting the URL of the S3 asset

To confirm that the image loads correctly, and there are no SSL errors or warnings, copy the URL value and paste it into your web browser.
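You can run the same check from the command line and confirm the CloudFront cache behavior at the same time. The image path below is hypothetical; use the URL you copied from your media library.

# X-Cache shows "Miss from cloudfront" on the first request and
# "Hit from cloudfront" on repeat requests until the TTL expires
curl -sI https://media.mikegcoleman.com/wp-content/uploads/sample.jpg \
    | grep -iE '^(HTTP|x-cache)'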

Conclusion

You now have your media content served securely from CloudFront. The final post in this series demonstrates how to create multiple instances of your WordPress server and place those behind a Lightsail load balancer to increase performance and enhance availability.

If you haven’t got one already, head on over and create a free AWS account, and start building your own WordPress site – or whatever else you might think of!

Now Available: New C5d Instance Sizes and Bare Metal Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-new-c5d-instance-sizes-and-bare-metal-instances/

Amazon EC2 C5 instances are very popular for running compute-heavy workloads like batch processing, distributed analytics, high-performance computing, machine/deep learning inference, ad serving, highly scalable multiplayer gaming, and video encoding.

In 2018, we added blazing fast local NVMe storage, and named these new instances C5d. They are a great fit for applications that need access to high-speed, low latency local storage like video encoding, image manipulation and other forms of media processing. They will also benefit applications that need temporary storage of data, such as batch and log processing and applications that need caches and scratch files.

Just a few weeks ago, we launched new instance sizes and a bare metal option for C5 instances. Today, we are happy to add the same capabilities to the C5d family: 12xlarge, 24xlarge, and a bare metal option.

The new C5d instance sizes run on Intel’s Second Generation Xeon Scalable processors (code-named Cascade Lake) with a sustained all-core turbo frequency of 3.6 GHz and a maximum single-core turbo frequency of 3.9 GHz.

The new processors also enable a new feature called Intel Deep Learning Boost, a capability based on the AVX-512 instruction set. Thanks to the new Vector Neural Network Instructions (AVX-512 VNNI), deep learning frameworks will speed up typical machine learning operations like convolution, and automatically improve inference performance over a wide range of workloads.

These instances are based on the AWS Nitro System, with dedicated hardware accelerators for EBS processing (including crypto operations), the software-defined network inside of each Virtual Private Cloud (VPC), and ENA networking.

New Instance Sizes for C5d: 12xlarge and 24xlarge
Here are the specs:

Instance Name | Logical Processors | Memory | Local Storage | EBS-Optimized Bandwidth | Network Bandwidth
c5d.12xlarge | 48 | 96 GiB | 2 x 900 GB NVMe SSD | 7 Gbps | 12 Gbps
c5d.24xlarge | 96 | 192 GiB | 4 x 900 GB NVMe SSD | 14 Gbps | 25 Gbps

Previously, the largest C5d instance available was c5d.18xlarge, with 72 logical processors, 144 GiB of memory, and 1.8 TB of storage. As you can see, the new 24xlarge size increases available resources by 33%, in order to help you crunch those super heavy workloads. Last but not least, customers also get 50% more NVMe storage per logical processor on both 12xlarge and 24xlarge, with up to 3.6 TB of local storage!
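The local NVMe drives on C5d instances are instance store volumes: you format and mount them yourself, and their contents do not persist through a stop or termination. A minimal sketch on Amazon Linux 2, assuming the first local SSD appears as /dev/nvme1n1 (device names vary, so check lsblk first):

# List block devices; the 900 GB local NVMe SSDs appear alongside the EBS root volume
lsblk

# Format one local SSD and mount it as scratch space
sudo mkfs -t xfs /dev/nvme1n1
sudo mkdir -p /scratch
sudo mount /dev/nvme1n1 /scratch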

Bare Metal C5d
As is the case with the existing bare metal instances (M5, M5d, R5, R5d, z1d, and so forth), your operating system runs on the underlying hardware and has direct access to processor and other hardware.

Bare metal instances can be used to run software with specific requirements, e.g. applications that are exclusively licensed for use on physical, non-virtualized hardware. These instances can also be used to run tools and applications that require access to low-level processor features such as performance counters.

Here are the specs:

Instance Name | Logical Processors | Memory | Local Storage | EBS-Optimized Bandwidth | Network Bandwidth
c5d.metal | 96 | 192 GiB | 4 x 900 GB NVMe SSD | 14 Gbps | 25 Gbps

Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

Now Available!
You can start using these new instances today in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Europe (London), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), South America (São Paulo), and AWS GovCloud (US-West).

Please send us feedback, either on the AWS forum for Amazon EC2, or through your usual AWS support contacts.

Julien;

Processing batch jobs quickly, cost-efficiently, and reliably with Amazon EC2 On-Demand and Spot Instances

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

 

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:

  • A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
  • An Amazon SQS queue to manage the tasks.
  • A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they’re complete (a minimal sketch of this loop appears after this list).
  • Amazon EC2 Auto Scaling groups to model scenarios.
  • Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.
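The consumer in this environment is a Python script, but its core loop is small enough to sketch with the AWS CLI; the queue URL below is a placeholder, and the sleep stands in for the simulated work.

QUEUE_URL="https://sqs.eu-west-1.amazonaws.com/123456789012/batch-tasks"   # placeholder

while true; do
    # Long-poll for one task; it stays invisible for the queue's visibility timeout
    RECEIPT=$(aws sqs receive-message \
        --queue-url "$QUEUE_URL" \
        --max-number-of-messages 1 \
        --wait-time-seconds 10 \
        --query 'Messages[0].ReceiptHandle' \
        --output text)

    # An empty result means no messages were returned
    if [ -z "$RECEIPT" ] || [ "$RECEIPT" = "None" ]; then
        break
    fi

    sleep 480   # simulate roughly eight minutes of work per task

    # Delete the task only after the work completes, so interrupted work is re-queued
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
done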

 

Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, using 20 c5.2xlarge instances for five hours results in a cost of $0.384 per hour for each instance.  The batch total is $38.40.

 

Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient capacity for On-Demand Instances to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and also allowing the batch to complete much earlier than it would otherwise. When the queue is empty it automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity by diversification.

Spot Instances occasionally experience interruptions when AWS must reclaim the capacity with a two-minute warning. You can handle this occurrence gracefully by configuring your batch processor code to react to the interruption, such as checkpointing progress to some data store. This example sets the visibility timeout on the SQS queue to nine minutes, so SQS re-queues any task that doesn’t complete in that time.

To test the impact of the new configuration another 6000 tasks are submitted into the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.

 

Instance Type | Provisioning | Host Count | Hourly Instance Cost | Total 30-minute batch cost
c5.18xlarge | Spot | 15 | $1.2367 | $9.2753
c5.2xlarge | Spot | 22 | $0.1547 | $1.7017
c5.4xlarge | Spot | 12 | $0.2772 | $1.6632
c5.9xlarge | Spot | 11 | $0.6239 | $3.4315
c5.2xlarge | On-Demand | 13 | $0.3840 | $2.4960
c5.4xlarge | On-Demand | 3 | $0.7680 | $1.1520
c5.9xlarge | On-Demand | 4 | $1.7280 | $3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.

 

Summary

This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Migrating Azure VM to AWS using AWS SMS Connector for Azure

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/migrating-azure-vm-to-aws-using-aws-sms-connector-for-azure/

AWS SMS is an agentless service that facilitates and expedites the migration of your existing workloads to AWS. The service enables you to automate, schedule, and monitor incremental replications of active server volumes, which facilitates large-scale server migration coordination. Previously, you could only migrate virtual machines (VMs) running in VMware vSphere and Microsoft Hyper-V environments. Now, you can use the simplicity and ease of AWS Server Migration Service (SMS) to migrate virtual machines running on Microsoft Azure. You can discover Azure VMs, group them into applications, and migrate a group of applications as a single unit without having to go through the hassle of coordinating the replication of the individual servers or decoupling application dependencies. SMS significantly reduces application migration time, as well as decreases the risk of errors in the migration process.

 

This post takes you step-by-step through how to provision the SMS virtual machine on Microsoft Azure, discover the virtual machines in a Microsoft Azure subscription, create a replication job, and finally launch the instance on AWS.

 

1- Provisioning the SMS virtual machine

To provision your SMS virtual machine on Microsoft Azure, complete the following steps.

  1. Download three PowerShell scripts listed under Step 1 of Installing the Server Migration Connection on Azure.
File | URL
Installation script | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1
MD5 hash | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1.md5
SHA256 hash | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1.sha256

 

  2. To validate the integrity of the files, compare their checksums. You can use PowerShell 5.1 or newer.

 

2.1 To validate the MD5 hash of the aws-sms-azure-setup.ps1 script, run the following command and wait for an output similar to the following result:

Command to validate the MD5 hash of the aws-sms-azure-setup.ps1 script

2.2 To validate the SHA256 hash of the aws-sms-azure-setup.ps1 file, run the following command and wait for an output similar to the following result:

Command to validate the SHA256 hash of the aws-sms-azure-setup.ps1 file

2.3 Compare the returned values by opening the aws-sms-azure-setup.ps1.md5 and aws-sms-azure-setup.ps1.sha256 files in your preferred text editor.

2.4 To validate if the PowerShell script has a valid Amazon Web Services signature, run the following command and wait for an output similar to the following result:

Command to validate if the PowerShell script has a valid Amazon Web Services signature

 

  3. Before running the script for provisioning the SMS virtual machine, you must have an Azure Virtual Network and an Azure Storage Account in which SMS temporarily stores metadata for the tasks it performs against the Microsoft Azure subscription. A good recommendation is to use the same Azure Virtual Network as the Azure Virtual Machines being migrated, since the SMS virtual machine communicates with AWS endpoints as well as the Azure Cloud Service through REST API calls. It is not necessary for the SMS virtual machine to have a public IP or inbound internet rules.

 

4.  Run the installation script .\aws-sms-azure-setup.ps1

Screenshot of running the installation script

  5. Enter the name of the existing Storage Account and Azure Virtual Network in the subscription:

Screenshot of where to enter Storage Account Name and Azure Virtual Network

  6. The Microsoft Azure modules import into the local PowerShell session, and you receive a prompt for credentials to access the subscription.

Azure login credentials

  7. A summary of the created features appears, similar to the following:

Screenshot of created features

  8. Wait for the process to complete. It may take a few minutes:

screenshot of processing jobs

  9. After provisioning completes, the script outputs the Object Id of the System Assigned Identity and the private IP. Save this information; it is used to register the connector with the SMS service in step 23.

Screenshot of the information to save

  10. To check the provisioned resources, log in to the Microsoft Azure Portal and select the Resource Group option. The provided AWS script created a role in Microsoft Azure IAM that allows the virtual machine to use the necessary services through REST APIs over HTTPS and to authenticate via the Azure Instance Metadata Service (IMDS).

Screenshot of provisioned resources log in Microsoft Azure Portal

  11. As a requirement, you need to create an IAM user that contains the necessary permissions for the SMS service to perform the migration. To do this, log in to your AWS account at https://aws.amazon.com/console and, under Services, select IAM. Then select Users, and click Add user.

Screenshot of AWS console. add user

 

  12. On the Add user page, enter a username and check the option Programmatic access. Click Next: Permissions.

Screenshot of adding a username

  13. Attach the existing policy named ServerMigrationConnector. This policy allows the AWS Connector to connect and execute API requests against AWS. Click Next: Tags.

Adding policy ServerMigrationConnector

  14. Optionally add tags to the user. Click Next: Review.

Screenshot of option to add tags to the user

15. Click Create User and save the Access Key and Secret Access Key. This information is used during the AWS SMS Connector setup.

Create User and save the access key and secret access key

 

  16. From a computer that has access to the Azure Virtual Network, access the SMS virtual machine configuration using a browser and the private IP previously recorded from the output of the script. In this example, the URL is https://10.0.0.4.

Screenshot of accessing the SMS Virtual Machine configuration

  17. On the main page of the SMS virtual machine, click Get Started Now.

Screenshot of the SMS virtual machine start page

  18. Read and accept the terms of the contract, then click Next.

Screenshot of accepting terms of contract

  19. Create a password that you will use later to log in to the connector management console and click Next.

Screenshot of creating a password

  20. Review the Network Info and click Next.

Screenshot of reviewing the network info

  21. Choose whether you would like to opt in to sending anonymous log data to AWS, then click Next.

Screenshot of option to add log data to AWS

  22. Enter the Access Key and Secret Access Key for the IAM user that has only the ServerMigrationConnector policy attached. Also, select the Region in which the SMS endpoint will be used and click Next. This access key was created in steps 11 through 15.

Selet AWS Region, and Insert Access Key and Secret Key

  23. Enter the Object Id of the System Assigned Identity copied in step 9 and click Next.

Enter Object Id of System Assigned Identity

  24. Congratulations, you have successfully configured the Azure connector. Click Go to connector dashboard.

Screenshot of the successful configuration of the Azure connector

  25. Verify that the connector status is HEALTHY by clicking Connectors on the menu.

Screenshot of verifying that the connector status is healthy

 

2 – Replicating Azure Virtual Machines to Amazon Web Services

  1. Access the SMS console and go to the Servers option. Click Import Server Catalog or Re-Import Server Catalog if it has been previously executed.

Screenshot of SMS console and servers option

  2. Select the Azure Virtual Machines to be migrated and click Create Replication Job.

Screenshot of Azure virtual machines migration

  3. Select which type of licensing best suits your environment, such as:

– Auto (Current licensing autodetection)

– AWS (License Included)

– BYOL (Bring Your Own License).
See options: https://aws.amazon.com/windows/resources/licensing/

Screenshot of best type of licensing for your environment

  4. Select the appropriate replication frequency, when the replication should start, and the IAM service role. You can leave the role blank, and the SMS service uses the built-in service role “sms”.

Screenshot of replication jobs and IAM service role

  5. A summary of the settings is displayed. Click Create.
    Screenshot of the summary of settings displayed
  6. In the SMS console, go to the Replication Jobs option and follow the replication job status:

Overview of replication jobs
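If you prefer the command line, the same replication status is available through the SMS API. This is a rough sketch; the replication job ID is a placeholder taken from the output of the first command.

# List replication jobs and their current state
aws sms get-replication-jobs

# Inspect the individual runs for one job
aws sms get-replication-runs --replication-job-id <replication-job-id>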

  7. After completion, access the EC2 console and go to AMIs; the AMIs generated by SMS now appear in this list. In the example below, several AMIs were generated because the replication frequency is one hour.

List of AMIs generated by SMS

  8. Now navigate to the SMS console, click Launch Instance, and follow the on-screen process for creating a new Amazon EC2 instance.

SMS console and Launch Instance screenshot

 

3 – Conclusion

This solution provides a simple, agentless, non-intrusive way to migrate your Azure virtual machines to AWS with the AWS Server Migration Service.

 

For more about Windows Workloads on AWS go to:  http://aws.amazon.com/windows

 

About the Author

Photo of the Author

 

 

Marcio Morales is a Senior Solution Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on running their Microsoft workloads on AWS.

Optimizing deep learning on P3 and P3dn with EFA

Post Syndicated from whiteemm original https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa/

This post is written by Rashika Kheria, Software Engineer, Purna Sanyal, Senior Solutions Architect, Strategic Account and James Jeun, Sr. Product Manager

The Amazon EC2 P3dn.24xlarge instance is the latest addition to the Amazon EC2 P3 instance family, with upgrades to several components. This high-end size of the P3 family allows users to scale out to multiple nodes for distributed workloads more efficiently.  With these improvements to the instance, you can complete training jobs in a shorter amount of time and iterate on your Machine Learning (ML) models faster.

 

This blog reviews the significant upgrades with p3dn.24xlarge, walks you through deployment, and shows an example ML use case for these upgrades.

 

Overview of P3dn instance upgrades

The most notable upgrade to the p3dn.24xlarge instance is the 100-Gbps network bandwidth and the new EFA network interface that allows for highly scalable internode communication. This means you can scale application runs out to thousands of GPUs, which reduces the time to get results. EFA’s operating system bypass networking mechanism and the underlying Scalable Reliable Datagram (SRD) protocol are built in to the Nitro controllers. The Nitro controllers enable a low-latency, low-jitter channel for inter-instance communication. EFA has been adopted in the mainline Linux kernel and integrated with Libfabric and various distributions. AWS worked with NVIDIA for EFA to support the NVIDIA Collective Communication Library (NCCL). NCCL optimizes multi-GPU and multi-node communication primitives and helps achieve high throughput over NVLink interconnects.

 

The following diagram shows the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.

the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.

 

The following table summarizes the full set of differences between p3.16xlarge and p3dn.24xlarge.

Feature | p3.16xl | p3dn.24xl
Processor | Intel Xeon E5-2686 v4 | Intel Skylake 8175 (with AVX-512)
vCPUs | 64 | 96
GPU | 8x 16 GB NVIDIA Tesla V100 | 8x 32 GB NVIDIA Tesla V100
RAM | 488 GB | 768 GB
Network | 25 Gbps ENA | 100 Gbps ENA + EFA
GPU Interconnect | NVLink – 300 GB/s | NVLink – 300 GB/s

 

P3dn.24xl offers more networking bandwidth than p3.16xl. Paired with EFA’s communication library, this feature increases scaling efficiencies drastically for large-scale, distributed training jobs. Other improvements include double the GPU memory for large datasets and batch sizes, increased system memory, and more vCPUs. This upgraded instance is the most performant GPU compute option on AWS.

 

The upgrades also improve your workload around distributed deep learning. The GPU memory improvement enables higher intranode batch sizes. The newer Layer-wise Adaptive Rate Scaling (LARS) has been tested with ResNet50 and other deep neural networks (DNNs) to allow for larger batch sizes. The increased batch sizes reduce wall-clock time per epoch with minimal loss of accuracy. Additionally, using 100-Gbps networking with EFA heightens performance with scale. Greater networking performance is beneficial when updating weights for a large number of parameters. You can see high scaling efficiency when running distributed training on GPUs for ResNet50 type models that primarily use images for object recognition. For more information, see Scalable multi-node deep learning training using GPUs in the AWS Cloud.

 

Natural language processing (NLP) also presents large compute requirements for model training. This large compute requirement is especially present with the arrival of large Transformer-based models like BERT and GPT-2, which have up to a billion parameters. The following describes how to set up distributed model trainings with scalability for both image and language-based models, and also notes how the AWS P3 and P3dn instances perform.

 

Optimizing your P3 family

First, optimize your P3 instances with an important environmental update. This update applies to traditional TCP-based networking and is available in the latest release of NCCL (2.4.8 as of this writing).

 

Two new environmental variables are available, which allow you to take advantage of multiple TCP sockets per thread: NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD.

 

These environmental variables allow the NCCL backend to exceed the 10-Gbps TCP single stream bandwidth limitation in EC2.

 

Enter the following command:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_NSOCKS_PERTHREAD=4 -x NCCL_SOCKET_NTHREADS=4 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

 

The following graph shows the synthetic NCCL tests and their increased performance with the additional directives.

synthetic NCCL tests and their increased performance with the additional directives

You can achieve a two-fold increase in throughput after a threshold in the synthetic payload size (around 1 MB).

 

 

Deploying P3dn

 

The following steps walk you through spinning up a cluster of p3dn.24xlarge instances in a cluster placement group. This allows you to take advantage of all the new performance features within the P3 instance family. For more information, see Cluster Placement Groups in the Amazon EC2 User Guide.
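For reference, the placement group and instances can also be created from the CLI. The AMI ID, key name, security group, and subnet below are placeholders, and the group name is an example; the primary network interface must be created with the EFA interface type so the instances get an EFA device.

# A cluster placement group keeps the nodes close together for low-latency networking
aws ec2 create-placement-group --group-name p3dn-cluster --strategy cluster

# Launch two p3dn.24xlarge nodes into the group with an EFA-enabled interface
aws ec2 run-instances \
    --image-id <ami-id> \
    --instance-type p3dn.24xlarge \
    --count 2 \
    --key-name <key-name> \
    --placement GroupName=p3dn-cluster \
    --network-interfaces "DeviceIndex=0,InterfaceType=efa,Groups=<security-group-id>,SubnetId=<subnet-id>"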

This post deploys the following stack:

 

  1. On the Amazon EC2 console, create a security group.

 Make sure that both inbound and outbound traffic are open on all ports and protocols within the security group.

 

  2. Modify the user variables in the packer build script so that the variables are compatible with your environment.

The following is the modification code for your variables:

 

{
  "variables": {
    "Region": "us-west-2",
    "flag": "compute",
    "subnet_id": "<subnet-id>",
    "sg_id": "<security_group>",
    "build_ami": "ami-0e434a58221275ed4",
    "iam_role": "<iam_role>",
    "ssh_key_name": "<keyname>",
    "key_path": "/path/to/key.pem"
  },

3. Build and Launch the AMI by running the following packer script:

packer build nvidia-efa-fsx-al2.yml

This entire workflow takes care of setting up EFA, compiling NCCL, and installing the toolchain. After building it, you have an AMI ID that you can launch in the EC2 console. Make sure to enable EFA when launching.

  4. Launch a second instance in a cluster placement group so you can run two node tests.
  5. Enter the following code to make sure that all components are built correctly:

/opt/nccl-tests/build/all_reduce_perf 

  6. The following output of the command confirms that the build is using EFA:

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

INFO: Function: main Line: 49: NET/OFI Process rank 8 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.

INFO: Function: main Line: 53: NET/OFI Received 1 network devices

INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

 

Synthetic two-node performance

This blog includes the NCCL-tests GitHub repository as part of the deployment stack. It provides synthetic benchmarking of the communication layer over NCCL and the EFA network.

When launching the two-node cluster, complete the following steps:

  1. Place the instances in the cluster placement group.
  2. SSH into one of the nodes.
  3. Fill out the hosts file.
  4. Run the two-node test with the following code:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x FI_PROVIDER="efa" -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

This test makes sure that the node performance works the way it is supposed to.

The following graph compares the NCCL bandwidth performance using -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp". There is a three-fold increase in bus bandwidth when using EFA.

 

 -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp". There is a three-fold increase in bus bandwidth when using EFA. 

Now that you have run the two node tests, you can move on to a deep learning use case.

FAIRSEQ ML training on a P3dn cluster

Fairseq(-py) is a sequence modeling toolkit that allows you to train custom models for translation, summarization, language modeling, and other text-generation tasks. FAIRSEQ MACHINE TRANSLATION distributed training requires a fast network to support the Allreduce algorithm. Fairseq provides reference implementations of various sequence-to-sequence models, including convolutional neural networks (CNN), long short-term memory (LSTM) networks, and transformer (self-attention) networks.

 

After you receive consistent 10 GB/s bus-bandwidth on the new P3dn instance, you are ready for FAIRSEQ distributed training.

To install fairseq from source and develop locally, complete the following steps:

  1. Copy the FAIRSEQ source code to one of the P3dn instances.
  2. Copy the FAIRSEQ training data into the data folder.
  3. Copy the FAIRSEQ test data into the data folder.

 

git clone https://github.com/pytorch/fairseq

cd fairseq

pip install --editable .

Now that you have FAIRSEQ installed, you can run the training model. Complete the following steps:

  1. Run FAIRSEQ training on a single-node, 8-GPU p3dn instance to check the performance and accuracy of FAIRSEQ operations.
  2. Create a custom AMI.
  3. Build the other 31 instances from the custom AMI.

 

Use the following script for distributed all-reduce FAIRSEQ training:

 

export RANK=$1 # the rank of this process, from 0 to 127 in case of 128 GPUs
export LOCAL_RANK=$2 # the local rank of this process, from 0 to 7 in case of 8 GPUs per machine
export NCCL_DEBUG=INFO
export NCCL_TREE_THRESHOLD=0;
export FI_PROVIDER="efa";

export FI_EFA_TX_MIN_CREDITS=64;
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64/:/home/ec2-user/aws-ofi-nccl/install/lib/:/home/ec2-user/nccl/build/lib:$LD_LIBRARY_PATH;
echo $FI_PROVIDER
echo $LD_LIBRARY_PATH
python train.py data-bin/wmt18_en_de_bpej32k \
   --clip-norm 0.0 -a transformer_vaswani_wmt_en_de_big \
   --lr 0.0005 --source-lang en --target-lang de \
   --label-smoothing 0.1 --upsample-primary 16 \
   --attention-dropout 0.1 --dropout 0.3 --max-tokens 3584 \
   --log-interval 100  --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy --fp16 \
   --max-update 500000 --seed 3 --save-interval-updates 16000 \
   --share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' \
   --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
   --warmup-updates 4000 --min-lr 1e-09 \
   --distributed-port 12597 --distributed-world-size 32 \
   --distributed-init-method 'tcp://172.31.43.34:9218' --distributed-rank $RANK \
   --device-id $LOCAL_RANK \
   --max-epoch 3 \
   --no-progress-bar  --no-save
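The script expects the global rank and the local rank as its two arguments. One hypothetical way to drive it (not part of the original FAIRSEQ instructions) is a small per-node wrapper that starts eight processes, one per GPU; here the script above is assumed to be saved as train_fairseq.sh, and NODE_INDEX is set differently on each instance:

#!/bin/bash
# Hypothetical per-node launcher: start 8 training processes, one per GPU.
NODE_INDEX=0                         # 0..3 across a 4-node, 32-GPU cluster
for LOCAL_RANK in $(seq 0 7); do
    RANK=$((NODE_INDEX * 8 + LOCAL_RANK))
    bash train_fairseq.sh "$RANK" "$LOCAL_RANK" &
done
wait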

Now that you have completed and validated your base infrastructure layer, you can add additional components to the stack for various workflows. The following charts show time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training.

time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training

 

Conclusion

EFA on p3dn.24xlarge allows you to take advantage of additional performance at scale with no change in code. With this updated infrastructure, you can decrease cost and time to results by using more GPUs to scale out and get more done on complex workloads like natural language processing. This blog provides much of the undifferentiated heavy lifting with the DLAMI integrated with EFA. Go power up your ML workloads with EFA!

 

Optimizing for cost, availability and throughput by selecting your AWS Batch allocation strategy

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/optimizing-for-cost-availability-and-throughput-by-selecting-your-aws-batch-allocation-strategy/

This post is contributed by Steve Kendrex, Senior Technical Product Manager, AWS Batch

 

Introduction

 

AWS offers a broad range of instances that are advantageous for batch workloads. The scale and provisioning speed of AWS’ compute instances allow you to get up and running at peak capacity in minutes without paying for downtime. Today, I’m pleased to introduce allocation strategies: a significant new capability in AWS Batch that  makes provisioning compute resources flexible and simple. In this blog post, I explain how the AWS Batch allocation strategies work, when you should use them for your workload, and provide an example CloudFormation script. This blog helps you get started on building your personalized Compute Environment (CE) most appropriate to your workloads.

Overview

AWS Batch is a fully managed, cloud-native batch scheduler. It manages the queuing and scheduling of your batch jobs, and the resources required to run your jobs. One of AWS Batch’s great strengths is the ability to manage instance provisioning as your workload requirements and budget needs change. AWS Batch takes advantage of AWS’s broad base of compute types. For example, you can launch compute-optimized and memory-optimized instances to handle different workload types, without having to worry about building a cluster to meet peak demand.

Previously, AWS Batch had a cost-controlling approach to manage compute instances for your workloads. The service chose an instance that was the best fit for your jobs based on vCPU, memory, and GPU requirements, at the lowest cost. Now, the newly added allocation strategies provide flexibility. They allow AWS Batch to consider capacity and throughput in addition to cost when provisioning your instances. This allows you to leverage different priorities when launching instances depending on your workloads’ needs, such as: controlling cost, maximizing throughput, or minimizing Amazon EC2 Spot instances interruption rates.

There are now three instance allocation strategies from which to choose when creating an AWS Batch Compute Environment (CE). They are:

1. Spot Capacity Optimized
2. Best Fit Progressive
3. Best Fit

 

Spot Capacity Optimized

As the name implies, the Spot capacity optimized strategy is only available when launching Spot CEs in AWS Batch. In fact, I recommend the Spot capacity optimized strategy for most of your interruptible workloads running on Spot today. This strategy takes advantage of the recently released EC2 Auto Scaling and EC2 Fleet capacity optimized strategy. Next, I examine how this strategy behaves in AWS Batch.

Let’s say you’re running a simulation workload in AWS Batch. Your workload is Spot-appropriate (see this whitepaper to determine whether it is), so you want to take advantage of the savings you can glean from using Spot. However, you also want to minimize your Spot interruption rate, so you’ve followed the Spot best practices. Your instances can run across multiple instance types and multiple Availability Zones. When creating your Spot CE in AWS Batch, input all the instance types with which your workload is compatible in the instance field, or select ‘optimal’, which allows AWS Batch to choose from among the M, C, or R instance families. The image below shows how this appears in the console:

AWS Batch console with SPOT_CAPACITY_OPTIMIZED selected

When evaluating your workload, AWS Batch selects from the allowed instance types specified in the ‘compute resources’ parameter of your Spot CE that are capable of running your jobs. From those candidates, AWS Batch calculates the assortment of instance types with the deepest Spot capacity pools. AWS Batch then launches those instances on your behalf, and runs your jobs when those instances are available. This strategy gives you access to all AWS compute resources at a fraction of On-Demand cost. The Spot capacity optimized strategy works whether you’re trying to launch hundreds of thousands (or a million!) of vCPUs in Spot, or simply trying to lower your chance of interruption. Additionally, AWS Batch manages your instance pool to meet the capacity needed to run your workload as time passes.

For example, as your workloads run, demand in an Availability Zone may shift. This might lead to several of your instances being reclaimed. In that event, AWS Batch automatically attempts to scale a different instance type based on the deepest capacity pools. Assuming you set a retry attempt count, your jobs then automatically retry. Then, AWS Batch scales new instances until either it meets the desired capacity, or it runs out of instance types to launch based on those provided.  That is why I recommend that you give AWS Batch as many instance types and families as possible to choose from when running Spot capacity optimized. Additional detail on behavior can be found in the capacity optimized documentation.
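For example, the retry count can be set per job at submission time. The following CLI call is a hypothetical illustration (the job name, queue, and job definition names are placeholders), not part of this post’s sample code:

aws batch submit-job \
    --job-name my-simulation \
    --job-queue my-spot-queue \
    --job-definition my-simulation-jobdef \
    --retry-strategy attempts=3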

To launch a Spot capacity optimized CE, follow these steps:

1. Navigate to the console.

2. Create a new Compute Environment.

3. Select “Spot Capacity Optimized” in the Allocation Strategy field.

4. Alternatively, you can use the CreateComputeEnvironment API and pass “SPOT_CAPACITY_OPTIMIZED” in the allocation strategy field. The resulting compute environment definition should look like the following:

…
"TestAllocationStrategyCE": { 
"Type": "AWS::Batch::ComputeEnvironment",
 "Properties": { 
"State": "ENABLED", 
"Type": "MANAGED", 
"ComputeResources": { 
"Subnets": [
 {"Ref": "TestSubnet"}
 ], 
"InstanceRole": {
 "Ref": "TestIamInstanceProfile" 
},
 "MinvCpus": 0, 
" InstanceTypes": [ 
"optimal"
 ],
 "SecurityGroupIds": [
 	{"Ref": "TestSecurityGroup"} 
],
 "DesiredvCpus": 0, 
"MaxvCpus": 12, 
"AllocationStrategy": "SPOT_CAPACITY_OPTIMIZED", 
"Type": "EC2" },
 "ServiceRole": { 
"Ref": "TestAWSBatchServiceRole" 
}
 }
 },
…

Once you follow these steps your Spot capacity optimized CE should be up and running.

 

Best Fit Progressive

Imagine you have a time-sensitive machine learning workload that is very compute intensive. You want to run this workload on C5 instances because you know that those have a preferable vCPU/memory ratio for your jobs. In a pinch, however, you know that M5 instances can run your workload perfectly well. You’re happy to take advantage of Spot prices. However, you also have a base level of throughput you need so you have to run part of the workload on On-Demand instances.  In this case, I recommend the best fit progressive strategy. This strategy is available in both On-Demand and Spot CEs, and I recommend it for most On-Demand workloads. The best fit progressive strategy allows you to let AWS Batch choose the best fit instance for your workload (based on your jobs’ vCPU and memory requirements). In this context, “best fit” means AWS Batch provisions the least number of instances capable of running your jobs at the lowest cost.

Sometimes, AWS Batch cannot provision enough of the best fit instances to meet your capacity. When this is the case, AWS Batch progressively looks for the next best fit instance type among those you specified in the ‘compute resources’ parameter. Generally, AWS Batch attempts to spin up different instance sizes within the same family first, because it has already determined the vCPU-to-memory ratio that fits your workload. If it still cannot find enough instances that can run your jobs to meet your capacity, AWS Batch launches instances from a different family. These attempts run until capacity is met, or until it runs out of available instances from which to select.

To create a best fit progressive CE, follow the steps detailed in the Spot capacity optimized strategy section. However, specify the strategy BEST_FIT_PROGRESSIVE when creating a CE, for example:


…{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT_PROGRESSIVE",
"Type": "EC2"
},
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important note: you can always restrict AWS Batch’s ability to launch instances by using the max vCPU setting in your CE. AWS Batch may go above Max vCPU to meet your capacity requirements for best fit progressive and Spot capacity optimized strategies. In this event, AWS Batch will never go above Max vCPU by more than a single instance (for example, no more than a single instance from among those specified in your CE compute resources parameter).

 

How to Combine Strategies

You can combine strategies using separate AWS Batch Compute Environments. Let’s take the case I mentioned earlier: you’re happy to take advantage of Spot prices, but you want a base level of throughput for your time-sensitive workloads.

This diagram shows an On-Demand CE with a secondary Spot CE, attached to the same queue

 

In this case, you can create two AWS Batch CEs:

1.       Create an On-Demand CE that uses the best fit progressive strategy.

2.       Set the max vCPU at the level of throughput that is necessary for your workload.

3.       Create a Spot CE using the Spot capacity optimized strategy.

4.        Attach both CEs to your job queue, with the On-Demand CE higher in order. Once you start submitting jobs to your queue, AWS Batch spins up your On-Demand CE first and starts placing jobs.

If the On-Demand CE reaches its max vCPU limit, AWS Batch spins up instances in the next CE. In this case, the next CE is your Spot CE, and AWS Batch places any additional jobs on it. AWS Batch continues to place jobs on both CEs until the queue is empty.
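For reference, a minimal sketch of the job queue portion of such a CloudFormation template might look like the following; the TestOnDemandCE and TestSpotCE resource names are placeholders for the two compute environments described above, and the repository linked in the next paragraph contains a complete, working version:

…
"TestJobQueue": {
  "Type": "AWS::Batch::JobQueue",
  "Properties": {
    "State": "ENABLED",
    "Priority": 1,
    "ComputeEnvironmentOrder": [
      {"Order": 1, "ComputeEnvironment": {"Ref": "TestOnDemandCE"}},
      {"Order": 2, "ComputeEnvironment": {"Ref": "TestSpotCE"}}
    ]
  }
},
…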

Please see this repository for sample CloudFormation code to replicate this environment. Or, click here for more examples of leveraging Spot with AWS Batch.

 

Best Fit

Imagine you have a well-defined genomics sequencing workload. You know that this workload performs best on M5 instances, and you run this workload On-Demand because it is not interruptible. You’ve run this workload on AWS Batch before and you’re happy with its current behavior. You’re willing to trade off occasional capacity constraints in return for strict cost control. In this case, the best fit strategy may be a good option. This strategy used to be AWS Batch’s only behavior. It examines the queue and picks the best fit instance type and size for the workload. As described earlier, best fit to AWS Batch means the least number of instances capable of running the workload, at the lowest cost. In general, we recommend the best fit strategy only when you want the lowest cost for your instances, and you’re willing to trade throughput and availability for that lower cost.

Note: AWS Batch will not launch instances above Max vCPU while using the best fit strategy. To launch a best fit CE, you can launch it similar to the following:

…{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT",
"Type": "EC2"
},
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important Note for AWS Batch Allocation Strategies with Spot Instances:

You always have the option to set a percentage of On-Demand price when creating a Spot CE. When you set a percentage of the On-Demand price, AWS Batch only launches Spot Instances whose current Spot price is below that percentage of the On-Demand price. In general, setting a percentage of the On-Demand price lowers your availability, and should only be used if you want strict cost controls. If you want to enjoy the cost savings of Spot with better availability, I recommend that you do not set a percentage of the On-Demand price.

Conclusion

With these new allocation strategies, you now have much greater flexibility to control how AWS Batch provisions your instances. This allows you to make better throughput and cost trade-offs depending on the sensitivity of your workload. To learn more about how these strategies behave, please visit the AWS Batch documentation. Feel free to experiment with AWS Batch on your own to get an idea of how they help you run your specific workload.

 

Thanks to Chad Schmutzer for his support on the CloudFormation template.

Deploying a highly available WordPress site on Amazon Lightsail, Part 1: Implementing a highly available Lightsail database with WordPress

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/deploying-a-highly-available-wordpress-site-on-amazon-lightsail-part-1-implementing-a-highly-available-lightsail-database-with-wordpress/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail

This post walks you through what to consider when architecting a scalable, redundant WordPress site. It discusses how WordPress stores such elements as user accounts, posts, settings, media, and themes, and how to configure WordPress to work with a standalone database.

This walkthrough deploys a WordPress site on Amazon Lightsail. Lightsail is the easiest way to get started on AWS, and it might be the easiest (and least expensive) way to get started on WordPress. You can launch a new WordPress site in a few clicks with one of Lightsail’s blueprints, for a few dollars a month. This gives you a single Lightsail instance to host your WordPress site that’s perfect for a small personal blog.

However, you may need a more resilient site capable of scaling to meet increased demand and architected to provide a degree of redundancy. If you’re a novice cloud user, the idea of setting up a highly available WordPress implementation might seem daunting. But with Lightsail and other AWS services, it doesn’t need to be.

Subsequent posts in this series cover managing media files in a multi-server environment, using CloudFront to increase site security and performance, and scaling the WordPress front-end with a Lightsail load balancer.

What’s under the hood?

Even though you’re a WordPress user, you may not have thought about how WordPress is built. However, if you’re moving into managing your WordPress site, it’s essential to understand what’s under the hood. As a content management system (CMS), WordPress provides a lot of functionality; this post focuses on some of the more essential features in the context of how they relate to architecting your highly available WordPress site.

WordPress manages a variety of different data. There are user accounts, posts, media (such as images and videos), themes (code that customizes the look and feel of a given WordPress site), plugins (code that adds additional functionality to your site), and configuration settings.

Where WordPress stores your data varies depending on the type of data. At the most basic level, WordPress is a PHP application running on a web server and database. The web server is the instance that you create in Lightsail, and includes WordPress software and the MySQL database. The following diagram shows the Lightsail VPC architecture.

Lightsail's VPC Architecture

The database stores a big chunk of the data that WordPress needs; for example, all of the user account information and blog posts. However, the web server’s file system stores another portion of data; for example, a new image you upload to your WordPress server. Finally, with themes and plugins, both the database and the file system store information. For example, the database holds information on what plugins and which theme is currently active, but the file system stores the actual code for the themes and plugins.

To provide a highly available WordPress implementation, you need to provide redundancy not only for the database, but also the content that may live on the file system.

Prerequisites

This solution has the following prerequisites:

This post and the subsequent posts deal with setting up a new WordPress site. If you have an existing site, the processes to follow are similar, but you should consult the documentation for both Lightsail and WordPress. Also, be sure to back up your existing database as well as snapshot your existing WordPress instance.

Configuring the database

The standalone MySQL database you created is not yet configured to work with WordPress. You need to create the actual database and define the tables that WordPress needs. The easiest method is to export the table from the database on your WordPress instance and import it into your standalone MySQL database. To do so, complete the following steps:

  1. Connect to your WordPress instance by using your SSH client or the web-based SSH client in the Lightsail console. The screenshot below highlights the icon to click.

WordPress icon for connecting through SSH

  1. From the terminal prompt for your WordPress instance, set two environment variables (LSDB_USERNAME and LSDB_ENDPOINT) that contain the connection information for your standalone database.

You can find that information on the database’s management page from the Lightsail console. See the following screenshot of the Connection details page.

the Connection details page

  1. To set the environment variables, substitute the values for your instance into the following code example and input each line one at a time at the terminal prompt:

LSDB_USERNAME=UserName

LSDB_ENDPOINT=Endpoint

For example, your input should look similar to the following code:

LSDB_USERNAME=dbmasteruser

LSDB_ENDPOINT=ls.rds.amazonaws.com

  1. Retrieve the Bitnami application password for the database running on your WordPress instance.

This password is stored at /home/bitnami/bitnami_application_password.

Enter the following cat command in the terminal to display the value:

cat /home/bitnami/bitnami_application_password

  1. Copy the displayed password into a text document.

You need this password in the following steps.

  1. Enter the following command into the terminal window:

mysqldump \
  -u root \
  --databases bitnami_wordpress \
  --single-transaction \
  --order-by-primary \
  -p > dump.sql

This command creates a file (dump.sql) that defines the database and all the needed tables.

  1. When prompted for a password, enter the Bitnami application password you recorded previously.

The terminal window doesn’t show the password as you enter it.

Now that you have the right database structure exported, import that into your standalone database. You’ll do this by entering the contents of your dump file into the mysql command line.

  1. Enter the following command at the terminal prompt:

cat dump.sql | mysql \
  --user $LSDB_USERNAME \
  --host $LSDB_ENDPOINT \
  -p

  1. When prompted for a password, enter the password for your Lightsail database.

The terminal window doesn’t show the password as you enter it.

  1. Enter the following mysql command in the instance terminal:

echo 'use bitnami_wordpress; show tables' | \
  mysql \
  --user $LSDB_USERNAME \
  --host $LSDB_ENDPOINT \
  -p

This command shows the structure of the WordPress database, and verifies that you created the database on your Lightsail instance successfully.

  1. When prompted for a password, enter the password for your standalone database.

You should receive the following output:

Tables_in_bitnami_wordpress
wp_commentmeta
wp_comments
wp_links
wp_options
wp_postmeta
wp_posts
wp_term_relationships
wp_term_taxonomy
wp_termmeta
wp_terms
wp_usermeta
wp_users

This test confirms that your standalone database is ready for you to use with your WordPress instance.

Configuring WordPress

Now that you have the standalone database configured, modify the WordPress configuration file (wp-config.php) to direct the WordPress instance to use the standalone database instead of the database on the instance.

The first step is to back up your existing configuration file. If you run into trouble, copy wp-config.php.bak over to wp-config.php to roll back any changes.

  1. Enter the following code:

cp /home/bitnami/apps/wordpress/htdocs/wp-config.php /home/bitnami/apps/wordpress/htdocs/wp-config.php.bak

You are using wp-cli to modify the wp-config file.

  1. Swap out the following values with those for your Lightsail database:

wp config set DB_USER UserName

wp config set DB_PASSWORD Password

wp config set DB_HOST Endpoint

The following screenshot shows the example database values.

example database values

For example:

wp config set DB_USER dbmasteruser

wp config set DB_PASSWORD 'MySecurePassword!2019'

wp config set DB_HOST ls.rds.amazonaws.com

To avoid issues with any special characters the password may contain, make sure to wrap the password value in single quotes (‘).

  1. Enter the following command:

wp config list

The output should match the values on the database’s management page in the Lightsail console. This confirms that the changes went through.

  1. Restart WordPress by entering the following command:

sudo /opt/bitnami/ctlscript.sh restart

Conclusion

This post covered a lot of ground. Hopefully it educated and inspired you to sign up for a free AWS account and start building out a robust WordPress site. Subsequent posts in this series show how to deal with your uploaded media and scale the web front end with a Lightsail load balancer.

Now Available: Bare Metal Arm-Based EC2 Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-bare-metal-arm-based-ec2-instances/

At AWS re:Invent 2018, we announced a new line of Amazon Elastic Compute Cloud (EC2) instances: the A1 family, powered by Arm-based AWS Graviton processors. This family is a great fit for scale-out workloads e.g. web front-ends, containerized microservices or caching fleets. By expanding the choice of compute options, A1 instances help customers use the right instances for the right applications, and deliver up to 45% cost savings. In addition, A1 instances enable Arm developers to build and test natively on Arm-based infrastructure in the cloud: no more cross compilation or emulation required.

Today, we are happy to expand the A1 family with a bare metal option.

Bare Metal for A1

Instance Name | Logical Processors | Memory | EBS-Optimized Bandwidth | Network Bandwidth
a1.metal | 16 | 32 GiB | 3.5 Gbps | Up to 10 Gbps

Just like for existing bare metal instances (M5, M5d, R5, R5d, z1d, and so forth), your operating system runs directly on the underlying hardware with direct access to the processor.

As described in a previous blog post, you can leverage bare metal instances for applications that:

  • need access to physical resources and low-level hardware features, such as performance counters, that are not always available or fully supported in virtualized environments,
  • are intended to run directly on the hardware, or licensed and supported for use in non-virtualized environments.

Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

Working with A1 Instances
Bare metal or not, it’s never been easier to work with A1 instances. Initially launched in four AWS regions, they’re now available in four additional regions: Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney).

From a software perspective, you can run Amazon Machine Images for popular Linux distributions like Ubuntu, Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Debian, and of course Amazon Linux 2 on A1 instances. Applications such as the Apache HTTP Server and NGINX Plus are available too. So are all major programming languages and run-times including PHP, Python, Perl, Golang, Ruby, NodeJS and multiple flavors of Java including Amazon Corretto, a supported open source OpenJDK implementation.

What about containers? Good news here as well! Amazon ECS and Amazon EKS both support A1 instances. Docker has announced support for Arm-based architectures in Docker Enterprise Edition, and most Docker official images support Arm. In addition, millions of developers can now use Arm emulation to build, run and test containers on their desktop machine before moving them to production.
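For example, assuming a recent Docker installation with buildx and QEMU emulation enabled (an assumption on my part, not a requirement stated in this announcement), you could build and smoke-test an arm64 image on an x86 desktop before deploying it to an A1 instance; the image tag below is a placeholder:

# Build an arm64 image of the current project on an x86 machine
docker buildx build --platform linux/arm64 -t myuser/myapp:arm64 --load .
# Smoke-test it under emulation before moving it to an A1 instance
docker run --rm --platform linux/arm64 myuser/myapp:arm64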

As you would expect, A1 instances are seamlessly integrated with many AWS services, such as Amazon EBS, Amazon CloudWatch, Amazon Inspector, AWS Systems Manager and AWS Batch.

Now Available!
You can start using a1.metal instances today in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney). As always, we appreciate your feedback, so please don’t hesitate to get in touch via the AWS Compute Forum, or through your usual AWS support contacts.

Julien;

New M5n and R5n EC2 Instances, with up to 100 Gbps Networking

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/new-m5n-and-r5n-instances-with-up-to-100-gbps-networking/

AWS customers build ever-demanding applications on Amazon EC2. To support them the best we can, we listen to their requirements, go to work, and come up with new capabilities. For instance, in 2018, we upgraded the networking capabilities of Amazon EC2 C5 instances, with up to 100 Gbps networking, and significant improvements in packet processing performance. These are made possible by our new virtualization technology, aka the AWS Nitro System, and by the Elastic Fabric Adapter which enables low latency on 100 Gbps networking platforms.

In order to extend these benefits to the widest range of workloads, we’re happy to announce that these same networking capabilities are available today for both Amazon EC2 M5 and R5 instances.

Introducing Amazon EC2 M5n and M5dn instances
Since the very early days of Amazon EC2, the M family has been a popular choice for general-purpose workloads. The new M5(d)n instances uphold this tradition, and are a great fit for databases, High Performance Computing, analytics, and caching fleets that can take advantage of improved network throughput and packet rate performance.

The chart below lists out the new instances and their specs: each M5(d) instance size now has an M5(d)n counterpart, which supports the upgraded networking capabilities discussed above. For example, whereas the regular m5(d).8xlarge instance has a respectable network bandwidth of 10 Gbps, its m5(d)n.8xlarge sibling goes to 25 Gbps. The top of the line m5(d)n.24xlarge instance even hits 100 Gbps.

Here are the specs:

Instance Name | Logical Processors | Memory | Local Storage (m5dn only) | EBS-Optimized Bandwidth | Network Bandwidth
m5n.large / m5dn.large | 2 | 8 GiB | 1 x 75 GB NVMe SSD | Up to 3.5 Gbps | Up to 25 Gbps
m5n.xlarge / m5dn.xlarge | 4 | 16 GiB | 1 x 150 GB NVMe SSD | Up to 3.5 Gbps | Up to 25 Gbps
m5n.2xlarge / m5dn.2xlarge | 8 | 32 GiB | 1 x 300 GB NVMe SSD | Up to 3.5 Gbps | Up to 25 Gbps
m5n.4xlarge / m5dn.4xlarge | 16 | 64 GiB | 2 x 300 GB NVMe SSD | 3.5 Gbps | Up to 25 Gbps
m5n.8xlarge / m5dn.8xlarge | 32 | 128 GiB | 2 x 600 GB NVMe SSD | 5 Gbps | 25 Gbps
m5n.12xlarge / m5dn.12xlarge | 48 | 192 GiB | 2 x 900 GB NVMe SSD | 7 Gbps | 50 Gbps
m5n.16xlarge / m5dn.16xlarge | 64 | 256 GiB | 4 x 600 GB NVMe SSD | 10 Gbps | 75 Gbps
m5n.24xlarge / m5dn.24xlarge | 96 | 384 GiB | 4 x 900 GB NVMe SSD | 14 Gbps | 100 Gbps
m5n.metal / m5dn.metal | 96 | 384 GiB | 4 x 900 GB NVMe SSD | 14 Gbps | 100 Gbps

Introducing Amazon EC2 R5n and R5dn instances
The R5 family is ideally suited for memory-hungry workloads, such as high performance databases, distributed web scale in-memory caches, mid-size in-memory databases, real time big data analytics, and other enterprise applications.

The logic here is exactly the same: each R5(d) instance size has an R5(d)n counterpart. Here are the specs:

Instance Name | Logical Processors | Memory | Local Storage (r5dn only) | EBS-Optimized Bandwidth | Network Bandwidth
r5n.large / r5dn.large | 2 | 16 GiB | 1 x 75 GB NVMe SSD | Up to 3.5 Gbps | Up to 25 Gbps
r5n.xlarge / r5dn.xlarge | 4 | 32 GiB | 1 x 150 GB NVMe SSD | Up to 3.5 Gbps | Up to 25 Gbps
r5n.2xlarge / r5dn.2xlarge | 8 | 64 GiB | 1 x 300 GB NVMe SSD | Up to 3.5 Gbps | Up to 25 Gbps
r5n.4xlarge / r5dn.4xlarge | 16 | 128 GiB | 2 x 300 GB NVMe SSD | 3.5 Gbps | Up to 25 Gbps
r5n.8xlarge / r5dn.8xlarge | 32 | 256 GiB | 2 x 600 GB NVMe SSD | 5 Gbps | 25 Gbps
r5n.12xlarge / r5dn.12xlarge | 48 | 384 GiB | 2 x 900 GB NVMe SSD | 7 Gbps | 50 Gbps
r5n.16xlarge / r5dn.16xlarge | 64 | 512 GiB | 4 x 600 GB NVMe SSD | 10 Gbps | 75 Gbps
r5n.24xlarge / r5dn.24xlarge | 96 | 768 GiB | 4 x 900 GB NVMe SSD | 14 Gbps | 100 Gbps
r5n.metal / r5dn.metal | 96 | 768 GiB | 4 x 900 GB NVMe SSD | 14 Gbps | 100 Gbps

These new M5(d)n and R5(d)n instances are powered by custom second generation Intel Xeon Scalable Processors (based on the Cascade Lake architecture) with sustained all-core turbo frequency of 3.1 GHz and maximum single core turbo frequency of 3.5 GHz. Cascade Lake processors enable new Intel Vector Neural Network Instructions (AVX-512 VNNI) which will help speed up typical machine learning operations like convolution, and automatically improve inference performance over a wide range of deep learning workloads.

Now Available!
You can start using the M5(d)n and R5(d)n instances today in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Singapore).

We hope that these new instances will help you tame your network-hungry workloads! Please send us feedback, either on the AWS Forum for Amazon EC2, or through your usual support contacts.

Julien;

Building a pocket platform-as-a-service with Amazon Lightsail

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/building-a-pocket-platform-as-a-service-with-amazon-lightsail/

This post was written by Robert Zhu, a principal technical evangelist at AWS and a member of the GraphQL Working Group. 

When you start a new web-based project, you figure out what kind of infrastructure you need. For my projects, I prioritize simplicity, flexibility, value, and on-demand capacity, and find myself quickly needing the following features:

  • DNS configuration
  • SSL support
  • Subdomain to a service
  • SSL reverse proxy to localhost (similar to ngrok and serveo)
  • Automatic deployment after a commit to the source repo (nice to have)

new projects have different requirements compared to mature projects

Amazon Lightsail is perfect for building a simple “pocket platform” that provides all these features. It’s cheap and easy for beginners and provides a friendly interface for managing virtual machines and DNS. This post shows step-by-step how to assemble a pocket platform on Amazon Lightsail.

Walkthrough

The following steps describe the process. If you prefer to learn by watching videos instead, view the steps by watching the following: Part 1, Part 2, Part 3.

Prerequisites

You should be familiar with: Linux, SSH, SSL, Docker, Nginx, HTTP, and DNS.

Steps

Use the following steps to assemble a pocket platform.

Creating a domain name and static IP

First, you need a domain name for your project. You can register your domain with any domain name registration service, such as Amazon Route53.

  1. After your domain is registered, open the Lightsail console, choose the Networking tab, and choose Create static IP.

Lightsail console networking tab

  1. On the Create static IP page, give the static IP a name you can remember and don’t worry about attaching it to an instance just yet. Choose Create DNS zone.

 

  1. On the Create a DNS zone page, enter your domain name and then choose Create DNS zone. For this post, I use the domain “raccoon.news.”

DNS zone in Lightsail with two A records
  2. Choose Add Record and create two A records—“@.raccoon.news” and “raccoon.news”—both resolving to the static IP address you created earlier. Then, copy the values for the Lightsail name servers at the bottom of the page. Go back to your domain name provider, and edit the name servers to point to the Lightsail name servers. Since I registered my domain with Route53, here’s what it looks like:

Changing name servers in Route53

Note: If you registered your domain with Route53, make sure you change the name server values under “domain registration,” not “hosted zones.” If you registered your domain with Route53, you need to delete the hosted zone that Route53 automatically creates for your domain.

Setting up your Lightsail instance

While you wait for your DNS changes to propagate, set up your Lightsail instance.

  1. In the Lightsail console, create a new instance and select Ubuntu 18.04.

Choose OS Only and Ubuntu 18.04 LTS

For this post, you can use the cheapest instance. However, when you run anything in production, make sure you choose an instance with enough capacity for your workload.

  1. After the instance launches, select it, then click on the Networking tab and open two additional TCP ports: 443 and 2222. Then, attach the static IP allocated earlier.
  2. To connect to the Lightsail instance using SSH, download the SSH key, and save it to a friendly path, for example: ~/ls_ssh_key.pem.

Click the download link to download the SSH key

  • Restrict permissions for your SSH key:

chmod 400 ~/ls_ssh_key.pem

  • Connect to the instance using SSH:

ssh -i ~/ls_ssh_key.pem ubuntu@STATIC_IP

  1. After you connect to the instance, install Docker to help manage deployment and configuration:

sudo apt-get update && sudo apt-get install docker.io
sudo systemctl start docker
sudo systemctl enable docker
docker run hello-world

  1. After Docker is installed, set up a gateway using the nginx-proxy container. This container lets you route traffic to other containers by providing the “VIRTUAL_HOST” environment variable. Conveniently, nginx-proxy comes with an SSL companion, nginx-proxy-letsencrypt, which uses Let’s Encrypt.

# start the reverse proxy container
sudo docker run --detach \
    --name nginx-proxy \
    --publish 80:80 \
    --publish 443:443 \
    --volume /etc/nginx/certs \
    --volume /etc/nginx/vhost.d \
    --volume /usr/share/nginx/html \
    --volume /var/run/docker.sock:/tmp/docker.sock:ro \
    jwilder/nginx-proxy

# start the letsencrypt companion
sudo docker run --detach \
    --name nginx-proxy-letsencrypt \
    --volumes-from nginx-proxy \
    --volume /var/run/docker.sock:/var/run/docker.sock:ro \
    --env "DEFAULT_EMAIL=YOUREMAILHERE" \
    jrcs/letsencrypt-nginx-proxy-companion

# start a demo web server under a subdomain
sudo docker run --detach \
    --name nginx \
    --env "VIRTUAL_HOST=test.EXAMPLE.COM" \
    --env "LETSENCRYPT_HOST=test.EXAMPLE.COM" \
    nginx

Pay special attention to setting a valid email for the DEFAULT_EMAIL environment variable on the proxy companion; otherwise, you’ll need to specify the email whenever you start a new container. If everything went well, you should be able to navigate to https://test.EXAMPLE.COM and see the nginx default content with a valid SSL certificate that has been auto-generated by Let’s Encrypt.

A publicly accessible URL served from our Lightsail instance with SSL support


Deploying a localhost proxy with SSL

Most developers prefer to code on a dev machine (laptop or desktop) because they can access the file system, use their favorite IDE, recompile, debug, and more. Unfortunately, developing on a dev machine can introduce bugs due to differences from the production environment. Also, certain services (for example, Alexa Skills or GitHub Webhooks) require SSL to work, which can be annoying to configure on your local machine.

For this post, you can use an SSL reverse proxy to make your local dev environment resemble production from the browser’s point of view. This technique also helps allow your test application to make API requests to production endpoints with Cross-Origin Resource Sharing restrictions. While it’s not a perfect solution, it takes you one step closer toward a frictionless dev/test feedback loop. You may have used services like ngrok and serveo for this purpose. By running a reverse proxy, you won’t need to spread your domain and SSL settings across multiple services.

To run a reverse proxy, create an SSH reverse tunnel. After the reverse tunnel SSH session is initiated, all network requests to the specified port on the host are proxied to your dev machine. However, since your Lightsail instance is already using port 22 for VPS management, you need a different SSH port (use 2222 from earlier). To keep everything organized, run the SSH server for port 2222 inside a special proxy container. The following diagram shows this solution.

Diagram of how an SSL reverse proxy works with SSH tunneling

Using Dockerize an SSH service as a starting point, I created a repository with a working Dockerfile and nginx config for reference. Here are the summary steps:

git clone https://github.com/robzhu/nginx-local-tunnel
cd nginx-local-tunnel

docker build -t {DOCKERUSER}/dev-proxy . --build-arg ROOTPW={PASSWORD}

# start the proxy container
# Note, 2222 is the port we opened on the instance earlier.
docker run --detach -p 2222:22 \
    --name dev-proxy \
    --env "VIRTUAL_HOST=dev.EXAMPLE.com" \
    --env "LETSENCRYPT_HOST=dev.EXAMPLE.com" \
    {DOCKERUSER}/dev-proxy

# Ports explained:
# 3000 refers to the port that your app is running on localhost.
# 2222 is the forwarded port on the host that we use to directly SSH into the container.
# 80 is the default HTTP port, forwarded from the host
ssh -R :80:localhost:3000 -p 2222 root@dev.EXAMPLE.com

# Start sample app on localhost
cd node-hello && npm i
nodemon main.js

# Point browser to https://dev.EXAMPLE.com

The reverse proxy subdomain works only as long as the reverse proxy SSH connection remains open. If there is no SSH connection, you should see an nginx gateway error:

Nginx will return 502 if you try to access the reverse proxy without running the SSH tunnel

While this solution is handy, be extremely careful, as it could expose your work-in-progress to the internet. Consider adding additional authorization logic and settings for allowing/denying specific IPs.
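One hypothetical way to do that with this setup (not covered in the original walkthrough) is to drop a per-host allow/deny snippet into the vhost.d volume that nginx-proxy already mounts, then reload nginx; replace the CIDR block with your own IP range:

# Restrict the dev subdomain to a trusted IP range (placeholder CIDR shown)
docker exec nginx-proxy sh -c 'cat > /etc/nginx/vhost.d/dev.EXAMPLE.com <<EOF
allow 203.0.113.0/24;
deny all;
EOF'
# Reload nginx inside the proxy container so the new rules take effect
docker exec nginx-proxy nginx -s reload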

Setting up automatic deployment

Finally, build an automation workflow that watches for commits on a source repository, builds an updated container image, and re-deploys the container on your host. There are many ways to do this, but here’s the combination I’ve selected for simplicity:

  1. First, create a GitHub repository to host your application source code. For demo purposes, you can clone my express hello-world example. On the Docker Hub page, create a new repository, click the GitHub icon, and select your repository from the dropdown list.

Create GitHub repo to host your application source code

  2. Now Docker Hub watches for commits to the repo and builds a new image with the “latest” tag in response. After the image is available, start the container as follows:

docker run --detach \
    --name app \
    --env "VIRTUAL_HOST=app.raccoon.news" \
    --env "LETSENCRYPT_HOST=app.raccoon.news" \
    robzhu/express-hello

  1. Finally, use Watchtower to poll dockerhub and update the “app” container whenever a new image is detected:

docker run -d \
    --name watchtower \
    -v /var/run/docker.sock:/var/run/docker.sock \
    containrrr/watchtower \
    --interval 10 \
    APPCONTAINERNAME

 

Summary

Your Pocket PaaS is now complete! As long as you deploy new containers and add the VIRTUAL_HOST and LETSENCRYPT_HOST environment variables, you get automatic subdomain routing and SSL termination. With SSH reverse tunneling, you can develop on your local dev machine using your favorite IDE and test/share your app at https://dev.EXAMPLE.COM.

Because this is a public URL with SSL, you can test Alexa Skills, GitHub Webhooks, CORS settings, PWAs, and anything else that requires SSL. Once you’re happy with your changes, a git commit triggers an automated rebuild of your Docker image, which is automatically redeployed by Watchtower.

I hope this information was useful. Thoughts? Leave a comment or direct-message me on Twitter: @rbzhu.

Now available in Amazon SageMaker: EC2 P3dn GPU Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-in-amazon-sagemaker-ec2-p3dn-gpu-instances/

In recent years, the meteoric rise of deep learning has made incredible applications possible, such as detecting skin cancer (SkinVision) and building autonomous vehicles (TuSimple). Thanks to neural networks, deep learning indeed has the uncanny ability to extract and model intricate patterns from vast amounts of unstructured data (e.g. images, video, and free-form text).

However, training these neural networks requires equally vast amounts of computing power. Graphics Processing Units (GPUs) have long proven that they were up to that task, and AWS customers have quickly understood how they could use Amazon Elastic Compute Cloud (EC2) P2 and P3 instances to train their models, in particular on Amazon SageMaker, our fully-managed, modular, machine learning service.

Today, I’m very happy to announce that the largest P3 instance, named p3dn.24xlarge, is now available for model training on Amazon SageMaker. Launched last year, this instance is designed to accelerate large, complex, distributed training jobs: it has twice as much GPU memory as other P3 instances, 50% more vCPUs, blazing-fast local NVMe storage, and 100 Gbit networking.

How about we give it a try on Amazon SageMaker?

Introducing EC2 P3dn instances on Amazon SageMaker
Let’s start from this notebook, which uses the built-in image classification algorithm to train a model on the Caltech-256 dataset. All I have to do to use a p3dn.24xlarge instance on Amazon SageMaker is to set train_instance_type to 'ml.p3dn.24xlarge', and train!

ic = sagemaker.estimator.Estimator(training_image,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.p3dn.24xlarge',
                                         input_mode='File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)
...
ic.fit(...)

I ran some quick tests on this notebook, and I got a sweet 20% training speedup out of the box (your mileage may vary!). I’m using 'File' mode here, meaning that the full dataset is copied to the training instance: the faster network (100 Gbit, up from 25 Gbit) and storage (local NVMe instead of Amazon EBS) are certainly helping!

When working with large data sets, you could put 100 Gbit networking to good use either by streaming data from Amazon Simple Storage Service (S3) with Pipe Mode, or by storing it in Amazon Elastic File System or Amazon FSx for Lustre. It would also help with distributed training (using Horovod, maybe), as instances would be able to exchange parameter updates faster.
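As a minimal sketch (reusing the training_image, role, s3_output_location, and sess variables from the notebook above, and assuming your algorithm and data format support Pipe mode), switching to streaming input is a one-line change:

ic = sagemaker.estimator.Estimator(training_image,
                                   role,
                                   train_instance_count=1,
                                   train_instance_type='ml.p3dn.24xlarge',
                                   input_mode='Pipe',   # stream data from S3 instead of copying it
                                   output_path=s3_output_location,
                                   sagemaker_session=sess)
ic.fit(...)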

In short, the Amazon SageMaker and P3dn tag team packs quite a punch, and it should deliver a significant performance improvement for large-scale deep learning workloads.

Now available!
P3dn instances are available on Amazon SageMaker in the US East (N. Virginia) and US West (Oregon) regions. If you are ready to get started, please contact your AWS account team or use the Contact Us page to make a request.

As always, we’d love to hear your feedback, either on the AWS Forum for Amazon SageMaker, or through your usual AWS contacts.

Automating your lift-and-shift migration at no cost with CloudEndure Migration

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/automating-your-lift-and-shift-migration-at-no-cost-with-cloudendure-migration/

This post is courtesy of Gonen Stein, Head of Product Strategy, CloudEndure 

Acquired by AWS in January 2019, CloudEndure offers a highly automated migration tool to simplify and expedite rehost (lift-and-shift) migrations. AWS recently announced that CloudEndure Migration is now available to all customers and partners at no charge.

Each free CloudEndure Migration license provides 90 days of use following agent installation. During this period, you can perform all migration steps: replicate your source machines, conduct tests, and perform a scheduled cutover to complete the migration.

Overview

In this post, I show you how to obtain free CloudEndure Migration licenses and how to use CloudEndure Migration to rehost a machine from an on-premises environment to AWS. Although I’ve chosen to focus on an on-premises-to-AWS use case, CloudEndure also supports migration from cloud-based environments. For those of you who are interested in advanced automation, I include information on how to further automate large-scale migration projects.

Understanding CloudEndure Migration

CloudEndure Migration works through an agent that you install on your source machines. No reboot is required, and there is no performance impact on your source environment. The CloudEndure Agent connects to the CloudEndure User Console, which issues an API call to your target AWS Region. The API call creates a Staging Area in your AWS account that is designated to receive replicated data.

CloudEndure Migration’s automated rehosting consists of three main steps:

  1. Installing the Agent: The CloudEndure Agent replicates entire machines to a Staging Area in your target.
  2. Configuration and testing: You configure your target machine settings and launch non-disruptive tests.
  3. Performing cutover: CloudEndure automatically converts machines to run natively in AWS.

The Staging Area comprises both lightweight Amazon EC2 instances that act as replication servers and staging Amazon EBS volumes. Each source disk maps to an identically sized EBS volume in the Staging Area. The replication servers receive data from the CloudEndure Agent running on the source machines and write this data onto staging EBS volumes. One replication server can handle multiple source machines replicating concurrently.

After all source disks copy to the Staging Area, the CloudEndure Agent continues to track and replicate any changes made to the source disks. Continuous replication occurs at the block level, enabling CloudEndure to replicate any application that runs on supported x86-based Windows or Linux operating systems via an installed agent.

When the target machines launch for testing or cutover, CloudEndure automatically converts the target machines so that they boot and run natively on AWS. This conversion includes injecting the appropriate AWS drivers, making appropriate bootloader changes, modifying network adapters, and activating operating systems using the AWS KMS. This machine conversion process generally takes less than a minute, irrespective of machine size, and runs on all launched machines in parallel.

CloudEndure Migration Architecture

CloudEndure Migration Architecture

Installing CloudEndure Agent

To install the Agent:

  1. Start your migration project by registering for a free CloudEndure Migration license. The registration process is quick–use your email address to create a username and password for the CloudEndure User Console. Use this console to create and manage your migration projects.

    After you log in to the CloudEndure User Console, you need an AWS access key ID and secret access key to connect your CloudEndure Migration project to your AWS account. To obtain these credentials, sign in to the AWS Management Console and create an IAM user. Enter your AWS credentials in the CloudEndure User Console.

  2. Configure and save your migration replication settings, including your migration project’s Migration Source, Migration Target, and Staging Area. For example, I selected the Migration Source: Other Infrastructure, because I am migrating on-premises machines. Also, I selected the Migration Target AWS Region: AWS US East (N. Virginia). The CloudEndure User Console also enables you to configure a variety of other settings after you select your source and target, such as subnet, security group, VPN or Direct Connect usage, and encryption.

    After you save your replication settings, CloudEndure prompts you to install the CloudEndure Agent on your source machines. In my example, the source machines consist of an application server, a web server, and a database server, and all three are running Debian GNU / Linux 9.

  3. Download the CloudEndure Agent Installer for Linux by running the following command:
    wget -O ./installer_linux.py https://console.cloudendure.com/installer_linux.py
  4. Run the Installer:
    sudo python ./installer_linux.py -t <INSTALLATION TOKEN> --no-prompt

    You can install the CloudEndure Agent locally on each machine. For large-scale migrations, use the unattended installation parameters with any standard deployment tool to remotely install the CloudEndure Agent on your machines.

    After the Agent installation completes, CloudEndure adds your source machines to the CloudEndure User Console. From there, your source machines undergo several initial replication steps. To obtain a detailed breakdown of these steps, in the CloudEndure User Console, choose Machines, and select a specific machine to open the Machine Details View page.

    details page

    Data replication consists of two stages: Initial Sync and Continuous Data Replication. During Initial Sync, CloudEndure copies all of the source disks’ existing content into EBS volumes in the Staging Area. After Initial Sync completes, Continuous Data Replication begins, tracking your source machines and replicating new data writes to the staging EBS volumes. Continuous Data Replication makes sure that your Staging Area always has the most up-to-date copy of your source machines.

  5. To track your source machines’ data replication progress, in the CloudEndure User Console, choose Machines, and see the Details view.
    When the Data Replication Progress status column reads Continuous Data Replication, and the Migration Lifecycle status column reads Ready for Testing, Initial Sync is complete. These statuses indicate that the machines are functioning correctly and are ready for testing and migration.

Configuration and testing

To test how your machine runs on AWS, you must configure the Target Machine Blueprint. This Blueprint is a set of configurations that define where and how the target machines are launched and provisioned, such as target subnet, security groups, instance type, volume type, and tags.

For large-scale migration projects, APIs can be used to configure the Blueprint for all of your machines within seconds.

I recommend performing a test at least two weeks before migrating your source machines, to give you enough time to identify potential problems and resolve them before you perform the actual cutover. For more information, see Migration Best Practices.

To launch machines in Test Mode:

  1. In the CloudEndure User Console, choose Machines.
  2. Select the Name box corresponding to each machine to test.
  3. Choose Launch Target Machines, Test Mode.Launch target test machines

After the target machines launch in Test Mode, the CloudEndure User Console reports those machines as Tested and records the date and time of the test.

Performing cutover

After you have completed testing, your machines continue to be in Continuous Data Replication mode until the scheduled cutover window.

When you are ready to perform a cutover:

  1. In the CloudEndure User Console, choose Machines.
  2. Select the Name box corresponding to each machine to migrate.
  3. Choose Launch Target Machines, Cutover Mode.
    To confirm that your target machines successfully launch, see the Launch Target Machines menu. As your data replicates, verify that the target machines are running correctly, make any necessary configuration adjustments, perform user acceptance testing (UAT) on your applications and databases, and redirect your users.
  4. After the cutover completes, remove the CloudEndure Agent from the source machines and the CloudEndure User Console.
  5. At this point, you can also decommission your source machines.

Conclusion

In this post, I showed how to rehost an on-premises workload to AWS using CloudEndure Migration. CloudEndure automatically converts your machines from any source infrastructure to AWS infrastructure. That means they can boot and run natively in AWS, and run as expected after migration to the cloud.

If you have further questions see CloudEndure Migration, or Registering to CloudEndure Migration.

Get started now with free CloudEndure Migration licenses.

Learn about AWS Services & Solutions – September AWS Online Tech Talks

Post Syndicated from Jenny Hang original https://aws.amazon.com/blogs/aws/learn-about-aws-services-solutions-september-aws-online-tech-talks/

Learn about AWS Services & Solutions – September AWS Online Tech Talks

AWS Tech Talks

Join us this September to learn about AWS services and solutions. The AWS Online Tech Talks are live, online presentations that cover a broad range of topics at varying technical levels. These tech talks, led by AWS solutions architects and engineers, feature technical deep dives, live demonstrations, customer examples, and Q&A with AWS experts. Register Now!

Note – All sessions are free and in Pacific Time.

Tech talks this month:

 

Compute:

September 23, 2019 | 11:00 AM – 12:00 PM PT – Build Your Hybrid Cloud Architecture with AWS – Learn about the extensive range of services AWS offers to help you build a hybrid cloud architecture best suited for your use case.

September 26, 2019 | 1:00 PM – 2:00 PM PT – Self-Hosted WordPress: It’s Easier Than You Think – Learn how you can easily build a fault-tolerant WordPress site using Amazon Lightsail.

October 3, 2019 | 11:00 AM – 12:00 PM PT – Lower Costs by Right Sizing Your Instance with Amazon EC2 T3 General Purpose Burstable Instances – Get an overview of T3 instances, understand what workloads are ideal for them, and understand how the T3 credit system works so that you can lower your EC2 instance costs today.

 

Containers:

September 26, 2019 | 11:00 AM – 12:00 PM PT – Develop a Web App Using Amazon ECS and AWS Cloud Development Kit (CDK) – Learn how to build your first app using CDK and AWS container services.

 

Data Lakes & Analytics:

September 26, 2019 | 9:00 AM – 10:00 AM PT – Best Practices for Provisioning Amazon MSK Clusters and Using Popular Apache Kafka-Compatible Tooling – Learn best practices on running Apache Kafka production workloads at a lower cost on Amazon MSK.

 

Databases:

September 25, 2019 | 1:00 PM – 2:00 PM PT – What’s New in Amazon DocumentDB (with MongoDB compatibility) – Learn what’s new in Amazon DocumentDB, a fully managed MongoDB compatible database service designed from the ground up to be fast, scalable, and highly available.

October 3, 2019 | 9:00 AM – 10:00 AM PT – Best Practices for Enterprise-Class Security, High-Availability, and Scalability with Amazon ElastiCache – Learn about new enterprise-friendly Amazon ElastiCache enhancements like customer managed key and online scaling up or down to make your critical workloads more secure, scalable and available.

 

DevOps:

October 1, 2019 | 9:00 AM – 10:00 AM PT – CI/CD for Containers: A Way Forward for Your DevOps Pipeline – Learn how to build CI/CD pipelines using AWS services to get the most out of the agility afforded by containers.

 

Enterprise & Hybrid:

September 24, 2019 | 1:00 PM – 2:30 PM PT Virtual Workshop: How to Monitor and Manage Your AWS Costs – Learn how to visualize and manage your AWS cost and usage in this virtual hands-on workshop.

October 2, 2019 | 1:00 PM – 2:00 PM PT – Accelerate Cloud Adoption and Reduce Operational Risk with AWS Managed Services – Learn how AMS accelerates your migration to AWS, reduces your operating costs, improves security and compliance, and enables you to focus on your differentiating business priorities.

 

IoT:

September 25, 2019 | 9:00 AM – 10:00 AM PT – Complex Monitoring for Industrial with AWS IoT Data Services – Learn how to solve your complex event monitoring challenges with AWS IoT Data Services.

 

Machine Learning:

September 23, 2019 | 9:00 AM – 10:00 AM PT – Training Machine Learning Models Faster – Learn how to train machine learning models quickly and with a single click using Amazon SageMaker.

September 30, 2019 | 11:00 AM – 12:00 PM PT – Using Containers for Deep Learning Workflows – Learn how containers can help address challenges in deploying deep learning environments.

October 3, 2019 | 1:00 PM – 2:30 PM PT – Virtual Workshop: Getting Hands-On with Machine Learning and Ready to Race in the AWS DeepRacer League – Join DeClercq Wentzel, Senior Product Manager for AWS DeepRacer, for a presentation on the basics of machine learning and how to build a reinforcement learning model that you can use to join the AWS DeepRacer League.

 

AWS Marketplace:

September 30, 2019 | 9:00 AM – 10:00 AM PT – Advancing Software Procurement in a Containerized World – Learn how to deploy applications faster with third-party container products.

 

Migration:

September 24, 2019 | 11:00 AM – 12:00 PM PT – Application Migrations Using AWS Server Migration Service (SMS) – Learn how to use AWS Server Migration Service (SMS) for automating application migration and scheduling continuous replication, from your on-premises data centers or Microsoft Azure to AWS.

 

Networking & Content Delivery:

September 25, 2019 | 11:00 AM – 12:00 PM PT – Building Highly Available and Performant Applications using AWS Global Accelerator – Learn how to build highly available and performant architectures for your applications with AWS Global Accelerator, now with source IP preservation.

September 30, 2019 | 1:00 PM – 2:00 PM PT – AWS Office Hours: Amazon CloudFront – Just getting started with Amazon CloudFront and Lambda@Edge? Get answers directly from our experts during AWS Office Hours.

 

Robotics:

October 1, 2019 | 11:00 AM – 12:00 PM PT – Robots and STEM: AWS RoboMaker and AWS Educate Unite! – Come join members of the AWS RoboMaker and AWS Educate teams as we provide an overview of our education initiatives and walk you through the newly launched RoboMaker Badge.

 

Security, Identity & Compliance:

October 1, 2019 | 1:00 PM – 2:00 PM PT – Deep Dive on Running Active Directory on AWS – Learn how to deploy Active Directory on AWS and start migrating your Windows workloads.

 

Serverless:

October 2, 2019 | 9:00 AM – 10:00 AM PT – Deep Dive on Amazon EventBridge – Learn how to optimize event-driven applications, and use rules and policies to route, transform, and control access to these events that react to data from SaaS apps.

 

Storage:

September 24, 2019 | 9:00 AM – 10:00 AM PT – Optimize Your Amazon S3 Data Lake with S3 Storage Classes and Management Tools – Learn how to use the Amazon S3 Storage Classes and management tools to better manage your data lake at scale and to optimize storage costs and resources.

October 2, 2019 | 11:00 AM – 12:00 PM PT – The Great Migration to Cloud Storage: Choosing the Right Storage Solution for Your Workload – Learn more about AWS storage services and identify which service is the right fit for your business.

 

 

Running AWS Infrastructure On Premises with AWS Outposts

Post Syndicated from Matt Garman original https://aws.amazon.com/blogs/compute/running-aws-infrastructure-on-premises-with-aws-outposts/

We announced AWS Outposts at re:Invent last December and since then have seen immense customer interest. Customers have been asking for an AWS option on-premises to run applications with low latency and local data-processing requirements. AWS Outposts is a new service, slated to launch in late 2019, that brings the same infrastructure, APIs, and tools that customers use in AWS to virtually any customer on-premises facility. It is a fully managed service; the physical infrastructure is delivered and installed by AWS, operated and monitored by AWS, and automatically updated and patched as part of being connected to an AWS Region.

You can use AWS Outposts to launch a range of Amazon EC2 instances (C5, M5, R5, I3en and G4, both with and without local storage options) and Amazon EBS volumes locally. In addition to EC2 and EBS, you can also run a wide range of AWS services locally on Outposts. At general availability, AWS services supported locally on Outposts will include Amazon ECS and Amazon EKS clusters for container-based applications, Amazon EMR clusters for data analytics, and Amazon RDS instances for relational database services. Additional services such as Amazon SageMaker and Amazon MSK are coming soon after launch.

An Outpost works as an extension of the AWS Region into your own data center; services running on the Outpost can seamlessly work with any AWS service or resource running in the cloud. For example, you can use private connectivity to your Amazon S3 buckets or Amazon DynamoDB tables in the public region. Amazon tools will work with Outposts as well. API calls will be logged via CloudTrail automatically and existing CloudFormation templates will work as well. When AWS launches new innovations, they will work with Outposts so customers can always take advantage of the latest technologies.

We are speaking with customers across verticals including manufacturing, healthcare, financial services, media and entertainment, and telecom who are interested in Outposts for their connected environments. One of the most common scenarios is applications that need single-digit millisecond latency to end-users or onsite equipment. Customers may need to run compute-intensive workloads on their manufacturing factory floors with precision and quality. Others have graphics-intensive applications such as image analysis that need low-latency access to end-users or storage-intensive workloads that collect and process hundreds of TBs of data a day. Customers want to integrate their cloud deployments with their on-premises environments and use AWS services for a consistent hybrid experience.

Some customers are also interested in leveraging AWS services in disconnected environments like cruise ships or remote mining locations. For these types of environments, AWS offers Snowball Edge, which is optimized to operate in environments with limited to no connectivity. Compared to Snowball Edge, Outposts are designed to be run exclusively in connected, on-premises environments.

I want to share an example of how an early user of Outposts is using it to control and operate industrial equipment at hundreds of sites around the world. They already run centralized decision-making applications in AWS to identify what work to execute at which site. Predictable low-latency access to local compute resources is essential for their on-premises control systems to manage materials with smoothness and speed. For instance, control systems need to process video streams to sense the product on the conveyor belt, and execute a robotic movement to direct the product to the right location. Their sites also run video monitoring applications where the captured data can exceed available bandwidth so they want to conduct video encoding on-premises.

We worked with our early customer to deploy an Outpost rack at one of their sites. After connecting their Outpost to the nearest local AWS Region, the customer has complete control over their virtual network, including selection of an IP address range, creation of subnets, and configuration of route tables and network gateways, just like with their Amazon VPC today. They can seamlessly extend their regional VPC to the Outpost by creating a subnet and associating it with the Outpost the same way they associate subnets with an Availability Zone today.

To launch instances on their Outposts, they use the exact same API call as they do in the public region, but target their Outpost subnet so the instances launch on the Outpost in their facility. These instances will run in their existing VPC and even be able to communicate with instances running in the public region via private IP addresses. As the infrastructure is the same hardware as what we use in our public AWS Region, the applications running on these instances will perform the same as they do in the public region. To get low-latency access to local compute and storage already running in their facility, the customer can create a local gateway in their VPC that makes it simple for the Outpost to route traffic directly to their local datacenter networks.
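
As a rough sketch of what that looks like with the AWS CLI, the familiar run-instances call simply targets the subnet associated with the Outpost. The AMI, instance type, and subnet IDs below are placeholders for illustration, not values from this post:

# Launch an instance onto the Outpost by targeting the Outpost-associated subnet.
# All IDs shown here are hypothetical placeholders.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type m5.large \
  --subnet-id subnet-0abc1234def567890 \
  --count 1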

Using Outposts, the customer plans to standardize tooling across on-premises and the cloud, and automate deployments and configurations across hundreds of sites by using the same APIs, the same IAM permissions, the same EC2 AMIs, the same CloudFormation templates, and the same deployment pipelines everywhere. As Outposts are updated and patched as part of AWS regional operations, they no longer need to upgrade and patch their on-premises infrastructure or take downtime for maintenance.

We are excited about the enthusiasm we’re seeing from customers, and are eager to help them focus on innovating for their end-users without worrying about procuring, deploying, and operating the infrastructure for their applications. With AWS Outposts, we are bringing the same AWS infrastructure, APIs, services, and tools to help customers build applications that can run as reliably and securely at any of their sites as in the cloud. We recently released a new video that shares more details about the Outposts rack. If you haven’t already seen it, go check it out! As always, we look forward to your feedback.

To learn more about AWS Outposts, visit the product detail page.

Additional Resources:

What is an AWS Outposts Rack?

AWS Nitro System

Sharing automated blueprints for Amazon ECS continuous delivery using AWS Service Catalog

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/sharing-automated-blueprints-for-amazon-ecs-continuous-delivery-using-aws-service-catalog/

This post is contributed by Mahmoud ElZayet | Specialist SA – Dev Tech, AWS

 

Modern application development processes enable organizations to improve speed and quality continually. In this innovative culture, small, autonomous teams own the entire application life cycle. While such nimble, autonomous teams speed product delivery, they can also impose costs on compliance, quality assurance, and code deployment infrastructures.

Standardized tooling and application release code help share best practices across teams, reduce duplicated code, speed on-boarding, create consistent governance, and prevent resource over-provisioning.

 

Overview

In this post, I show you how to use AWS Service Catalog to provide standardized and automated deployment blueprints. This helps accelerate and improve your product teams’ application release workflows on Amazon ECS. Follow my instructions to create a sample blueprint that your product teams can use to release containerized applications on ECS. You can also apply the blueprint concept to other technologies, such as serverless or Amazon EC2–based deployments.

The sample templates and scripts provided here are for demonstration purposes and should not be used “as-is” in your production environment. After you become familiar with these resources, create customized versions for your production environment, taking account of in-house tools and team skills, as well as all applicable standards and restrictions.

 

Prerequisites

To use this solution, you need the following resources:

 

Sample scenario

Example Corp. has various product teams that develop applications and services on AWS. Example Corp. teams have expressed interest in deploying their containerized applications managed by AWS Fargate on ECS. As part of Example Corp’s central tooling team, you want to enable teams to quickly release their applications on Fargate, while also making sure that they comply with all best practices and governance requirements.

For convenience, I also assume that you have supplied product teams working on the same domain, application, or project with a shared AWS account for service deployment. Using this account, they all deploy to the same ECS cluster.

In this scenario, you can author and provide these teams with a shared deployment blueprint on ECS Fargate. Using AWS Service Catalog, you can share the blueprint with teams as follows:

  1. Every time that a product team wants to release a new containerized application on ECS, they retrieve a new AWS Service Catalog ECS blueprint product. This enables them to obtain the required infrastructure, permissions, and tools. As a prerequisite, the ECS blueprint requires building blocks such as a git repository or an AWS CodeBuild project. Again, you can acquire those blocks through another AWS Service Catalog product.
  2. The product team completes the ECS blueprint’s required parameters, such as the desired number of ECS tasks and application name. As an administrator, you can constrain the value of some parameters such as the VPC and the cluster name. For more information, see AWS Service Catalog Template Constraints.
  3. The ECS blueprint product deploys all the required ECS resources, configured according to best practices. You can also use the AWS Cloud Development Kit (CDK) to maintain and provision pre-defined constructs for your infrastructure.
  4. A standardized CI/CD pipeline is also generated, enabling your product teams to publish their application to ECS automatically. Ideally, this pipeline should have all stages, practices, security checks, and standards required for application release. Product teams must still author application code, create a Dockerfile and build specifications, run automated tests and deployment scripts, and complete other tasks required for application release.
  5. The ECS blueprint can be continually updated based on organization-wide feedback and to support new use cases. Your product team can always access the latest version through AWS Service Catalog. I recommend retaining multiple, customizable blueprints for various technologies.

 

For simplicity’s sake, my explanation envisions your environment as consisting of one AWS account. In practice, you can use IAM controls to segregate teams’ access to each other’s resources, even when they share an account. However, I recommend having at least two AWS accounts, one for testing and one for production purposes.

To see an example framework that helps deploy your AWS Service Catalog products to multiple accounts, see AWS Deployment Framework (ADF). This framework can also help you create cross-account pipelines that cater to different product teams’ needs, even when these teams deploy to the same technology stack.

To set up shared deployment blueprints for your production teams, follow the steps outlined in the following sections.

 

Set up the environment

In this section, I explain how to create a central ECS cluster in the appropriate VPC where teams can deploy their containers. I provide an AWS CloudFormation template to help you set up these resources. This template also creates an IAM role to be used by AWS Service Catalog later.

To run the CloudFormation template:

1. Use a git client to clone the following GitHub repository to a local directory. This will be the directory where you will run all the subsequent AWS CLI commands.

2. Using the AWS CLI, run the following commands. Replace <Application_Name> with a lowercase string with no spaces representing the application or microservice that your product team plans to release—for example, myapp.

aws cloudformation create-stack --stack-name "fargate-blueprint-prereqs" --template-body file://environment-setup.yaml --capabilities CAPABILITY_NAMED_IAM --parameters ParameterKey=ApplicationName,ParameterValue=<Application_Name>

3. Keep running the following command until the output reads CREATE_COMPLETE (or use the wait command shown after this procedure):

aws cloudformation describe-stacks --stack-name "fargate-blueprint-prereqs" --query Stacks[0].StackStatus

4. In case of error, use the describe-stack-events CLI command or review the error details in the console.

5. When the stack creation reads CREATE_COMPLETE, run the following command, and make a note of the output values in an editor of your choice. You need this information for a later step:

aws cloudformation describe-stacks --stack-name fargate-blueprint-prereqs --query Stacks[0].Outputs

6. Run the following commands to copy those CloudFormation templates to Amazon S3. Replace <Template_Bucket_Name> with the template bucket output value you just copied into your editor of choice:

aws s3 cp core-build-tools.yml s3://<Template_Bucket_Name>/core-build-tools.yml

aws s3 cp ecs-fargate-deployment-blueprint.yml s3://<Template_Bucket_Name>/ecs-fargate-deployment-blueprint.yml
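
As noted in step 3, you can avoid polling the stack status manually: the AWS CLI provides a wait command that blocks until stack creation finishes. This is a convenience, not part of the original walkthrough, and also works for the catalog products stack created in the next section:

# Blocks until the stack reaches CREATE_COMPLETE (or fails)
aws cloudformation wait stack-create-complete --stack-name "fargate-blueprint-prereqs"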

Create AWS Service Catalog products

In this section, I show you how to create two AWS Service Catalog products for teams to use in publishing their containerized app:

  1. Core Build Tools
  2. ECS Fargate Deployment Blueprint

To create an AWS Service Catalog portfolio that includes these products:

1. Using the AWS CLI, run the following command, replacing <Application_Name>
with the application name you defined earlier and replacing <Template_Bucket_Name>
with the template bucket output value you copied into your editor of choice:

aws cloudformation create-stack --stack-name "fargate-blueprint-catalog-products" --template-body file://catalog-products.yaml --parameters ParameterKey=ApplicationName,ParameterValue=<Application_Name> ParameterKey=TemplateBucketName,ParameterValue=<Template_Bucket_Name>

2. After a few minutes, check the stack creation completion. Run the following command until the output reads CREATE_COMPLETE:

aws cloudformation describe-stacks --stack-name "fargate-blueprint-catalog-products" --query Stacks[0].StackStatus

3. In case of error, use the describe-stack-events CLI command or check error details in the console.

Your AWS Service Catalog configuration should now be ready.
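
Optionally, you can confirm from the CLI that both products are registered. The following is a sketch, assuming your current credentials have Service Catalog administrator access; the query expression is based on the SearchProductsAsAdmin response shape:

# List the names of all products visible to the administrator
aws servicecatalog search-products-as-admin --query "ProductViewDetails[].ProductViewSummary.Name"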

 

Test product teams experience

In this section, I show you how to use IAM roles to impersonate a product team member and simulate their first experience of containerized application deployment.

 

Assume team role

To assume the role that you created during the environment setup step:

1. In the AWS Management Console, follow the instructions in Switching a Role.

  • For Account, enter the account ID used in the sample solution. To learn more about how to find an AWS account ID, see Your AWS Account ID and Its Alias.
  • For Role, enter <Application_Name>-product-team-role, where <Application_Name> is the same application name you defined in the Environment Setup section.
  • (Optional) For Display name, enter a custom session value.

You are now logged in as a member of the product team.

 

Provision core build product

Next, provision the core build tools for your blueprint:

  1. In the Service Catalog console, you should now see the two products created earlier listed under Products.
  2. Select the first product, Core Build Tools.
  3. Choose LAUNCH PRODUCT.
  4. Name the product something such as <Application_Name>-build-tools, replacing <Application_Name> with the name previously defined for your application.
  5. Provide the same application name you defined previously.
  6. Leave the ContainerBuild parameter default setting as yes, as you are building a container requiring a container repository and its associated permissions.
  7. Choose NEXT three times, then choose LAUNCH.
  8. Under Events, watch the Status property. Keep refreshing until the status reads Succeeded. In case of failure, choose the URL value next to the key CloudformationStackARN. This choice takes you to the CloudFormation console, where you can find more information on the errors.

Now you have the following build tools created along with the required permissions:

  • AWS CodeCommit repository to store your code
  • CodeBuild project to build your container image and test your application code
  • Amazon ECR repository to store your container images
  • Amazon S3 bucket to store your build and release artifacts

 

Provision ECS Fargate deployment blueprint

In the Service Catalog console, follow the same steps to deploy the blueprint for ECS deployment. Here are the product provisioning details:

  • Product Name: <Application_Name>-fargate-blueprint.
  • Provisioned Product Name: <Application_Name>-ecs-fargate-blueprint.
  • For the parameters Subnet1, Subnet2, VpcId, enter the output values you copied earlier into your editor of choice in the Setup Environment section.
  • For other parameters, enter the following:
    • ApplicationName: The same application name you defined previously.
    • ClusterName: Enter the value example-corp-ecs-cluster, which is the name chosen in the template for the central cluster.
  • Leave the DesiredCount and LaunchType parameters at their default values.

After the blueprint product creation completes, you should have an ECS service with a sample task definition for your product team. The build tools created earlier include the permissions required for deploying to the ECS service. Also, a CI/CD pipeline has been created to guide your product teams as they publish their application to the ECS service. Ideally, this pipeline should have all stages, practices, security checks, and standards required for application release.

Product teams still have to author application code, create a Dockerfile, build specifications, run automated tests and deployment scripts, and perform other tasks required for application release. The blueprint product can provide wiki links to reference examples for these steps, or access to pre-provisioned sample pipelines.

 

Test your pipeline

Now, upload a sample app to test your pipeline:

  1. Log in with the product team role.
  2. In the CodeCommit console, select the repository with the application name that you defined in the environment setup section.
  3. Scroll down, choose Add file, Create file.
  4. Paste the following in the page editor, which is a script to build the container image and push it to the ECR repository:
version: 0.2
phases:
  pre_build:
    commands:
      - $(aws ecr get-login --no-include-email)
      - TAG="$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | head -c 8)"
      - IMAGE_URI="${REPOSITORY_URI}:${TAG}"
  build:
    commands:
      - docker build --tag "$IMAGE_URI" .
  post_build:
    commands:
      - docker push "$IMAGE_URI"      
      - printf '[{"name":"%s","imageUri":"%s"}]' "$APPLICATION_NAME" "$IMAGE_URI" > images.json
artifacts:
  files: 
    - images.json
    - '**/*'

5. For File name, enter buildspec.yml.

6. For Author name and Email address, enter your name and your preferred email address for the commit. Although optional, the addition of a commit message is a good practice.

7. Choose Commit changes.

8. Repeat the same steps for the Dockerfile. The sample Dockerfile creates a straightforward PHP application. Typically, you add your application content to that image.

File name: Dockerfile

File content:

FROM ubuntu:12.04

# Install dependencies
RUN apt-get update -y
RUN apt-get install -y git curl apache2 php5 libapache2-mod-php5 php5-mcrypt php5-mysql

# Configure apache
RUN a2enmod rewrite
RUN chown -R www-data:www-data /var/www
ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_LOG_DIR /var/log/apache2

EXPOSE 80

CMD ["/usr/sbin/apache2", "-D",  "FOREGROUND"]

Your pipeline should now be ready to run successfully. Although you can list all current pipelines in the Region, you can only describe and modify pipelines that have a prefix matching your application name. To confirm:

  1. In the AWS CodePipeline console, select the pipeline <Application_Name>-ecs-fargate-pipeline.
  2. The pipeline should now be running.

Because you performed two commits to the repository from the console, you must wait for the second run to complete before successful deployment to ECS Fargate.
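
If you would rather watch the pipeline from the CLI than the console, a sketch like the following reports the status of each stage (the pipeline name assumes the naming convention used by the blueprint in this post):

# Show the latest status of each stage in the generated pipeline
aws codepipeline get-pipeline-state --name <Application_Name>-ecs-fargate-pipeline --query "stageStates[].{stage:stageName,status:latestExecution.status}"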

 

Clean up

To clean up the environment, run the following commands in the AWS CLI, replacing <Application_Name>
with your application name, <Account_Id> with your AWS Account ID (with no hyphens), and <Template_Bucket_Name>
with the template bucket output value you copied into your editor of choice:

aws ecr delete-repository --repository-name <Application_Name> --force

aws s3 rm s3://<Application_Name>-artifactbucket-<Account_Id> --recursive

aws s3 rm s3://<Template_Bucket_Name> --recursive

 

To remove the AWS Service Catalog products:

  1. Log in with the Product team role
  2. In the console, follow the instructions at Deleting Provisioned Products.
  3. Delete the AWS Service Catalog products in reverse order, starting with the blueprint product.

Run the following commands to delete the administrative resources:

aws cloudformation delete-stack --stack-name fargate-blueprint-catalog-products

aws cloudformation delete-stack --stack-name fargate-blueprint-prereqs

Conclusion

In this post, I showed you how to design and build ECS Fargate deployment blueprints. I explained how these accelerate and standardize the release of containerized applications on AWS. Your product teams can keep getting the latest standards and coded best practices through those automated blueprints.

As always, AWS welcomes feedback. Please submit comments or questions below.

Optimizing NGINX load balancing on Amazon EC2 A1 instances

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/optimizing-nginx-load-balancing-on-amazon-ec2-a1-instances/

This post is contributed by Geoff Blake | Sr System Development Engineer

In a previous post, Optimizing Network Intensive Workloads on Amazon EC2 A1 Instances, I provided general guidance on tuning network-intensive workloads on A1 instances using Memcached as the example use case.

NGINX is another network-intensive application that, like Memcached, is a good fit for the A1 instances. This post describes how to configure NGINX as a load balancer on A1 for optimal performance using Amazon Linux 2, highlighting the important tuning parameters. You can extract up to a 30% performance benefit with these tunings over the default configuration on A1. Depending on the particular data rates, processing required per request, instance size, and chosen AMI, the exact values could differ for your scenario, but the methodologies described here still apply.

IRQ affinity and receive packet steering

Turning off irqbalance, pinning IRQs, and distributing network processing to specific cores helps the performance of NGINX when it runs on A1 instances.

Unlike Memcached in the previous post, NGINX does not benefit from using receive packet steering (RPS) to spread network processing among all cores or isolating processing to a subset of cores. It is better to allow NGINX access to all cores, while keeping IRQs and network processing isolated to a subset of cores.

Using a modified version of the script from the previous post, you can set IRQs, RPS, and NGINX workers to the following mappings. Finding the optimal balance of IRQs, RPS, and worker mappings can benefit performance by up to 10% on a1.4xlarge instances.

Instance type | IRQ settings | RPS settings | NGINX workers
a1.2xlarge    | Core 0, 4    | Core 0, 4    | Run on cores 0-7
a1.4xlarge    | Core 0, 8    | Core 0, 8    | Run on cores 0-15
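
The original tuning script is not reproduced here, but as a minimal sketch of the same idea on an a1.4xlarge, the steps look something like the following. It assumes the primary interface is eth0 and that its two receive IRQs are 35 and 36; look up the real interface name and IRQ numbers in /proc/interrupts before applying anything like this.

# Stop irqbalance so that manual IRQ affinity settings stick
sudo systemctl stop irqbalance

# Pin the NIC IRQs to cores 0 and 8 (hex CPU masks 1 and 100)
echo 1   | sudo tee /proc/irq/35/smp_affinity
echo 100 | sudo tee /proc/irq/36/smp_affinity

# Steer receive packet processing for the matching queues to cores 0 and 8
echo 1   | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 100 | sudo tee /sys/class/net/eth0/queues/rx-1/rps_cpus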

NGINX access logging

For production deployments of NGINX, logging is critically important for monitoring the health of servers and debugging issues.

On large a1.4xlarge instance types, logging each request can become a performance bottleneck in certain situations. To alleviate this issue, tune the access_log configuration with the buffer modifier. Only a small amount of buffering is necessary to avoid logging becoming a bottleneck, on the order of 8 KB. This tuning parameter alone can give a significant boost of 20% or more on a1.4xlarge instances, depending on the traffic profile.
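
For reference, the buffer modifier is set directly on the access_log directive. A minimal example, assuming the default log path and the built-in combined format:

access_log /var/log/nginx/access.log combined buffer=8k;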

Additional Linux tuning parameters

The Linux networking stack is tuned to conserve resources for general use cases. When running NGINX as a load balancer, the server must be tuned to give the network stack considerably more resources than the default amount, to prevent dropped connections and packets. Here are the key parameters to tune (a sample sysctl sketch follows the list):

  • net.core.somaxconn: Maximum number of backlogged sockets allowed. Increase this to 4096 or more to prevent dropping connection requests.
  • net.ipv4.tcp_max_syn_backlog: Maximum number of outstanding SYN requests. Set this to the same value as net.core.somaxconn.
  • net.ipv4.ip_local_port_range: To avoid prematurely running out of connections with clients, set this to a larger ephemeral port range, such as 1024 65535.
  • net.core.rmem_max, net.core.wmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem, net.ipv4.tcp_mem: Socket and TCP buffer settings. Tune these to be larger than the default. Setting the maximum buffer sizes to 8 MB should be sufficient.
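
A sketch of how these might be applied with sysctl follows. The values are starting points consistent with the guidance above, not benchmarked recommendations; net.ipv4.tcp_mem is measured in pages rather than bytes, so size it separately for your instance.

sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 8388608"
# Persist the settings by adding the same keys to a file under /etc/sysctl.d/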

Additional NGINX configuration parameters

To extract the most out of an NGINX load balancer on A1 instances, set the following NGINX parameters higher than their default values (a sample configuration fragment follows the list):

 

  • worker_processes: Keeping this set to the default of auto works well on A1.
  • worker_rlimit_nofile: Set this to a high value such as 65536 to allow many connections and access to files.
  • worker_connections: Set this to a high value such as 49152 to cover most of the ephemeral port range.
  • keepalive_requests: The number of requests that a downstream client can make before a connection is closed. Setting this to a reasonably high number such as 10000 helps prevent connection churn and ephemeral port exhaustion.
  • keepalive: In your upstream blocks, set this to a value that covers your total number of backends plus expected growth, such as 100, to keep connections open to your backends from each worker process.
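
Putting those parameters together, a load balancer configuration fragment might look like the following. This is a minimal sketch, not the tested configuration from the benchmark, and the upstream server addresses are placeholders:

worker_processes auto;
worker_rlimit_nofile 65536;

events {
    worker_connections 49152;
}

http {
    upstream backends {
        server 10.0.1.10:8080;
        server 10.0.1.11:8080;
        # Keep idle connections open to the backends from each worker process
        keepalive 100;
    }

    server {
        listen 80;
        # Allow many requests per downstream connection to avoid connection churn
        keepalive_requests 10000;

        location / {
            proxy_pass http://backends;
            # HTTP/1.1 with an empty Connection header is required for upstream keepalive
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}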

Summary

Using the above tuning parameters versus the defaults that come with Amazon Linux 2 and NGINX can significantly increase the performance of an NGINX load balancing workload by up to 30% on the a1.4xlarge instance type. Similar, but less dramatic performance gains were seen on the smaller a1.2xlarge instance type as well. If you have questions about your own workload running on A1 instances, contact us at [email protected].

 

Deploying GitOps with Weave Flux and Amazon EKS

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/deploying-gitops-with-weave-flux-and-amazon-eks/

This post is contributed by Jon Jozwiak | Senior Solutions Architect, AWS

 

You have countless options for deploying resources into an Amazon EKS cluster. GitOps—a term coined by Weaveworks—provides some substantial advantages over the alternatives. With only Git as the single, central source for controlling deployment into your cluster, GitOps provides easy version control on a platform your team already knows. Getting started with GitOps is straightforward: create a pull request, merge, and the configuration deploys to the EKS cluster.

Weave Flux makes running GitOps in your EKS cluster fast and easy, as it monitors your configuration in Git and image repositories and automates deployments. Weave Flux follows a pull model, automatically triggering deployments based on changes. This provides better security than most continuous deployment tools, which need permissions to access your cluster. This approach also provides Git with version control over your configuration and enables rollback.

This post walks through implementing Weave Flux and deploying resources to EKS using Git. To simplify the image build pipeline, I use AWS Service Catalog to provide a standardized pipeline. AWS Service Catalog lets you centrally define a portfolio of approved products that AWS users can provision. An AWS CloudFormation template defines each product, which can be version-controlled.

After you deploy the sample resources, I quickly demonstrate the GitOps approach, where a change results in the configuration automatically deploying to EKS. That change may be a new container image, a commit of Kubernetes manifests, or a commit of Helm release definitions.

The following diagram shows the workflow.

Prerequisites

In GitOps, you manage Docker image builds separately from deployment configuration. For image builds, this example uses AWS CodePipeline and AWS CodeBuild, which provide a managed workflow from GitHub source through to an image landing in Amazon Elastic Container Registry (ECR).

This post assumes that you already have an EKS cluster deployed, including kubectl access. It also assumes that you have a GitHub account.

GitHub setup

First, create a GitHub repository to store the Kubernetes manifests (configuration files) to apply to the cluster.

In GitHub, create a GitHub repository. This repository holds Kubernetes manifests for your deployments. Name the repository k8s-config to align with this post. Leave it as a public repository, check the box for Initialize this repository with a README, and choose Create Repo.

On the GitHub repository page, choose Clone or Download and save the SSH string:

git@github.com:youruser/k8s-config.git

Next, create a GitHub token that allows creating and deleting repositories so AWS Service Catalog can deploy and remove pipelines.

  1. In your GitHub profile, access your token settings.
  2. Choose Generate New Token.
  3. Name your new token CodePipeline Service Catalog, and select the following options:
  • repo scopes (repo:status, repo_deployment, public_repo, and repo:invite)
  • read:org
  • write:public_key and read:public_key
  • write:repo_hook and read:repo_hook
  • read:user and user:email
  • delete_repo

4. Choose Generate Token.

5. Copy and save your access token for future access.

 

Deploy Helm

Helm is a package manager for Kubernetes that allows you to define a chart. Charts are collections of related resources that let you create, version, share, and publish applications. By deploying Helm into your cluster, you make it much easier to deploy Weave Flux and other systems. If you’ve deployed Helm already, skip this section.

First, install the Helm client with the following command:

curl -LO https://git.io/get_helm.sh

chmod 700 get_helm.sh

./get_helm.sh

 

On macOS, you could alternatively enter the following command:

brew install kubernetes-helm

 

Next, set up a service account with cluster role for Tiller, Helm’s server-side component. This allows Tiller to manage resources in your cluster.

kubectl -n kube-system create sa tiller

kubectl create clusterrolebinding tiller-cluster-rule \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:tiller

 

Finally, initialize Helm and verify your version. Tiller takes a few seconds to start.

helm init --service-account tiller --history-max 200

helm version

 

Deploy Weave Flux

With Helm installed, proceed with the Weave Flux installation. Begin by installing the Flux Custom Resource Definition.

kubectl apply -f https://raw.githubusercontent.com/fluxcd/flux/helm-0.10.1/deploy-helm/flux-helm-release-crd.yaml

Now add the Weave Flux Helm repository and proceed with the install. Make sure that you update the git.url to match the GitHub repository that you created earlier.

helm repo add fluxcd https://charts.fluxcd.io

helm upgrade -i flux --set helmOperator.create=true --set helmOperator.createCRD=false --set git.url=git@github.com:YOURUSER/k8s-config --namespace flux fluxcd/flux

 

You can use the following code to verify that you successfully deployed Flux. You should see three pods running:

kubectl get pods -n flux

NAME                                 READY     STATUS    RESTARTS   AGE

flux-5bd7fb6bb6-4sc78                1/1       Running   0          52s

flux-helm-operator-df5746688-84kw8   1/1       Running   0          52s

flux-memcached-6f8c446979-f45wj      1/1       Running   0          52s

 

Flux requires a deploy key to work with the GitHub repository. In this post, Flux generates the SSH key pair itself, but you can also specify a different key pair when deploying. To access the key, download fluxctl, a command line utility that interacts with the Flux API. The following steps work for Linux. For other OS platforms, see Installing fluxctl.

sudo wget -O /usr/local/bin/fluxctl https://github.com/fluxcd/flux/releases/download/1.14.1/fluxctl_linux_amd64

sudo chmod 755 /usr/local/bin/fluxctl

 

Validate that fluxctl installed successfully, then retrieve the public key pair using the following command. Specify the namespace where you deployed Flux.

fluxctl version

fluxctl --k8s-fwd-ns=flux identity

 

Copy the key and add that as a deploy key in your GitHub repository.

  1. In your GitHub repository, choose Settings, Deploy Keys.
  2. Choose Add deploy key and name the key Flux Deploy Key.
  3. Paste the key from fluxctl identity.
  4. Choose Allow Write Access, Add Key.

Now use AWS Service Catalog to set up your image build pipeline.

 

Set up AWS Service Catalog

To allow end users to consume product portfolios, you must associate a portfolio with an IAM principal (or principals): a user, group, or role. For this example, associate your current identity. After you master these basics, there are additional resources to teach you how to set up a multi-region, multi-account catalog.

To retrieve your current identity, use the AWS CLI to get your ARN:

aws sts get-caller-identity

Deploy the product portfolio that contains an image build pipeline service by doing the following:

  1. In the AWS CloudFormation console, launch the CloudFormation stack with the following link:

 

 

2. Choose Next.

3. On the Specify Details page, enter your ARN from get-caller-identity. Also enter an environment tag, which AWS applies to all resources from this portfolio.

4. Choose Next.

5. On the Options page, choose Next.

6. On the Review page, select the check box displayed next to I acknowledge that AWS CloudFormation might create IAM resources.

7. Choose Create. CloudFormation takes a few minutes to create your resources.

 

Deploy the image pipeline

The image pipeline provisions a GitHub repository, Amazon ECR repository, and AWS CodeBuild project. It also uses AWS CodePipeline to build a Docker image.

  1. In the AWS Management Console, go to the AWS Service Catalog products list and choose Pipeline for Docker Images.
  2. Choose Launch Product.
  3. For Name, enter ExamplePipeline, and choose Next.
  4. On the Parameters page, fill in a project name, description, and unique S3 bucket name. The specifics don’t matter, but make a note of the name and S3 bucket for later use.
  5. Fill in your GitHub User and GitHub Token values from earlier. Leave the rest of the fields as the default values.
  6. To clean up your GitHub repository on stack delete, change Delete Repository to true.
  7. Choose Next.
  8. On the TagOptions screen, choose Next.
  9. Choose Next on the Notifications page.
  10. On the Review page, choose Launch.

The launch process takes 1–2 minutes. You can verify that you now have a repository matching your project name (eks-example) in GitHub. You can also look at the pipeline created in the AWS CodePipeline console.

 

Deploying with GitOps

You can now provision workloads into the EKS cluster. With a GitOps approach, you only commit code and Kubernetes resource definitions to GitHub. AWS CodePipeline handles the image builds, and Weave Flux applies the desired state to Kubernetes.

First, create a simple Hello World application in your example pipeline. Clone the GitHub repository that you created in the previous step and substitute your GitHub user below.

git clone git@github.com:youruser/eks-example.git

cd eks-example

Create a base README file, a source directory, and download a simple NGINX configuration (hello.conf), home page (index.html), and Dockerfile.

echo "# eks-example" > README.md

mkdir src

wget -O src/hello.conf https://blog-gitops-eks.s3.amazonaws.com/hello.conf

wget -O src/index.html https://blog-gitops-eks.s3.amazonaws.com/index.html

wget https://blog-gitops-eks.s3.amazonaws.com/Dockerfile

 

Now that you have a simple Hello World app with Dockerfile, commit the changes to kick off the pipeline.

git add .

git commit -am "Initial commit"

[master (root-commit) d69a6ba] Initial commit

4 files changed, 34 insertions(+)

create mode 100644 Dockerfile

create mode 100644 README.md

create mode 100644 src/hello.conf

create mode 100644 src/index.html

git push

 

Watch in the AWS CodePipeline console to see the image build in process. This may take a minute to start. When it’s done, look in the ECR console to see the first version of the container image.

To deploy this image and the Hello World application, commit Kubernetes manifests for Flux. Create a namespace, deployment, and service in the Kubernetes Git repository (k8s-config) you created. Make sure that you aren’t in your eks-example repository directory.

cd ..

git clone git@github.com:youruser/k8s-config.git

cd k8s-config

mkdir charts namespaces releases workloads

 

The preceding directory structure helps organize the repository but isn’t necessary. Flux can descend into subdirectories and look for YAML files to apply.

Create a namespace Kubernetes manifest.

cat << EOF > namespaces/eks-example.yaml
apiVersion: v1
kind: Namespace
metadata:
  labels:
    name: eks-example
  name: eks-example
EOF

Now create a deployment manifest. Make sure that you update this image to point to your repository and image tag. For example, <Account ID>.dkr.ecr.us-east-1.amazonaws.com/eks-example:d69a6bac.

cat << EOF > workloads/eks-example-dep.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eks-example
  namespace: eks-example
  labels:
    app: eks-example
  annotations:
    # Container Image Automated Updates
    flux.weave.works/automated: "true"
    # do not apply this manifest on the cluster
    #flux.weave.works/ignore: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eks-example
  template:
    metadata:
      labels:
        app: eks-example
    spec:
      containers:
      - name: eks-example
        image: <Your Account>.dkr.ecr.us-east-1.amazonaws.com/eks-example:d69a6bac
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /
            port: http
        readinessProbe:
          httpGet:
            path: /
            port: http
EOF

 

Finally, create a service manifest to create a load balancer.

cat << EOF > workloads/eks-example-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: eks-example
  namespace: eks-example
  labels:
    app: eks-example
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: eks-example
EOF

 

In the preceding code, there are two Kubernetes annotations for Flux. The first, flux.weave.works/automated, tells Flux whether the container image should be automatically updated. This example sets the value to true, enabling updates to your deployment as new images arrive in the registry. This example comments out the second annotation, flux.weave.works/ignore. However, you can use it to tell Flux to ignore the deployment temporarily.

Commit the changes, and in a few minutes, it automatically deploys.

git add .
git commit -am "eks-example deployment"
[master 954908c] eks-example deployment
 3 files changed, 64 insertions(+)
 create mode 100644 namespaces/eks-example.yaml
 create mode 100644 workloads/eks-example-dep.yaml
 create mode 100644 workloads/eks-example-svc.yaml

 

Make sure that you push your changes.

git push

Now check the logs of your Flux pod:

kubectl get pods -n flux

Update the name below to reflect the name of the pod in your deployment. This sample polls for changes every five minutes. When it triggers, you should see kubectl apply log messages for the creation of the namespace, service, and deployment.

kubectl logs flux-5bd7fb6bb6-4sc78 -n flux
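
If you don't want to wait for the next polling interval, fluxctl (installed earlier) can ask Flux to sync with the Git repository immediately:

fluxctl sync --k8s-fwd-ns=flux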

Find the load balancer input for your service with the following:

kubectl describe service eks-example -n eks-example

Now when you connect to the load balancer address in a browser, you can see the Hello World app.

Change the eks-example source code in a small way (such as changing index.html to say Hello World Deployment 2), then commit and push to Git.

After a few minutes, refresh your browser to see the deployed change. You can watch the changes in AWS CodePipeline, in ECR, and through Flux logs. Weave Flux automatically updated your deployment manifests in the k8s-config repository to deploy the new image as it detected it. To back out that change, use a git revert or git reset command.
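
For example, a sketch of backing out the most recent eks-example commit with git revert looks like this (run from the eks-example directory):

# Revert the last commit and push, triggering a new pipeline run and deployment
git revert --no-edit HEAD
git push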

Finally, you can use the same approach to deploy Helm charts. You can host these charts within the configuration Git repository (k8s-config in this example), or on an external chart repository. In the following example, you use an external chart repository.

In your k8s-config directory, get the latest changes from your repository and then create a Helm release from an external chart.

cd k8s-config

git pull

 

First, create the namespace manifest.

cat << EOF > namespaces/nginx.yaml
apiVersion: v1
kind: Namespace
metadata:
  labels:
    name: nginx
  name: nginx
EOF

 

Then create the Helm release manifest. This is a custom resource definition provided by Weave Flux.

cat << EOF > releases/nginx.yaml
apiVersion: flux.weave.works/v1beta1
kind: HelmRelease
metadata:
  name: mywebserver
  namespace: nginx
  annotations:
    flux.weave.works/automated: "true"
    flux.weave.works/tag.nginx: semver:~1.16
    flux.weave.works/locked: 'true'
    flux.weave.works/locked_msg: '"Halt updates for now"'
    flux.weave.works/locked_user: User Name <[email protected]>
spec:
  releaseName: mywebserver
  chart:
    repository: https://charts.bitnami.com/bitnami/
    name: nginx
    version: 3.3.2
  values:
    usePassword: true
    image:
      registry: docker.io
      repository: bitnami/nginx
      tag: 1.16.0-debian-9-r46
    service:
      type: LoadBalancer
      port: 80
      nodePorts:
        http: ""
      externalTrafficPolicy: Cluster
    ingress:
      enabled: false
    livenessProbe:
      httpGet:
        path: /
        port: http
      initialDelaySeconds: 30
      timeoutSeconds: 5
      failureThreshold: 6
    readinessProbe:
      httpGet:
        path: /
        port: http
      initialDelaySeconds: 5
      timeoutSeconds: 3
      periodSeconds: 5
    metrics:
      enabled: false
EOF

git add . 
git commit -am "Adding NGINX Helm release"
git push

 

There are a few new annotations for Flux above. The flux.weave.works/locked annotation tells Flux to lock the deployment. This is useful if you find a known bad image and must roll back to a previous version. In addition, the flux.weave.works/tag.nginx annotation filters image tags by semantic versioning.

Wait up to five minutes for Flux to pull the configuration and verify this deployment as you did in the previous example:

kubectl get pods -n flux

kubectl logs flux-5bd7fb6bb6-4sc78 -n flux

 

kubectl get all -n nginx

 

If this doesn’t deploy, ensure Helm initialized as described earlier in this post.

kubectl get pods -n kube-system | grep tiller

kubectl get pods -n flux

kubectl logs flux-helm-operator-df5746688-84kw8 -n flux

 

Clean up

Log in as an administrator and follow these steps to clean up your sample deployment.

  1. Delete all images from the Amazon ECR repository.

2. In AWS Service Catalog provisioned products, select the three dots to the left of your ExamplePipeline service and choose Terminate provisioned product. Wait until it completes termination (1–2 minutes).

3. Delete your Amazon S3 artifact bucket.

4. Delete Weave Flux:

helm delete flux --purge

kubectl delete ns flux

kubectl delete crd helmreleases.flux.weave.works

5. Delete the load balancer services:

helm delete mywebserver --purge

kubectl delete ns nginx

kubectl delete svc eks-example -n eks-example

kubectl delete deployment eks-example -n eks-example

kubectl delete ns eks-example

6. Clean up your GitHub repositories:

 – Go to your k8s-config repository in GitHub, choose Settings, scroll to the bottom and choose Delete this repository. If you set delete to false in the pipeline service, you also must delete your eks-example repository.

 – Delete the personal access token that you created.

7. If you provisioned an EKS cluster at the beginning of this post, delete it:

eksctl get cluster

eksctl delete cluster <clustername>

8. In the AWS CloudFormation console, select the DevServiceCatalog stack, and choose Actions, Delete Stack.

Conclusion

In this post, I demonstrated how to use a GitOps approach, which allows you to focus on committing code and configuration to Git rather than learning new CI/CD tooling. Git acts as the single source of truth, and Weave Flux pulls changes and ensures that the Kubernetes cluster configuration matches the desired state.

In addition, AWS Service Catalog can be used to create a portfolio of services that enables you to standardize your offerings, such as an image build pipeline based on AWS CodePipeline.

As always, AWS welcomes feedback. Please submit comments or questions below.

Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances

Post Syndicated from Chad Schmutzer original https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/

AWS announces the new capacity-optimized allocation strategy for Amazon EC2 Auto Scaling and EC2 Fleet. This new strategy automatically makes the most efficient use of spare capacity while still taking advantage of the steep discounts offered by Spot Instances. It’s a new way for you to gain easy access to extra EC2 compute capacity in the AWS Cloud.

This post compares how the capacity-optimized allocation strategy deploys capacity compared to the current lowest-price allocation strategy.

Overview

Spot Instances are spare EC2 compute capacity in the AWS Cloud available to you at savings of up to 90% off compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by EC2 with two minutes of notification when EC2 needs the capacity back.

When making requests for Spot Instances, customers can take advantage of allocation strategies within services such as EC2 Auto Scaling and EC2 Fleet. The allocation strategy determines how the Spot portion of your request is fulfilled from the possible Spot Instance pools you provide in the configuration.

The existing allocation strategy available in EC2 Auto Scaling and EC2 Fleet is called “lowest-price” (with an option to diversify across N pools). This strategy allocates capacity strictly based on the lowest-priced Spot Instance pool or pools. The “diversified” allocation strategy (available in EC2 Fleet but not in EC2 Auto Scaling) spreads your Spot Instances across all the Spot Instance pools you’ve specified as evenly as possible.

As the AWS global infrastructure has grown over time in terms of geographic Regions and Availability Zones as well as the raw number of EC2 Instance families and types, so has the amount of spare EC2 capacity. Therefore it is important that customers have access to tools to help them utilize spare EC2 capacity optimally. The new capacity-optimized strategy for both EC2 Auto Scaling and EC2 Fleet provisions Spot Instances from the most-available Spot Instance pools by analyzing capacity metrics.

Walkthrough

To illustrate how the capacity-optimized allocation strategy deploys capacity compared to the existing lowest-price allocation strategy, here are examples of Auto Scaling group configurations and use cases for each strategy.

Lowest-price (diversified over N pools) allocation strategy

The lowest-price allocation strategy deploys Spot Instances from the pools with the lowest price in each Availability Zone. This strategy has an optional modifier SpotInstancePools that provides the ability to diversify over the N lowest-priced pools in each Availability Zone.

Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The lowest-price strategy does not account for pool capacity depth as it deploys Spot Instances.

As a result, the lowest-price allocation strategy is a good choice for workloads with a low cost of interruption that want the lowest possible prices, such as:

  • Time-insensitive workloads
  • Extremely transient workloads
  • Workloads that are easily check-pointed and restarted

Example

The following example configuration shows how capacity could be allocated in an Auto Scaling group using the lowest-price allocation strategy diversified over two pools:

{
  "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {
          "InstanceType": "c3.large"
        },
        {
          "InstanceType": "c4.large"
        },
        {
          "InstanceType": "c5.large"
        }
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "lowest-price",
      "SpotInstancePools": 2
    }
  },
  "MinSize": 10,
  "MaxSize": 100,
  "DesiredCapacity": 60,
  "HealthCheckType": "EC2",
  "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456"
}

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to lowest-price over two SpotInstancePools.
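
If you prefer to create this Auto Scaling group programmatically, one option is to save the JSON configuration above to a file and pass it to the CreateAutoScalingGroup API. The following is a minimal Python (boto3) sketch; the file name asg-lowest-price.json is an assumption for illustration:

import json
import boto3

# Assumes the JSON configuration shown above is saved locally as asg-lowest-price.json
autoscaling = boto3.client("autoscaling")

with open("asg-lowest-price.json") as f:
    config = json.load(f)

# CreateAutoScalingGroup accepts the same top-level keys used in the JSON configuration
autoscaling.create_auto_scaling_group(**config)

The equivalent AWS CLI call would be aws autoscaling create-auto-scaling-group --cli-input-json file://asg-lowest-price.json.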

First, EC2 Auto Scaling tries to make sure that it balances the requested capacity across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the lowest-price allocation strategy allocates the Spot Instance launches to the lowest-priced pool per zone.

Using the example Spot prices shown in the following table, the resulting allocation is:

  • 20 Spot Instances from us-east-1a (10 c3.large, 10 c4.large)
  • 20 Spot Instances from us-east-1b (10 c3.large, 10 c4.large)
  • 20 Spot Instances from us-east-1c (10 c3.large, 10 c4.large)

Availability Zone | Instance type | Spot Instances allocated | Spot price
us-east-1a        | c3.large      | 10                       | $0.0294
us-east-1a        | c4.large      | 10                       | $0.0308
us-east-1a        | c5.large      | 0                        | $0.0408
us-east-1b        | c3.large      | 10                       | $0.0294
us-east-1b        | c4.large      | 10                       | $0.0308
us-east-1b        | c5.large      | 0                        | $0.0387
us-east-1c        | c3.large      | 10                       | $0.0294
us-east-1c        | c4.large      | 10                       | $0.0331
us-east-1c        | c5.large      | 0                        | $0.0353

The cost for this Auto Scaling group is $1.83/hour. Because the Spot Instances are allocated purely by price, they are not optimized for capacity. The Auto Scaling group could experience a higher rate of interruptions if the lowest-priced Spot Instance pools are not as deep as others, because upon interruption it attempts to re-provision instances into those same lowest-priced pools.
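
As a quick sanity check on that hourly figure, here is a small Python sketch that multiplies each allocation from the table above by its Spot price and sums the result:

# Allocations and Spot prices from the table above: (instance count, hourly Spot price)
allocation = [
    (10, 0.0294), (10, 0.0308),  # us-east-1a: c3.large, c4.large
    (10, 0.0294), (10, 0.0308),  # us-east-1b: c3.large, c4.large
    (10, 0.0294), (10, 0.0331),  # us-east-1c: c3.large, c4.large
]

hourly_cost = sum(count * price for count, price in allocation)
print(f"${hourly_cost:.2f}/hour")  # prints $1.83/hour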

Capacity-optimized allocation strategy

There is a price associated with interruptions, restarting work, and checkpointing. While the overall hourly cost of the capacity-optimized allocation strategy might be slightly higher, the possibility of fewer interruptions can lower the overall cost of your workload.

The effectiveness of the capacity-optimized allocation strategy depends on following Spot best practices by being flexible and providing as many instance types and Availability Zones (Spot Instance pools) as possible in the configuration. It is also important to understand that as capacity demands change, the allocations provided by this strategy also change over time.

Remember that Spot pricing changes slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. The capacity-optimized strategy does account for pool capacity depth as it deploys Spot Instances, but it does not account for Spot prices.

As a result, the capacity-optimized allocation strategy is a good choice for workloads with a high cost of interruption, such as:

  • Big data and analytics
  • Image and media rendering
  • Machine learning
  • High performance computing

Example

The following example configuration shows how capacity could be allocated in an Auto Scaling group using the capacity-optimized allocation strategy:

{
  "AutoScalingGroupName": "runningAmazonEC2WorkloadsAtScale",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {
          "InstanceType": "c3.large"
        },
        {
          "InstanceType": "c4.large"
        },
        {
          "InstanceType": "c5.large"
        }
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  },
  "MinSize": 10,
  "MaxSize": 100,
  "DesiredCapacity": 60,
  "HealthCheckType": "EC2",
  "VPCZoneIdentifier": "subnet-a1234567890123456,subnet-b1234567890123456,subnet-c1234567890123456"
}

In this configuration, you request 60 Spot Instances because DesiredCapacity is set to 60 and OnDemandPercentageAboveBaseCapacity is set to 0. The example follows Spot best practices (which are especially critical when using the capacity-optimized allocation strategy) and is flexible across c3.large, c4.large, and c5.large in us-east-1a, us-east-1b, and us-east-1c (mapped according to the subnets in VPCZoneIdentifier). The Spot allocation strategy is set to capacity-optimized.
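
If you already have an Auto Scaling group using the lowest-price strategy, you do not need to recreate it to adopt capacity-optimized; the UpdateAutoScalingGroup API accepts a new MixedInstancesPolicy. A minimal Python (boto3) sketch, reusing the group and launch template names from the example configuration above:

import boto3

autoscaling = boto3.client("autoscaling")

# Switch the existing group's Spot allocation strategy to capacity-optimized.
# The launch template and overrides are restated; only the allocation strategy changes.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="runningAmazonEC2WorkloadsAtScale",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-launch-template",
                "Version": "$Latest"
            },
            "Overrides": [
                {"InstanceType": "c3.large"},
                {"InstanceType": "c4.large"},
                {"InstanceType": "c5.large"}
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    }
)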

First, EC2 Auto Scaling tries to make sure that the requested capacity is evenly balanced across all the Availability Zones provided in the request. To do so, it splits the target capacity request of 60 across the three zones. Then, the capacity-optimized allocation strategy directs the Spot Instance launches to the deepest pools by analyzing capacity metrics per instance type per zone; it optimizes for available capacity rather than for the lowest price (hence its name).

Using the example Spot prices shown in the following table, the resulting allocation is:

  • 20 Spot Instances from us-east-1a (20 c4.large)
  • 20 Spot Instances from us-east-1b (20 c3.large)
  • 20 Spot Instances from us-east-1c (20 c5.large)

Availability Zone | Instance type | Spot Instances allocated | Spot price
us-east-1a        | c3.large      | 0                        | $0.0294
us-east-1a        | c4.large      | 20                       | $0.0308
us-east-1a        | c5.large      | 0                        | $0.0408
us-east-1b        | c3.large      | 20                       | $0.0294
us-east-1b        | c4.large      | 0                        | $0.0308
us-east-1b        | c5.large      | 0                        | $0.0387
us-east-1c        | c3.large      | 0                        | $0.0294
us-east-1c        | c4.large      | 0                        | $0.0308
us-east-1c        | c5.large      | 20                       | $0.0353

The cost for this Auto Scaling group is $1.91/hour, only about 4% more than the lowest-price example above. However, notice that the distribution of the Spot Instances is different: the capacity-optimized allocation strategy determined this was the most efficient distribution from an available-capacity perspective.
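
The capacity-optimized strategy is also available in EC2 Fleet. Purely as an illustration, here is a Python (boto3) sketch of a maintain-type EC2 Fleet request for 60 Spot Instances using capacity-optimized; it reuses the launch template name from the walkthrough, and the exact request shape is an assumption rather than a configuration taken from this post:

import boto3

ec2 = boto3.client("ec2")

# A maintain-type EC2 Fleet request for 60 Spot Instances using the capacity-optimized strategy.
# Instance type flexibility comes from the overrides, mirroring the Auto Scaling example above.
ec2.create_fleet(
    Type="maintain",
    SpotOptions={"AllocationStrategy": "capacity-optimized"},
    TargetCapacitySpecification={
        "TotalTargetCapacity": 60,
        "DefaultTargetCapacityType": "spot"
    },
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-launch-template",
                "Version": "$Latest"
            },
            "Overrides": [
                {"InstanceType": "c3.large"},
                {"InstanceType": "c4.large"},
                {"InstanceType": "c5.large"}
            ]
        }
    ]
)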

Conclusion

Consider using the new capacity-optimized allocation strategy to make the most efficient use of spare capacity: it automatically deploys into the most available Spot Instance pools while still taking advantage of the steep discounts provided by Spot Instances.

This allocation strategy may be especially useful for workloads with a high cost of interruption, including:

  • Big data and analytics
  • Image and media rendering
  • Machine learning
  • High performance computing

No matter which allocation strategy you choose, you still enjoy the steep discounts provided by Spot Instances. These discounts are possible thanks to the stable Spot pricing made available with the new Spot pricing model.

Chad Schmutzer is a Principal Developer Advocate for the EC2 Spot team. Follow him on Twitter for the latest updates on saving at scale with Spot Instances, to provide feedback, or just to say hi.