Tag Archives: Amazon EC2

Introducing AWS Solutions: Expert architectures on demand

Post Syndicated from AWS Admin original https://aws.amazon.com/blogs/architecture/introducing-aws-solutions-expert-architectures-on-demand/

AWS Solutions Architects are on the front line of helping customers succeed using our technologies. Our team members leverage their deep knowledge of AWS technologies to build custom solutions that solve specific problems for clients. But many customers want to solve common technical problems that don’t require custom solutions, or they want a general solution they can use as a reference to build their own custom solution. For these customers, we offer AWS Solutions: vetted, technical reference implementations built by AWS Solutions Architects and AWS Partner Network partners. AWS Solutions are designed to help customers solve common business and technical problems, or they can be customized for specific use cases.

AWS Solutions are built to be operationally effective, performant, reliable, secure, and cost-effective; and incorporate architectural frameworks such as the Well-Architected Framework. Every AWS Solution comes with a detailed architecture diagram, a deployment guide, and instructions for both manual and automated deployment.

Here are some Solutions we are particularly excited about.

Media2Cloud

We released the Media2Cloud solution in January 2019. This solution helps customers migrate their existing video archives to the cloud. Media2Cloud sets up a serverless end-to-end workflow to ingest your videos and establish metadata, proxy videos, and image thumbnails.

Because it can be a challenging and slow process to migrate your existing video archives to the cloud, the Media2Cloud solution builds the following architecture.

Media2Cloud architecture

The solution leverages the Media Analysis Solution to analyze and extract valuable metadata from your video archives using Amazon Rekognition, Amazon Transcribe, and Amazon Comprehend.

The solution also includes a simple web interface that helps make it easier to get started ingesting your videos to the AWS Cloud. This solution is set up to integrate with AWS Partner Network partners to help customers migrate their video archives to the cloud.

AWS Instance Scheduler

In October 2018, we updated the AWS Instance Scheduler, a solution that enables customers to easily configure custom start and stop schedules for their Amazon EC2 and Amazon RDS instances.

When you deploy the solution’s template, the solution builds the following architecture.

AWS Instance Scheduler

 

For customers who leave all of their instances running at full utilization, this solution can result in up to 70% cost savings for those instances that are only necessary during regular business.

The Instance Scheduler solution gives you the flexibility to automatically manage multiple schedules as necessary, configure multiple start and stop schedules by either deploying multiple Instance Schedulers or modifying individual resource tags, and review Instance Scheduler metrics to better assess your Instance capacity and usage, and to calculate your cost savings.

AWS Connected Vehicle Solution

In January 2018, we updated the AWS Connected Vehicle Solution, a solution that provides secure vehicle connectivity to the AWS Cloud. This solution includes capabilities for local computing within vehicles, sophisticated event rules, and data processing and storage. The solution also allows you to implement a core framework for connected vehicle services that allows you to focus on developing new functionality rather than managing infrastructure.

When you deploy the solution’s template, the solution builds the following architecture.

Connected Vehicle solution

You can build upon this framework to address a variety of use cases such as voice interaction, navigation and other location-based services, remote vehicle diagnostics and health monitoring, predictive analytics and required maintenance alerts, media streaming services, vehicle safety and security services, head unit applications, and mobile applications.

These are just some of our current offerings. Other notable Solutions include AWS WAF Security Automations, Machine Learning for Telecommunication, and AWS Landing Zone. In the coming months, we plan to continue expanding our portfolio of AWS Solutions to address common business and technical problems that our customers face. Visit our homepage to keep up to date with the latest AWS Solutions.

Now Available – Five New Amazon EC2 Bare Metal Instances: M5, M5d, R5, R5d, and z1d

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-five-new-amazon-ec2-bare-metal-instances-m5-m5d-r5-r5d-and-z1d/

Today we are launching the five new EC2 bare metal instances that I promised you a few months ago. Your operating system runs on the underlying hardware and has direct access to the processor and other hardware. The instances are powered by AWS-custom Intel® Xeon® Scalable Processor (Skylake) processors that deliver sustained all-core Turbo performance.

Here are the specs:

Instance Name Sustained All-Core Turbo
Logical Processors Memory Local Storage EBS-Optimized Bandwidth Network Bandwidth
m5.metal Up to 3.1 GHz 96 384 GiB 14 Gbps 25 Gbps
m5d.metal Up to 3.1 GHz 96 384 GiB 4 x 900 GB NVMe SSD 14 Gbps 25 Gbps
r5.metal Up to 3.1 GHz 96 768 GiB 14 Gbps 25 Gbps
r5d.metal Up to 3.1 GHz 96 768 GiB 4 x 900 GB NVMe SSD 14 Gbps 25 Gbps
z1d.metal Up to 4.0 GHz 48 384 GiB 2 x 900 GB NVMe SSD 14 Gbps 25 Gbps

The M5 instances are designed for general-purpose workloads, such as web and application servers, gaming servers, caching fleets, and app development environments. The R5 instances are designed for high performance databases, web scale in-memory caches, mid-sized in-memory databases, real-time big data analytics, and other memory-intensive enterprise applications. The M5d and R5d variants also include 3.6 TB of local NVMe SSD storage.

z1d instances provide high compute performance and lots of memory, making them ideal for electronic design automation (EDA) and relational databases with high per-core licensing costs. The high CPU performance allows you to license fewer cores and significantly reduce your TCO for Oracle or SQL Server workloads.

All of the instances are powered by the AWS Nitro System, with dedicated hardware accelerators for EBS processing (including crypto operations), the software-defined network inside of each Virtual Private Cloud (VPC), ENA networking, and access to the local NVMe storage on the M5d, R5d, and z1d instances. Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

In addition to being a great home for old-school applications and system software that are licensed specifically and exclusively for use on physical, non-virtualized hardware, bare metal instances can be used to run tools and applications that require access to low-level processor features such as performance counters. For example, Mozilla’s Record and Replay Framework (rr) records and replays program execution with low overhead, using the performance counters to measure application performance and to deliver signals and context-switch events with high fidelity. You can read their paper, Engineering Record And Replay For Deployability, to learn more.

Launch One Today
m5.metal instances are available in the US East (N. Virginia and Ohio), US West (N. California and Oregon), Europe (Frankfurt, Ireland, London, Paris, and Stockholm), and Asia Pacific (Mumbai, Seoul, Singapore, Sydney, and Tokyo) AWS regions.

m5d.metal instances are available in the US East (N. Virginia and Ohio), US West (Oregon), Europe (Frankfurt, Ireland, Paris, and Stockholm), and Asia Pacific (Mumbai, Seoul, Singapore, and Sydney) AWS regions.

r5.metal instances are available in the US East (N. Virginia and Ohio), US West (N. California and Oregon), Europe (Frankfurt, Ireland, Paris, and Stockholm), Asia Pacific (Mumbai, Seoul, and Singapore), and AWS GovCloud (US-West) AWS regions.

r5d.metal instances are available in the US East (N. Virginia and Ohio), US West (N. California), Europe (Frankfurt, Paris, and Stockholm), Asia Pacific (Mumbai, Seoul, and Singapore), and AWS GovCloud (US-West) AWS regions.

z1d.metal instances are available in the US East (N. Virginia), US West (N. California and Oregon), Europe (Ireland), and Asia Pacific (Singapore and Tokyo) AWS regions.

The bare metal instances will become available in even more AWS regions as soon as possible.

Jeff;

 

Best Practices for Porting Applications to the EC2 A1 Instance Type

Post Syndicated from Martin Yip original https://aws.amazon.com/blogs/compute/best-practices-for-porting-applications-to-the-ec2-a1-instance-type/

This post courtesy of Dr. Jonathan Shapiro-Ward, AWS Solutions Architect

The new Amazon EC2 A1 instance types are powered by an ARM based AWS Graviton CPU. A1 instances are extremely cost effective and are ideal for scale-out scenarios where a large number of smaller instances are required.

Prior the launch of A1 instances, all AWS instances were x86 based. There are a number of significant differences between the two instruction sets. The predominant difference is that ARM follows a RISC (Reduced Instruction Set Computer) design, whereas x86 is a CISC (Complex Instruction Set Computer) architecture. In short, ARM is a comparatively simple architecture composed of simple instructions which execute within a single cycle. Meanwhile, x86 has mostly complex instructions that execute over multiple cycles. This key difference has a range of implications for compiler and hardware complexity, power efficiency, and performance, but the key question to application developers is portability.

Workloads built for x86 will not run on the A1 family of instances, they must be ported. In many cases this is trivial. An extensive range of Free and Open Source software supports ARM and requires no modification to port workloads. Aside from installing a different binary, the process for installing the Apache Web Server, Nginx, PostgreSQL, Docker, and many more applications is unchanged. Unfortunately, porting is not always so simple, especially for applications developed in house. Ideally, software is written to be portable, but this is not always the case. Even workloads written in languages designed for portability such as Java can prove a challenge. In this blog post, we’ll review common challenges and migration paths from porting from x86 based architectures to ARM.

A General Porting Strategy

  1. Check for Core Language and Platform Support. The vast majority of common languages such as Java, Python, Perl, Ruby, PHP, Go, Rust, and so forth support modern ARM architectures. Likewise, major frameworks such as Django, Spring, Hadoop, Apache Spark, and many more run on ARM. More niche languages and frameworks may not have as robust support. If your language or crucial framework is dependent on x86, you may not be able to run your application on the A1 instance type.
  2. Identify all third party libraries and dependencies. All non-trivial applications rely on third party libraries to provide essential functionality. These can range from a standard library, to open source libraries, to paid for proprietary libraries. Examine these libraries and determine if they support ARM. In the event that a library is dependent upon a specific architecture, search for an open issue around ARM support or inquire with the vendor as to the roadmap. If possible, investigate alternative libraries if ARM support is not forthcoming.
  3. Identify Porting Path. There are three common strategies for porting. The strategy that applies will depend upon the language being used.
    1. For interpreted and those compiled to bytecode, the first step is to translate runbooks, scripts, AMIs, and templates to install the ARM equivalent of the interpreter or language VM. For instance, installing the ARM version of the JVM or CPython. If installation is done via package manager, no change may be necessary. Subsequently it is necessary to ensure that any native code libraries are replaced with the ARM equivalent. For Java, this would entail swapping out libraries leveraging JNI. For Python, this would entail swapping out CPython C-API based libraries (such as numpy). Once again, if this is done via a package manager such as yum or pip, manual intervention should not be necessary.
    2. In the case of compiled languages such as C/C++, the application will have to be re-compiled for ARM. If your application utilizes any machine specific features or relies on behavior that varies between compilers, it may be necessary to re-write parts of your application. This is discussed in more detail below. If your application follows common standards such as ANSI C and avoids any machine specific dependencies, the majority of effort will be spent in modifying the build process. This will depend upon your build process and will likely involve modifying build configurations, makefiles, configure scripts, or other assets. The binary can either be compiled on an A1 instance or cross compiled. Once a binary has been produced, the deployment process should be adapted to target the A1 instance.
    3. For applications built using platform specific languages and frameworks, which have an ARM alternative, the application will have to be ported to this alternative. By far the most common example of this is the .NET Framework. In order to run on ARM, .NET applications must be ported to .NET Core or to Mono. This will likely entail a non-trivial modification to the application codebase. This is discussed in more detail below.
  4. Test! All tests must be ported over and, if there is insufficient test coverage, it may be necessary to write new tests. This is especially pertinent if you had to recompile your application. A strong test suite should identify any issues arising from machine specific code that behaves incorrectly on the A1 instance type. Perform the usual range of unit testing, acceptance testing, and pre-prod testing.
  5. Update your infrastructure as code resources, such as AWS CloudFormation templates, to provision your application to A1 instances. This will likely be a simple change, modifying the instance type, AMI, and user data to reflect the change to the A1 instance.
  6. Perform a Green/Blue Deployment. Create your new A1 based stack alongside your existing stack and leverage Route 53 weighted routing to route 10% of requests to the new stack. Monitor error rates, user behavior, load, and other critical factors in order to determine the health of the ported application in production. If the application behaves correctly, swap over all traffic to the new stack. Otherwise, reexamine the application and identify the root cause of any errors.

Porting C/C++ Applications to A1 Instances

C and C++ are very portable languages. Indeed, C runs on more architectures than any other language but that is not to say that all C will run on all systems. When porting applications to A1 instances, many of the challenges one might expect do not arise. The AWS Graviton chip is little endian, just like x86 and int, float, double, and other common types are the same size between architectures. This does not, however, guarantee portability. Most commonly, issues porting a C based application arise from aspects of the C standard which are dependent upon the architecture and implementation. Let’s briefly look at some examples of C that are not portable between architectures.

The most frequently discussed issue in porting C from x86 to ARM is the use of the character datatype. This issue arises from the C99 standard which requires the implementation to decide if the char datatype is signed or unsigned. On x86 Linux a char is signed by default. On ARM Linux a char is unsigned. This discrepancy is due to performance, with unsigned char types resulting in more efficient ARM assembly. This can, however, cause issues. Let’s examine the following code listing:

//Code Listing 1
#include <stdio.h>

int main(){
    char c = -1;
    if (c < 0){
        printf("The value of the char is less than 0\n");
    } else {
        printf("The value of the char is greater than 0\n");
    }
    return 0;
}

On an x86 instance (in this case a t3.large) the above code has the expected result – printing “The value of the char is less than 0”. On an A1 instance, it does not. There are mechanisms around this, for example gcc has the -fsigned-char flag which forces all char types to become signed upon compilation. Crucially, a developer must be aware of these types of issues ahead of time. Not all compilers and warn levels will provide appropriate warnings around char signedness (and indeed many other issues arising from architectural differences). Resultantly, without rigorous testing, it is possible to introduce unexpected errors by porting. This makes a comprehensive set of tests for your application an essential part of the porting process.

If your application has only ever been built for a single target environment, (e.g. x86 Linux with gcc) there are potentially unexpected behaviors which will emerge when that application is built for a different architecture or with a different compiler.

The key best practice for ensuring portability of your C applications (and other compiled languages) is to adhere to a standard. Vanilla C99 will ensure the broadest compatibility across architectures and operating systems. A compiler specific standard such as gnu99 will ensure compatibility but can tether you to one compiler.

Static analysis should always be used when building your applications. Static analysis helps to detect bugs, security flaws, and pertinent to our discussion: compatibility issues.

Porting .NET Applications to A1 Instances

Porting a .NET application from a Windows instance to an A1 Linux instance can yield significant cost reductions and is an effective way to economically scale out web apps and other parallelizable workloads.

For all intents and purposes, the .NET Framework only runs on x86 Windows. This limitation does not necessarily prevent running your .NET application on an A1 instance. There are two .NET implementations, .NET Core and .NET Framework. The .NET Framework is the modern evolution of the original .NET release and is tightly coupled to x86 Windows. Meanwhile, .NET Core is a recent open source project, developed by Microsoft, which is decoupled from Windows and runs on a variety of platforms, including ARM Linux. There is one final option, Mono – an open source implementation of the .NET Framework which runs on a variety of architectures.

For greenfield projects .NET Core has become the de facto option as it has a number of advantages over the alternatives. Projects based on .NET Core are cross platform and significantly lighter than .Net Framework and Mono projects – making them far better suited to developing microservices and to running in containers or serverless environments. From version 2.1 onward, .NET Core supports ARM.

For existing projects, .NET Core is the best migration path to containers and to running on A1 instances. There are, however, a number of factors that might prohibit replatforming to .NET Core. These include

• Dependency on Windows specific APIs
• Reliance on features that are only available in .NET Framework such as WPF or Windows Forms. Many of these features are coming to .NET Core as part of .NET Core 3, but at time or writing, this is in preview.
• Use of third-party libraries that do not support .NET Core.
• Use of an unsupported language Currently, .NET Core only supports C#, F#, Visual Studio.

There are a number of tools to help port from .NET Framework to .NET Core. These include the .Net portability analyzer, which will analyze a .NET codebase and determine any factors that might prohibit porting.

Conclusion

The A1 instances can deliver significant cost savings over other instances types and are ideal for scale-out applications such as microservices. In many cases, moving to A1 instances can be easy. Many languages, frameworks, and applications have strong support for ARM. For Python, Java, Ruby, and other open source languages, porting can be trivial. For other applications such as native binary applications or .net applications, there can be some challenges. By examining your application and determining what, if any, x86 dependencies exist you can devise a migration strategy which will enable you to make use of A1 instances.

AWS Ops Automator v2 features vertical scaling (Preview)

Post Syndicated from AWS Admin original https://aws.amazon.com/blogs/architecture/aws-ops-automator-v2-features-vertical-scaling-preview/

The new version of the AWS Ops Automator, a solution that enables you to automatically manage your AWS resources, features vertical scaling for Amazon EC2 instances. With vertical scaling, the solution automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost. The solution can resize your instances by restarting your existing instance with a new size. Or, the solution can resize your instances by replacing your existing instance with a new, resized instance.

With this update, the AWS Ops Automator can help make setting up vertical scaling easier. All you have to do is define the time-based or event-based trigger that determines when the solution scales your instances, and choose whether you want to change the size of your existing instances or replace your instances with new, resized instances. The time-based or event-based trigger invokes the AWS Lambda to scale your instances.

Ops Automator Vertical Scaling

Restarting with a new size

When you choose to resize your instances by restarting the instance with a new size, the solution increases or decreases the size of your existing instances in response to changes in demand or at a specified point in time. The solution automatically changes the instance size to the next defined size up or down.

Replacing with a new, resized instance

Alternatively, you can choose to have the Ops Automator replace your instance with a new, resized instance instead of restarting your existing instance. When the solution determines that your instances need to be scaled, the solution launches new instances with the next defined instance size up or down. The solution is also integrated with Elastic Load Balancing to automatically register the new instance with load Balancers.

Getting Started

To learn more, visit the solution webpage and request access to the private preview.

Learn about hourly-replication in Server Migration Service and the ability to migrate large data volumes

Post Syndicated from Martin Yip original https://aws.amazon.com/blogs/compute/learn-about-hourly-replication-in-server-migration-service-and-the-ability-to-migrate-large-data-volumes/

This post courtesy of Shane Baldacchino, AWS Solutions Architect

AWS Server Migration Service (AWS SMS) is an agentless service that makes it easier and faster for you to migrate thousands of on-premises workloads to AWS. AWS SMS allows you to automate, schedule, and track incremental replications of live server volumes, making it easier for you to coordinate large-scale server migrations.

In my previous blog posts, we introduced how you can use AWS Server Migration Service (AWS SMS) to migrate a popular commercial off the shelf software, WordPress into AWS.

For details and a walkthrough on how setup the AWS Server Migration Service, please see the following blog posts for Hyper-V and VMware hypervisors which will guide you through the high level process.

In this article we are going to step it up a few notches and look past common the migration of off-the-shelf software and provide you a pattern on how you can use AWS SMS and some of the recently launched features to migrate a more complicated environment, especially compression and resiliency for replication jobs and the support for data volumes greater than 4TB.

This post covers a migration of a complex internally developed eCommerce system comprising of a polyglot architecture. It is made up a Windows Microsoft IIS presentation tier, Tomcat application tier, and Microsoft SQL Server database tier. All workloads run on-premises as virtual machines in a VMware vCenter 5.5 and ESX 5.5 environment.

This theoretical customer environment has various business and infrastructure requirements.

Application downtime: During any migration activities, the application cannot be offline for more than 2 hours
Licensing: The customer has renewed their Microsoft SQL Server license for an additional 3 years and holds License Mobility with Software Assurance option for Microsoft SQL Server and therefore wants to take advantage of AWS BYOL licensing for Microsoft SQL server and Microsoft Windows Server.
Large data volumes: The Microsoft SQL Server database engine (.mdf, .ldf and .ndf files) consumes 11TB of storage

Walkthrough

Key elements of this migration process are identical to the process outlined in my previous blog posts and for more information on this process, please see the following blog posts Hyper-V and VMware hypervisors, but a high level you will need to.

• Establish your AWS environment.
• Download the SMS Connector from the AWS Management Console.
• Configure AWS SMS and Hypervisor permissions.
• Install and configure the SMS Connector appliance.
• Import your virtual machine inventory and create a replication jobs
• Launch your Amazon EC2 instances and associated NACL’s, Security Groups and AWS Elastic Load Balancers
• Change your DNS records to resolve the custom application to an AWS Elastic Load Balancer.

Before you start, ensure that your source systems OS and vCenter version are supported by AWS. For more information, see the Server Migration Service FAQ.

Planning the Migration

Once you have downloaded and configured the AWS SMS connector with your given Hypervisor you can get started in creating replication jobs.

The artifacts derived from our replication jobs with AWS SMS will be AMI’s (Amazon Machine Images) and as such we do not need to replicate each server individually and that is because we have a three-tier architecture that has commonality between servers with multiple Application and Web servers performing the same function, and as such we can leverage a common AMI and create three replication jobs.

1. Microsoft SQL Server – Database Tier
2. Ubuntu Server – Application Tier
3. IIS Web server – Webserver Tier

Performing the Replication

After validating that the SMS Connector is in a “HEALTHY” state, import your server catalog from your Hypervisor to AWS SMS. This process can take up to a minute.

Select the three servers (Microsoft SQL Server, Ubuntu Server, IIS Web server) to migrate and choose Create replication job. AWS SMS now supports creating replications jobs with frequencies as short as 1 hour, and as such to ensure our business RTO (Recovery Time Objective) of 2 hours is met we will create our replication jobs with a frequency of 1 hour. This will minimize the risk of any delta updates during the cutover windows not completing.

Given the businesses existing licensing investment in Microsoft SQL Server, they will leverage these the BYOL (Bring Your Own License) offering when creating the Microsoft SQL Server replication job.

The AWS SMS console guides you through the process. The time that the initial replication task takes to complete is dependent on available bandwidth and the size of your virtual machines.

After the initial seed replication, network bandwidth requirement is minimized as AWS SMS replicates only incremental changes occurring on the VM.

The progress updates from AWS SMS are automatically sent to AWS Migration Hub so that you can track tasks in progress.

AWS Migration Hub provides a single location to track the progress of application migrations across multiple AWS and partner solutions. In this post, we are using AWS SMS as a mechanism to migrate the virtual machines (VMs) and track them via AWS Migration Hub.

Migration Hub and AWS SMS are both free. You pay only for the cost of the individual migration tools that you use, and any resources being consumed on AWS

The dashboard reflects any status changes that occur in the linked services. You can see from the following image that two servers are complete whilst another is in progress.

Using Migration Hub, you can view the migration progress of all applications. This allows you to quickly get progress updates across all of your migrations, easily identify and troubleshoot any issues, and reduce the overall time and effort spent on your migration projects.

After validating that the SMS Connector

Testing Your Replicated Instances

Thirty hours after creating the replication jobs, notification was received via AWS SNS (Simple Notification Service) that all 3 replication jobs have completed. During the 30-hour replication window the customers ISP experienced downtime and sporadic flapping of the link, but this was negated by the network auto-recovery feature of SMS. It recovered and resumed replication without any intervention.

With the replication tasks being complete. The artifact created by AWS SMS is a custom AMI that you can use to deploy an EC2 instance. Follow the usual process to launch your EC2 instance, noting that you may need to replace any host-based firewalls with security groups and NACLs and any hardware based load balancers with Elastic Load Balancing to achieve fault tolerance, scalability, performance and security.

As this environment is a 3-tier architecture with commonality been tiers (Application and Presentation Tier) we can create during the EC2 Launch process an ASG (Auto Scaling Group) to ensure that deployed capacity matches user demand. The ASG will be based on the custom AMI’s generated by the replication jobs.

When you create an EC2 instance, ensure that you pick the most suitable EC2 instance type and size to match your performance and cost requirements.

While your new EC2 instances are a replica of your on-premises VM, you should always validate that applications are functioning. How you do this differs on an application-by-application basis. You can use a combination of approaches, such as editing a local host file and testing your application, SSH, RDP and Telnet.

For our Windows Presentation and database tier, I can RDP in to my systems and validate IIS 8.0 and other services are functioning correctly.

For our Ubuntu Application tier, we can SSH in to perform validation.

Post validation of each individual server we can now continue to test the application end to end. This is because our systems have been instantiated inside a VPC with no route back to our on-premises environment which allows us to test functionality without the risk of communication back to our production application.

After validation of systems it is now time to cut over, plan your runbook accordingly to ensure you either eliminate or minimize application disruption.

Cutting Over

As the replication window specified in AWS SMS replication jobs was 1 hour, there were hourly AMI’s created that provide delta updates since the initial seed replication was performed. The customer verified the stack by executing the previously created runbook using the latest AMIs, and verified the application behaved as expected.

After another round of testing, the customer decided to plan the cutover on the coming Saturday at midnight, by announcing a two-hour scheduled maintenance window. During the cutover window, the customer took the application offline, shutdown Microsoft SQL Server instance and performed an on-demand sync of all systems.

This generate a new versioned AMI that contained all on-premise data. The customer then executed the runbook on the new AMI’s. For the application and presentation tier these AMI’s were used in the ASG configuration. After application validation Amazon Route 53 was updated to resolve the application CNAME to the Application Load Balancer CNAME used to load balance traffic to the fleet of IIS servers.

Based on the TTL (Time To Live) of your Amazon Route 53 DNS zone file, end users slowly resolve the application to AWS, in this case within 300 seconds. Once this TTL period had elapsed the customer brought their application back online and exited their maintenance window, with time to spare.

After modifying the Amazon Route 53 Zone Apex, the physical topology now looks as follows with traffic being routed to AWS.

After validation of a successful migration the customer deleted their AWS Server Migration Service replication jobs and began planning to decommission their on-premises resources.

Summary

This is an example pattern on migrate a complex custom polyglot environment in to AWS using AWS migration services, specifically leveraging many of the new features of the AWS SMS service.

Many architectures can be extended to use many of the inherent benefits of AWS, with little effort. For example this article illustrated how AWS Migration Services can be used to migrate complex environments in to AWS and then use native AWS services such as Amazon CloudWatch metrics to drive Auto Scaling policies to ensure deployed capacity matches user demand whilst technologies such as Application Load Balancers can be used to achieve fault tolerance and scalability

Think big and get building!

 

 

Handling AWS Chargebacks for Enterprise Customers

Post Syndicated from Varad Ram original https://aws.amazon.com/blogs/architecture/handling-aws-chargebacks-for-enterprise-customers/

As AWS product portfolios and feature sets grow, as an enterprise customer, you are likely to migrate your existing workloads and innovate your new products on AWS. To help you keep your cloud charges simple, you can use consolidated billing. This can, however, create complexity for your internal chargebacks, especially if some of your resources and services are not tagged correctly. To help your individual teams and business units normalize and reduce their costs as your AWS implementation grows, you can implement chargebacks transparently and automate billing.

This blog post includes a walkthrough of an end-to-end mechanism that you can use to automate your consolidated billing charges for either your existing AWS accounts, or for newly created accounts.

Walkthrough

Prerequisites for implementation:

  • One account that is the payer account, which consolidates billing and links all other accounts (including admin accounts)
  • An understanding of billing, Detailed Billing Report (DBR), Cost and Usage Report (CUR), and blended and unblended costs
  • Activate propagation of necessary cost allocation tags to consolidated billing
  • Access to reservations across the linked accounts
  • Read permission on the source bucket and write permission to the transformed bucket
  • An automated method (such as database access or an API) to verify the cost centers tagged to AWS resources
  • Permissions to get access to the services described in this solution on the account targeted for this automation

Before you begin, it is important to understand the blended costs and unblended costs in consolidated billing. Blended costs are calculated based on the blended rate (the average rates for the reserved and on-demand instances that are used by your member accounts) for each service your accounts used, multiplied by the account usage of those services. Unblended costs are the charges for those services broken out for each linked account.

Based on your organization’s strategy for savings (centralized or not), you could consider either the blended or unblended costs. The consolidated billing files that include the information for the chargeback are the Detailed Billing Report (DBR) and Cost and Usage Report (CUR). Both of these reports provide both the blended and unblended rates as separate columns.

To help you create and maintain your AWS accounts, you can use AWS Account Vending Machine (AVM). You can launch AVM from either the AWS Landing Zone or with a custom solution. AVM keeps all your account information in a DynamoDB table (such as the account number, root mail ID, default cost center, name of the owner, etc.) and maintains reservation-related data (such as invoice ID, instance type, region, amount, cost center, etc.) in another table. To enable your account administrator to add invoice details for all your reservations, you can use a web page hosted on AWS Lambda, Amazon Simple Storage Service (Amazon S3), or a web server.

To begin the process of billing transformation, you must add a trigger on an S3 bucket (which contains raw AWS billing files) that pushes messages (PutObject) into Amazon Simple Queue Service (SQS) and your billing transformation program (written in Python, Nodejs, Java, .net, etc. using AWS SDK) that runs on an Amazon Elastic Compute Cloud (Amazon EC2) instance, containers, or Lambda (if the bill can be processed within 15 minutes with file size restrictions).

The billing transformation program must do the following:

  • Cache the Account details and reservation DynamoDB tables
  • Verify if there are any messages in SQS
  • Ignore if the file is not a DBR or CUR file (process either of them, not both)
  • Download the file, unzip, and read row-by-row; for a DBR file, consider only the “LineItem” RecordType
  • Add two new columns: Bill_CostCenter and Bill_Notes
    • If there is a valid value in the CostCenter tag (verified with internal automation processes), add the same value to the Bill_CostCenter column and any notes to the Bill_Notes column
    • If the CostCenter is invalid, get the default Cost Center from the cached account details and add the information to the Bill_CostCenter and Bill_Notes columns
    • If the row is a reservation invoice, the cost center information comes from the reservation table and is added to the correct column
  • Cache consolidation of cost centers with the blended or unblended cost of each row
  • Write each of these processed line items into a new file
  • Handle exceptions by the normal organization practices (for example, email the owner of the cost center or the finance team)
  • Push the new file into the transformed Amazon S3 bucket
  • Write the consolidated lines into a different file and upload to Transformed Amazon S3 bucket
Figure 1 – Architecture of processing a billing chargeback

Figure 1 – Architecture of processing a billing chargeback

 

Figure 2 – Validating the Cost Center process

After you have the consolidated billing file aggregated by cost center, you can easily see and handle your internal chargebacks. To further simplify your chargeback model, you can get help from AWS Technical Account Managers and Billing Concierge, if your organization would like AWS to provide custom invoices from the consolidated billing file.

Because the cost centers in your organization can expire over time, it’s important validate them frequently with automation, such as a Lambda program.

Improvements

If your organization has a more complex chargeback structure, you can extend the logic described above to support deeper and broader chargeback codes, or implement hierarchical chargeback structure.

You can also extend the transformation logic to support several chargeback codes (such as comma separated or with additional tags) if you have multiple teams or project that want to share a resource.

Summary

As enterprise organizations grow and consume more cloud services, the cost optimization process grows and evolves with them. Sophisticated chargeback models enable the teams and business units in the organization to be accountable and contribute to take the steps necessary to normalize the usage and costs of AWS services.

About the Author

Varad RamVarad Ram likes to help customers adopt to cloud technologies and he is particularly interested in Artificial Intelligence. He believes Deep Learning will power future technology growth. In his spare time, his daughter and toddler son keep him busy biking and hiking.

Podcast #289: A Look at Amazon FSx For Windows File Server

Post Syndicated from Simon Elisha original https://aws.amazon.com/blogs/aws/podcast-amazon-fsx-for-windows-file-server/

In this episode, Simon speaks with Andrew Crudge (Senior Product Manager, FSx) about this newly released service, capabilities available to customers and how to make the best use of it in your environment.

Additional Resources

About the AWS Podcast

The AWS Podcast is a cloud platform podcast for developers, dev ops, and cloud professionals seeking the latest news and trends in storage, security, infrastructure, serverless, and more. Join Simon Elisha and Jeff Barr for regular updates, deep dives and interviews. Whether you’re building machine learning and AI models, open source projects, or hybrid cloud solutions, the AWS Podcast has something for you. Subscribe with one of the following:

Like the Podcast?

Rate us on iTunes and send your suggestions, show ideas, and comments to [email protected]. We want to hear from you!

Western Digital HDD Simulation at Cloud Scale – 2.5 Million HPC Tasks, 40K EC2 Spot Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/western-digital-hdd-simulation-at-cloud-scale-2-5-million-hpc-tasks-40k-ec2-spot-instances/

Earlier this month my colleague Bala Thekkedath published a story about Extreme Scale HPC and talked about how AWS customer Western Digital built a cloud-scale HPC cluster on AWS and used it to simulate crucial elements of upcoming head designs for their next-generation hard disk drives (HDD).

The simulation described in the story encompassed a little over 2.5 million tasks, and ran to completion in just 8 hours on a million-vCPU Amazon EC2 cluster. As Bala shared in his story, much of the simulation work at Western Digital revolves around the need to evaluate different combinations of technologies and solutions that comprise an HDD. The engineers focus on cramming ever-more data into the same space, improving storage capacity and increasing transfer speed in the process. Simulating millions of combinations of materials, energy levels, and rotational speeds allows them to pursue the highest density and the fastest read-write times. Getting the results more quickly allows them to make better decisions and lets them get new products to market more rapidly than before.

Here’s a visualization of Western Digital’s energy-assisted recording process in action. The top stripe represents the magnetism; the middle one represents the added energy (heat); and the bottom one represents the actual data written to the medium via the combination of magnetism and heat:

I recently spoke to my colleagues and to the teams at Western Digital and Univa who worked together to make this record-breaking run a reality. My goal was to find out more about how they prepared for this run, see what they learned, and to share it with you in case you are ready to run a large-scale job of your own.

Ramping Up
About two years ago, the Western Digital team was running clusters as big as 80K vCPUs, powered by EC2 Spot Instances in order to be as cost-effective as possible. They had grown to the 80K vCPU level after repeated, successful runs with 8K, 16K, and 32K vCPUs. After these early successes, they decided to shoot for the moon, push the boundaries, and work toward a one million vCPU run. They knew that this would stress and tax their existing tools, and settled on a find/fix/scale-some-more methodology.

Univa’s Grid Engine is a batch scheduler. It is responsible for keeping track of the available compute resources (EC2 instances) and dispatching work to the instances as quickly and efficiently as possible. The goal is to get the job done in the smallest amount of time and at the lowest cost. Univa’s Navops Launch supports container-based computing and also played a critical role in this run by allowing the same containers to be used for Grid Engine and AWS Batch.

One interesting scaling challenge arose when 50K hosts created concurrent connections to the Grid Engine scheduler. Once running, the scheduler can dispatch up to 3000 tasks per second, with an extra burst in the (relatively rare) case that an instance terminates unexpectedly and signals the need to reschedule 64 or more tasks as quickly as possible. The team also found that referencing worker instances by IP addresses allowed them to sidestep some internal (AWS) rate limits on the number of DNS lookups per Elastic Network Interface.

The entire simulation is packed in a Docker container for ease of use. When newly launched instances come online they register their specs (instance type, IP address, vCPU count, memory, and so forth) in an ElastiCache for Redis cluster. Grid Engine uses this data to find and manage instances; this is more efficient and scalable than calling DescribeInstances continually.

The simulation tasks read and write data from Amazon Simple Storage Service (S3), taking advantage of S3’s ability to store vast amounts of data and to handle any conceivable request rate.

Inside a Simulation Task
Each potential head design is described by a collection of parameters; the overall simulation run consists of an exploration of this parameter space. The results of the run help the designers to find designs that are buildable, reliable, and manufacturable. This particular run focused on modeling write operations.

Each simulation task ran for 2 to 3 hours, depending on the EC2 instance type. In order to avoid losing work if a Spot Instance is about to be terminated, the tasks checkpoint themselves to S3 every 15 minutes, with a bit of extra logic to cover the important case where the job finishes after the termination signal but before the actual shutdown.

Making the Run
After just 6 weeks of planning and prep (including multiple large-scale AWS Batch runs to generate the input files), the combined Western Digital / Univa / AWS team was ready to make the full-scale run. They used an AWS CloudFormation template to start Grid Engine and launch the cluster. Due to the Redis-based tracking that I described earlier, they were able to start dispatching tasks to instances as soon as they became available. The cluster grew to one million vCPUs in 1 hour and 32 minutes and ran full-bore for 6 hours:

When there were no more undispatched tasks available, Grid Engine began to shut the instances down, reaching the zero-instance point in about an hour. During the run, Grid Engine was able to keep the instances fully supplied with work over 99% of the time. The run used a combination of C3, C4, M4, R3, R4, and M5 instances. Here’s the overall breakdown over the course of the run:

The job spanned all six Availability Zones in the US East (N. Virginia) Region. Spot bids were placed at the On-Demand price. Over the course of the run, about 1.5% of the instances in the fleet were terminated and automatically replaced; the vast majority of the instances stayed running for the entire time.

And That’s That
This job ran 8 hours and cost $137,307 ($17,164 per hour). The folks I talked to estimated that this was about half the cost of making the run on an in-house cluster, if they had one of that size!

Evaluating the success of the run, Steve Phillpott (CIO of Western Digital) told us:

“Storage technology is amazingly complex and we’re constantly pushing the limits of physics and engineering to deliver next-generation capacities and technical innovation. This successful collaboration with AWS shows the extreme scale, power and agility of cloud-based HPC to help us run complex simulations for future storage architecture analysis and materials science explorations. Using AWS to easily shrink simulation time from 20 days to 8 hours allows Western Digital R&D teams to explore new designs and innovations at a pace un-imaginable just a short time ago.”

The Western Digital team behind this one is hiring an R&D Engineering Technologist; they also have many other open positions!

A Run for You
If you want to do a run on the order of 100K to 1M cores (or more), our HPC team is ready to help, as are our friends at Univa. To get started, Contact HPC Sales!

Jeff;

Optimizing a Lift-and-Shift for Security

Post Syndicated from Jonathan Shapiro-Ward original https://aws.amazon.com/blogs/architecture/optimizing-a-lift-and-shift-for-security/

This is the third and final blog within a three-part series that examines how to optimize lift-and-shift workloads. A lift-and-shift is a common approach for migrating to AWS, whereby you move a workload from on-prem with little or no modification. This third blog examines how lift-and-shift workloads can benefit from an improved security posture with no modification to the application codebase. (Read about optimizing a lift-and-shift for performance and for cost effectiveness.)

Moving to AWS can help to strengthen your security posture by eliminating many of the risks present in on-premise deployments. It is still essential to consider how to best use AWS security controls and mechanisms to ensure the security of your workload. Security can often be a significant concern in lift-and-shift workloads, especially for legacy workloads where modern encryption and security features may not present. By making use of AWS security features you can significantly improve the security posture of a lift-and-shift workload, even if it lacks native support for modern security best practices.

Adding TLS with Application Load Balancers

Legacy applications are often the subject of a lift-and-shift. Such migrations can help reduce risks by moving away from out of date hardware but security risks are often harder to manage. Many legacy applications leverage HTTP or other plaintext protocols that are vulnerable to all manner of attacks. Often, modifying a legacy application’s codebase to implement TLS is untenable, necessitating other options.

One comparatively simple approach is to leverage an Application Load Balancer or a Classic Load Balancer to provide SSL offloading. In this scenario, the load balancer would be exposed to users, while the application servers that only support plaintext protocols will reside within a subnet which is can only be accessed by the load balancer. The load balancer would perform the decryption of all traffic destined for the application instance, forwarding the plaintext traffic to the instances. This allows  you to use encryption on traffic between the client and the load balancer, leaving only internal communication between the load balancer and the application in plaintext. Often this approach is sufficient to meet security requirements, however, in more stringent scenarios it is never acceptable for traffic to be transmitted in plaintext, even if within a secured subnet. In this scenario, a sidecar can be used to eliminate plaintext traffic ever traversing the network.

Improving Security and Configuration Management with Sidecars

One approach to providing encryption to legacy applications is to leverage what’s often termed the “sidecar pattern.” The sidecar pattern entails a second process acting as a proxy to the legacy application. The legacy application only exposes its services via the local loopback adapter and is thus accessible only to the sidecar. In turn the sidecar acts as an encrypted proxy, exposing the legacy application’s API to external consumers via TLS. As unencrypted traffic between the sidecar and the legacy application traverses the loopback adapter, it never traverses the network. This approach can help add encryption (or stronger encryption) to legacy applications when it’s not feasible to modify the original codebase. A common approach to implanting sidecars is through container groups such as pod in EKS or a task in ECS.

Implementing the Sidecar Pattern With Containers

Figure 1: Implementing the Sidecar Pattern With Containers

Another use of the sidecar pattern is to help legacy applications leverage modern cloud services. A common example of this is using a sidecar to manage files pertaining to the legacy application. This could entail a number of options including:

  • Having the sidecar dynamically modify the configuration for a legacy application based upon some external factor, such as the output of Lambda function, SNS event or DynamoDB write.
  • Having the sidecar write application state to a cache or database. Often applications will write state to the local disk. This can be problematic for autoscaling or disaster recovery, whereby having the state easily accessible to other instances is advantages. To facilitate this, the sidecar can write state to Amazon S3, Amazon DynamoDB, Amazon Elasticache or Amazon RDS.

A sidecar requires customer development, but it doesn’t require any modification of the lift-and-shifted application. A sidecar treats the application as a blackbox and interacts with it via its API, configuration file, or other standard mechanism.

Automating Security

A lift-and-shift can achieve a significantly stronger security posture by incorporating elements of DevSecOps. DevSecOps is a philosophy that argues that everyone is responsible for security and advocates for automation all parts of the security process. AWS has a number of services which can help implement a DevSecOps strategy. These services include:

  • Amazon GuardDuty: a continuous monitoring system which analyzes AWS CloudTrail Events, Amazon VPC Flow Log and DNS Logs. GuardDuty can detect threats and trigger an automated response.
  • AWS Shield: a managed DDOS protection services
  • AWS WAF: a managed Web Application Firewall
  • AWS Config: a service for assessing, tracking, and auditing changes to AWS configuration

These services can help detect security problems and implement a response in real time, achieving a significantly strong posture than traditional security strategies. You can build a DevSecOps strategy around a lift-and-shift workload using these services, without having to modify the lift-and-shift application.

Conclusion

There are many opportunities for taking advantage of AWS services and features to improve a lift-and-shift workload. Without any alteration to the application you can strengthen your security posture by utilizing AWS security services and by making small environmental and architectural changes that can help alleviate the challenges of legacy workloads.

About the author

Dr. Jonathan Shapiro-Ward is an AWS Solutions Architect based in Toronto. He helps customers across Canada to transform their businesses and build industry leading cloud solutions. He has a background in distributed systems and big data and holds a PhD from the University of St Andrews.

Optimizing a Lift-and-Shift for Cost Effectiveness and Ease of Management

Post Syndicated from Jonathan Shapiro-Ward original https://aws.amazon.com/blogs/architecture/optimizing-a-lift-and-shift-for-cost/

Lift-and-shift is the process of migrating a workload from on premise to AWS with little or no modification. A lift-and-shift is a common route for enterprises to move to the cloud, and can be a transitionary state to a more cloud native approach. This is the second blog post in a three-part series which investigates how to optimize a lift-and-shift workload. The first post is about performance.

A key concern that many customers have with a lift-and-shift is cost. If you move an application as is  from on-prem to AWS, is there any possibility for meaningful cost savings? By employing AWS services, in lieu of self-managed EC2 instances, and by leveraging cloud capability such as auto scaling, there is potential for significant cost savings. In this blog post, we will discuss a number of AWS services and solutions that you can leverage with minimal or no change to your application codebase in order to significantly reduce management costs and overall Total Cost of Ownership (TCO).

Automate

Even if you can’t modify your application, you can change the way you deploy your application. The adopting-an-infrastructure-as-code approach can vastly improve the ease of management of your application, thereby reducing cost. By templating your application through Amazon CloudFormation, Amazon OpsWorks, or Open Source tools you can make deploying and managing your workloads a simple and repeatable process.

As part of the lift-and-shift process, rationalizing the workload into a set of templates enables less time to spent in the future deploying and modifying the workload. It enables the easy creation of dev/test environments, facilitates blue-green testing, opens up options for DR, and gives the option to roll back in the event of error. Automation is the single step which is most conductive to improving ease of management.

Reserved Instances and Spot Instances

A first initial consideration around cost should be the purchasing model for any EC2 instances. Reserved Instances (RIs) represent a 1-year or 3-year commitment to EC2 instances and can enable up to 75% cost reduction (over on demand) for steady state EC2 workloads. They are ideal for 24/7 workloads that must be continually in operation. An application requires no modification to make use of RIs.

An alternative purchasing model is EC2 spot. Spot instances offer unused capacity available at a significant discount – up to 90%. Spot instances receive a two-minute warning when the capacity is required back by EC2 and can be suspended and resumed. Workloads which are architected for batch runs – such as analytics and big data workloads – often require little or no modification to make use of spot instances. Other burstable workloads such as web apps may require some modification around how they are deployed.

A final alternative is on-demand. For workloads that are not running in perpetuity, on-demand is ideal. Workloads can be deployed, used for as long as required, and then terminated. By leveraging some simple automation (such as AWS Lambda and CloudWatch alarms), you can schedule workloads to start and stop at the open and close of business (or at other meaningful intervals). This typically requires no modification to the application itself. For workloads that are not 24/7 steady state, this can provide greater cost effectiveness compared to RIs and more certainty and ease of use when compared to spot.

Amazon FSx for Windows File Server

Amazon FSx for Windows File Server provides a fully managed Windows filesystem that has full compatibility with SMB and DFS and full AD integration. Amazon FSx is an ideal choice for lift-and-shift architectures as it requires no modification to the application codebase in order to enable compatibility. Windows based applications can continue to leverage standard, Windows-native protocols to access storage with Amazon FSx. It enables users to avoid having to deploy and manage their own fileservers – eliminating the need for patching, automating, and managing EC2 instances. Moreover, it’s easy to scale and minimize costs, since Amazon FSx offers a pay-as-you-go pricing model.

Amazon EFS

Amazon Elastic File System (EFS) provides high performance, highly available multi-attach storage via NFS. EFS offers a drop-in replacement for existing NFS deployments. This is ideal for a range of Linux and Unix usecases as well as cross-platform solutions such as Enterprise Java applications. EFS eliminates the need to manage NFS infrastructure and simplifies storage concerns. Moreover, EFS provides high availability out of the box, which helps to reduce single points of failure and avoids the need to manually configure storage replication. Much like Amazon FSx, EFS enables customers to realize cost improvements by moving to a pay-as-you-go pricing model and requires a modification of the application.

Amazon MQ

Amazon MQ is a managed message broker service that provides compatibility with JMS, AMQP, MQTT, OpenWire, and STOMP. These are amongst the most extensively used middleware and messaging protocols and are a key foundation of enterprise applications. Rather than having to manually maintain a message broker, Amazon MQ provides a performant, highly available managed message broker service that is compatible with existing applications.

To use Amazon MQ without any modification, you can adapt applications that leverage a standard messaging protocol. In most cases, all you need to do is update the application’s MQ endpoint in its configuration. Subsequently, the Amazon MQ service handles the heavy lifting of operating a message broker, configuring HA, fault detection, failure recovery, software updates, and so forth. This offers a simple option for reducing management overhead and improving the reliability of a lift-and-shift architecture. What’s more is that applications can migrate to Amazon MQ without the need for any downtime, making this an easy and effective way to improve a lift-and-shift.

You can also use Amazon MQ to integrate legacy applications with modern serverless applications. Lambda functions can subscribe to MQ topics and trigger serverless workflows, enabling compatibility between legacy and new workloads.

Integrating Lift-and-Shift Workloads with Lambda via Amazon MQ

Figure 1: Integrating Lift-and-Shift Workloads with Lambda via Amazon MQ

Amazon Managed Streaming Kafka

Lift-and-shift workloads which include a streaming data component are often built around Apache Kafka. There is a certain amount of complexity involved in operating a Kafka cluster which incurs management and operational expense. Amazon Kinesis is a managed alternative to Apache Kafka, but it is not a drop-in replacement. At re:Invent 2018, we announced the launch of Amazon Managed Streaming Kafka (MSK) in public preview. MSK provides a managed Kafka deployment with pay-as-you-go pricing and an acts as a drop-in replacement in existing Kafka workloads. MSK can help reduce management costs and improve cost efficiency and is ideal for lift-and-shift workloads.

Leveraging S3 for Static Web Hosting

A significant portion of any web application is static content. This includes videos, image, text, and other content that changes seldom, if ever. In many lift-and-shifted applications, web servers are migrated to EC2 instances and host all content – static and dynamic. Hosting static content from an EC2 instance incurs a number of costs including the instance, EBS volumes, and likely, a load balancer. By moving static content to S3, you can significantly reduce the amount of compute required to host your web applications. In many cases, this change is non-disruptive and can be done at the DNS or CDN layer, requiring no change to your application.

Reducing Web Hosting Costs with S3 Static Web Hosting

Figure 2: Reducing Web Hosting Costs with S3 Static Web Hosting

Conclusion

There are numerous opportunities for reducing the cost of a lift-and-shift. Without any modification to the application, lift-and-shift workloads can benefit from cloud-native features. By using AWS services and features, you can significantly reduce the undifferentiated heavy lifting inherent in on-prem workloads and reduce resources and management overheads.

About the author

Dr. Jonathan Shapiro-Ward is an AWS Solutions Architect based in Toronto. He helps customers across Canada to transform their businesses and build industry leading cloud solutions. He has a background in distributed systems and big data and holds a PhD from the University of St Andrews.

Optimizing a Lift-and-Shift for Performance

Post Syndicated from Jonathan Shapiro-Ward original https://aws.amazon.com/blogs/architecture/optimizing-a-lift-and-shift-for-performance/

Many organizations begin their cloud journey with a lift-and-shift of applications from on-premise to AWS. This approach involves migrating software deployments with little, or no, modification. A lift-and-shift avoids a potentially expensive application rewrite but can result in a less optimal workload that a cloud native solution. For many organizations, a lift-and-shift is a transitional stage to an eventual cloud native solution, but there are some applications that can’t feasibly be made cloud-native such as legacy systems or proprietary third-party solutions. There are still clear benefits of moving these workloads to AWS, but how can they be best optimized?

In this blog series post, we’ll look at different approaches for optimizing a black box lift-and-shift. We’ll consider how we can significantly improve a lift-and-shift application across three perspectives: performance, cost, and security. We’ll show that without modifying the application we can integrate services and features that will make a lift-and-shift workload cheaper, faster, more secure, and more reliable. In this first blog, we’ll investigate how a lift-and-shift workload can have improved performance through leveraging AWS features and services.

Performance gains are often a motivating factor behind a cloud migration. On-premise systems may suffer from performance bottlenecks owing to legacy infrastructure or through capacity issues. When performing a lift-and-shift, how can you improve performance? Cloud computing is famous for enabling horizontally scalable architectures but many legacy applications don’t support this mode of operation. Traditional business applications are often architected around a fixed number of servers and are unable to take advantage of horizontal scalability. Even if a lift-and-shift can’t make use of auto scaling groups and horizontal scalability, you can achieve significant performance gains by moving to AWS.

Scaling Up

The easiest alternative to scale up to compute is vertical scalability. AWS provides the widest selection of virtual machine types and the largest machine types. Instances range from small, burstable t3 instances series all the way to memory optimized x1 series. By leveraging the appropriate instance, lift-and-shifts can benefit from significant performance. Depending on your workload, you can also swap out the instances used to power your workload to better meet demand. For example, on days in which you anticipate high load you could move to more powerful instances. This could be easily automated via a Lambda function.

The x1 family of instances offers considerable CPU, memory, storage, and network performance and can be used to accelerate applications that are designed to maximize single machine performance. The x1e.32xlarge instance, for example, offers 128 vCPUs, 4TB RAM, and 14,000 Mbps EBS bandwidth. This instance is ideal for high performance in-memory workloads such as real time financial risk processing or SAP Hana.

Through selecting the appropriate instance types and scaling that instance up and down to meet demand, you can achieve superior performance and cost effectiveness compared to running a single static instance. This affords lift-and-shift workloads far greater efficiency that their on-prem counterparts.

Placement Groups and C5n Instances

EC2 Placement groups determine how you deploy instances to underlying hardware. One can either choose to cluster instances into a low latency group within a single AZ or spread instances across distinct underlying hardware. Both types of placement groups are useful for optimizing lift-and-shifts.

The spread placement group is valuable in applications that rely on a small number of critical instances. If you can’t modify your application  to leverage auto scaling, liveness probes, or failover, then spread placement groups can help reduce the risk of simultaneous failure while improving the overall reliability of the application.

Cluster placement groups help improve network QoS between instances. When used in conjunction with enhanced networking, cluster placement groups help to ensure low latency, high throughput, and high network packets per second. This is beneficial for chatty applications and any application that leveraged physical co-location for performance on-prem.

There is no additional charge for using placement groups.

You can extend this approach further with C5n instances. These instances offer 100Gbps networking and can be used in placement group for the most demanding networking intensive workloads. Using both placement groups and the C5n instances require no modification to your application, only to how it is deployed – making it a strong solution for providing network performance to lift-and-shift workloads.

Leverage Tiered Storage to Optimize for Price and Performance

AWS offers a range of storage options, each with its own performance characteristics and price point. Through leveraging a combination of storage types, lift-and-shifts can achieve the performance and availability requirements in a price effective manner. The range of storage options include:

Amazon EBS is the most common storage service involved with lift-and-shifts. EBS provides block storage that can be attached to EC2 instances and formatted with a typical file system such as NTFS or ext4. There are several different EBS types, ranging from inexpensive magnetic storage to highly performant provisioned IOPS SSDs. There are also storage-optimized instances that offer high performance EBS access and NVMe storage. By utilizing the appropriate type of EBS volume and instance, a compromise of performance and price can be achieved. RAID offers a further option to optimize EBS. EBS utilizes RAID 1 by default, providing replication at no additional cost, however an EC2 instance can apply other RAID levels. For instance, you can apply RAID 0 over a number of EBS volumes in order to improve storage performance.

In addition to EBS, EC2 instances can utilize the EC2 instance store. The instance store provides ephemeral direct attached storage to EC2 instances. The instance store is included with the EC2 instance and provides a facility to store non-persistent data. This makes it ideal for temporary files that an application produces, which require performant storage. Both EBS and the instance store are expose to the EC2 instance as block level devices, and the OS can use its native management tools to format and mount these volumes as per a traditional disk – requiring no significant departure from the on prem configuration. In several instance types including the C5d and P3d are equipped with local NVMe storage which can support extremely IO intensive workloads.

Not all workloads require high performance storage. In many cases finding a compromise between price and performance is top priority. Amazon S3 provides highly durable, object storage at a significantly lower price point than block storage. S3 is ideal for a large number of use cases including content distribution, data ingestion, analytics, and backup. S3, however, is accessible via a RESTful API and does not provide conventional file system semantics as per EBS. This may make S3 less viable for applications that you can’t easily modify, but there are still options for using S3 in such a scenario.

An option for leveraging S3 is AWS Storage Gateway. Storage Gateway is a virtual appliance than can be run on-prem or on EC2. The Storage Gateway appliance can operate in three configurations: file gateway, volume gateway and tape gateway. File gateway provides an NFS interface, Volume Gateway provides an iSCSI interface, and Tape Gateway provides an iSCSI virtual tape library interface. This allows files, volumes, and tapes to be exposed to an application host through conventional protocols with the Storage Gateway appliance persisting data to S3. This allows an application to be agnostic to S3 while leveraging typical enterprise storage protocols.

Using S3 Storage via Storage Gateway

Figure 1: Using S3 Storage via Storage Gateway

Conclusion

A lift-and-shift can achieve significant performance gains on AWS by making use of a range of instance types, storage services, and other features. Even without any modification to the application, lift-and-shift workloads can benefit from cutting edge compute, network, and IO which can help realize significant, meaningful performance gains.

About the author

Dr. Jonathan Shapiro-Ward is an AWS Solutions Architect based in Toronto. He helps customers across Canada to transform their businesses and build industry leading cloud solutions. He has a background in distributed systems and big data and holds a PhD from the University of St Andrews.

New – Hibernate Your EC2 Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-hibernate-your-ec2-instances/

As you know, you can easily build highly scalable AWS applications that launch fresh EC2 instances on an as-needed basis. While the instances can be up and running in a matter of seconds, booting the operating system and the application can take considerable time. Also, caches and other memory-centric application components can take some time (sometimes tens of minutes) to preload or warm up. Both of these factors impose a delay that can force you to over-provision in case you need incremental capacity very quickly.

Hibernation for EC2 Instances
Today we are giving you the ability to launch EC2 instances, set them up as desired, hibernate them, and then bring them back to life when you need them. The hibernation process stores the in-memory state of the instance, along with its private and elastic IP addresses, allowing it to pick up exactly where it left off.

This feature is available today and you can use it on freshly launched M3, M4, M5, C3, C4, C5, R3, R4, and R5 instances running Amazon Linux 1 (support for Amazon Linux 2 is in the works and will be ready soon). It applies to On-Demand instances and instances running with Reserved Instance coverage.

When an instance is instructed to hibernate, it writes the in-memory state to a file in the root EBS volume and then (in effect) shuts itself down. The AMI used to launch the instance must be encrypted, as must the root EBS volume of the instance. The encryption ensures proper protection for sensitive data when it is copied from memory to the EBS volume.

While the instance is in hibernation, you pay only for the EBS volumes and Elastic IP Addresses attached to it; there are no other hourly charges (just like any other stopped instance).

Hibernation in Action
In order to check out this feature I launch a c4.large instance, and select hibernation as a stop behavior:

I also expand my instance’s root volume, adding 10 GB + the memory size of the instance to the desired size:

I also create and associate an Elastic IP address with my instance since the public IP address will change. My instance is up and running, and I can check the uptime:

Then I select the instance in the EC2 Console and choose Stop – Hibernate from the Instance State menu (API and CLI support is also available):

The instance state transitions from running to stopping, and then to stopped, in seconds:

The console provides additional information about the transition:

The SSH connection to the instance drops, since it is no longer running:

Later, when I am ready to proceed, I click Start:

This time the state goes from stopped to pending, and then to running, again in seconds, and I can reconnect. I can then use uptime to see that the instance has not been rebooted, but has continued from where it left off:

If I was using this instance interactively, I could use a session manager such as screen, tmux, or mosh to make this totally seamless. The most interesting use cases for hibernation revolve around long-running processes and services that take a lot of time to initialize before they are ready to accept traffic where this would not be a concern.

Things to Know
As you can see, hibernation is really easy to use, and I hope that you are already thinking of some ways to apply it to your application. Here are a couple of things to keep in mind:

Instance Type – You can enable and use hibernation on freshly launches instances of the types that I listed above.

Root Volume Size – The root volume must have free space equal to the amount of RAM on the instance in order for the hibernation to succeed.

Operating Systems – The newest Amazon Linux 1 AMIs are configured for hibernation, with many others in the works. You will need to create an encrypted AMI, using one of these AMIs as a base. You can also follow our directions to customize and use your own AMI.

Modifications – You cannot modify the instance size or type while it is in hibernation, but you can modify the user data and the EBS Optimization setting.

Pricing – While the instance is in hibernation, you pay only for the EBS storage and any Elastic IP addresses attached to the instance.

Performance – The time to hibernate or resume is dependent on the memory size of the instance, the amount of in-memory data to be saved, and the throughput of the root EBS volume.

Coming Soon – We are working on support for Amazon Linux 2, Ubuntu, Windows Server 2008 R2, Windows Server 2012, Windows Server 2012 R2, Windows Server 2016, along with the SQL Server variants of the Windows AMIs.

Available Now
This feature is available now in the US East (N. Virginia, Ohio), US West (N. California, Oregon), Canada (Central), South America (São Paulo), Asia Pacific (Mumbai, Seoul, Singapore, Sydney, Tokyo), and EU (Frankfurt, London, Ireland, Paris) Regions.

Jeff;

New – Amazon FSx for Windows File Server – Fast, Fully Managed, and Secure

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-fsx-for-windows-file-server-fast-fully-managed-and-secure/

Organizations that want to run Windows applications on the cloud are commonly looking for network file storage that’s fully compatible with their applications and their Windows environments. For example, enterprises use Active Directory for identification and Windows Access Control Lists for fine-grained control over access to folders and files, and their applications typically rely on storage that provides full Windows file system (NTFS file system) compatibility.

Amazon FSx for Windows File Server
Amazon FSx for Windows File Server fits all of these needs, and more. It was designed from the ground up to work with your existing Windows applications and environments, making lift-and-shift of your Windows workloads to the cloud super-easy. You get a native Windows file system backed by fully-managed Windows file servers, accessible via the widely adopted SMB (Server Message Block) protocol. Built on SSD storage, Amazon FSx for Windows File Server delivers the throughput, IOPS, and consistent sub-millisecond performance that you (and your Windows applications) expect.

Here are the most important things to know:

Accessibility & Protocol Support – You can access your shares from Amazon Elastic Compute Cloud (EC2) instances, Amazon WorkSpaces virtual desktops, Amazon AppStream 2.0 applications, and VMware Cloud on AWS. Versions 2.0 through 3.1.1 of SMB are supported, allowing you to use Windows versions starting from Windows 7 and Windows Server 2008, and current versions of Linux (via Samba). Active Directory integration is built in, allowing you to easily integrate with your existing enterprise environment.

Performance and TunabilityAmazon FSx for Windows File Server delivers consistent, sub-millisecond latency. You can set the file system size and throughput (in megabytes per second) independently, with plenty of latitude in each dimension. File systems can be as big as 64 TB, and can deliver up to 2,048 MB/second of throughput.

Management – Your file systems are fully managed and data is stored in redundant form within an AWS Availability Zone. You don’t have to worry about attaching and formatting additional storage devices, updating Windows Server, or recovering from hardware failures. Incremental file-system consistent backups are taken automatically every day, with the option to take additional backups when needed.

Security – You get multiple levels of access control and data protection. File system endpoints are created within Virtual Private Clouds (VPCs) and access is governed by Security Groups. Windows ACLs are used to control access to folders and files; IAM roles are used to control access to administrative functions, with administrative activities logged to AWS CloudTrail. Your data is encrypted in transit and (using a KMS key that you can control) at rest. The service is PCI-DSS compliant and can be used to build HIPAA-compliant applications.

Multi-AZ Deployment – You create file systems in distinct AWS Availability Zones, and can use Microsoft DFS to set up automatic replication and failover between them. You can also use Microsoft DFS Namespaces to create shared, common namespaces that span multiple file systems and provide up to 300 PB of storage.

Creating a File System
Amazon FSx for Windows File Server is easy to use. I start by confirming that I have a Active Directory with a Domain Controller in the VPC subnet (subnet-009a1149) where I plan to create my file system’s endpoints:

For testing purposes, I also have an EC2 instance running Windows in the same subnet:

I open the Amazon FSx Console, and click Create file system:

I choose my file system option:

I specify a name, size, optional throughput, and other parameters for my new file system, and click Review summary to proceed:

On another browser tab I verify that the security group for the file system is configured to allow connections from my EC2 instance on the desired ports (135, 445, and 55555):

On the next page I review the settings and the estimated monthly costs, and click Create file system. My file system starts out in the Creating status and transitions to Available in minutes:

I can see an overview at a glance:

And I can click the Network & Security tab to get the DNS name for my file system:

I copy the DNS name, hop over to my EC2 instance, open Explorer, and Map my file system (a shared named share is created automatically):

Then I can use it like any other share (I’m sure that your use case is better than mine, but perhaps not as historically significant):

Each file system includes one share (named share) automatically. I can connect to the file system and create additional shares using the standard Windows tools and wizards:

File-system consistent backups are made daily during the backup window for the file system, and are retained for up to 35 days, as specified when the file system was created. I can also make backups on an as-needed basis:

Available Now
Amazon FSx for Windows File Server is available now and you can start using it today in the US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland) Regions, with expansion to other Regions planned for the coming months. Pricing is based on the amount of storage and throughput that you configure.

Jeff;

 

Firecracker – Lightweight Virtualization for Serverless Computing

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/firecracker-lightweight-virtualization-for-serverless-computing/

One of my favorite Amazon Leadership Principles is Customer Obsession. When we launched AWS Lambda, we focused on giving developers a secure serverless experience so that they could avoid managing infrastructure. In order to attain the desired level of isolation we used dedicated EC2 instances for each customer. This approach allowed us to meet our security goals but forced us to make some tradeoffs with respect to the way that we managed Lambda behind the scenes. Also, as is the case with any new AWS service, we did not know how customers would put Lambda to use or even what they would think of the entire serverless model. Our plan was to focus on delivering a great customer experience while making the backend ever-more efficient over time.

Just four years later (Lambda was launched at re:Invent 2014) it is clear that the serverless model is here to stay. Today, Lambda processes trillions of executions for hundreds of thousands of active customers every month. Last year we extended the benefits of serverless to containers with the launch of AWS Fargate, which now runs tens of millions of containers for AWS customers every week.

As our customers increasingly adopted serverless, it was time to revisit the efficiency issue. Taking our Invent and Simplify principle to heart, we asked ourselves what a virtual machine would look like if it was designed for today’s world of containers and functions!

Introducing Firecracker
Today I would like to tell you about Firecracker, a new virtualization technology that makes use of KVM. You can launch lightweight micro-virtual machines (microVMs) in non-virtualized environments in a fraction of a second, taking advantage of the security and workload isolation provided by traditional VMs and the resource efficiency that comes along with containers.

Here’s what you need to know about Firecracker:

Secure – This is always our top priority! Firecracker uses multiple levels of isolation and protection, and exposes a minimal attack surface.

High Performance – You can launch a microVM in as little as 125 ms today (and even faster in 2019), making it ideal for many types of workloads, including those that are transient or short-lived.

Battle-TestedFirecracker has been battled-tested and is already powering multiple high-volume AWS services including AWS Lambda and AWS Fargate.

Low OverheadFirecracker consumes about 5 MiB of memory per microVM. You can run thousands of secure VMs with widely varying vCPU and memory configurations on the same instance.

Open SourceFirecracker is an active open source project. We are already ready to review and accept pull requests, and look forward to collaborating with contributors from all over the world.

Firecracker was built in a minimalist fashion. We started with crosvm and set up a minimal device model in order to reduce overhead and to enable secure multi-tenancy. Firecracker is written in Rust, a modern programming language that guarantees thread safety and prevents many types of buffer overrun errors that can lead to security vulnerabilities.

Firecracker Security
As I mentioned earlier, Firecracker incorporates a host of security features! Here’s a partial list:

Simple Guest ModelFirecracker guests are presented with a very simple virtualized device model in order to minimize the attack surface: a network device, a block I/O device, a Programmable Interval Timer, the KVM clock, a serial console, and a partial keyboard (just enough to allow the VM to be reset).

Process Jail – The Firecracker process is jailed using cgroups and seccomp BPF, and has access to a small, tightly controlled list of system calls.

Static Linking – The firecracker process is statically linked, and can be launched from a jailer to ensure that the host environment is as safe and clean as possible.

Firecracker in Action
To get some experience with Firecracker, I launch an i3.metal instance and download three files (the firecracker binary, a root file system image, and a Linux kernel):

I need to set up the proper permission to access /dev/kvm:

$  sudo setfacl -m u:${USER}:rw /dev/kvm

I start firecracker in one PuTTY session, and then issue commands in another (the process listens on a Unix-domain socket and implements a REST API). The first command sets the configuration for my first guest machine:

$ curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/machine-config" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{
        \"vcpu_count\": 1,
        \"mem_size_mib\": 512
    }"

And, the second sets the guest kernel:

$ curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/boot-source" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{
        \"kernel_image_path\": \"./hello-vmlinux.bin\",
        \"boot_args\": \"console=ttyS0 reboot=k panic=1 pci=off\"
    }"

And, the third one sets the root file system:

$ curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/drives/rootfs" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{
        \"drive_id\": \"rootfs\",
        \"path_on_host\": \"./hello-rootfs.ext4\",
        \"is_root_device\": true,
        \"is_read_only\": false
    }"

With everything set to go, I can launch a guest machine:

# curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/actions" \
    -H  "accept: application/json" \
    -H  "Content-Type: application/json" \
    -d "{
        \"action_type\": \"InstanceStart\"
     }"

And I am up and running with my first VM:

In a real-world scenario I would script or program all of my interactions with Firecracker, and I would probably spend more time setting up the networking and the other I/O. But re:Invent awaits and I have a lot more to do, so I will leave that part as an exercise for you.

Collaborate with Us
As you can see this is a giant leap forward, but it is just a first step. The team is looking forward to telling you more, and to working with you to move ahead. Star the repo, join the community, and send us some code!

Jeff;

 

 

 

 

New C5n Instances with 100 Gbps Networking

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-c5n-instances-with-100-gbps-networking/

We launched the powerful, compute-intensive C5 instances last year, and followed up with the C5d instances earlier this year with the addition of local NVMe storage. Both instances are built on the AWS Nitro system and are powered by AWS-custom 3.0 Ghz Intel® Xeon® Platinum 8000 series processors. They are designed for compute-heavy applications such as batch processing, distributed analytics, high-performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding.

New 100 Gbps Networking
Today we are adding an even more powerful variant, the C5n instance. With up to 100 Gbps of network bandwidth, your simulations, in-memory caches, data lakes, and other communication-intensive applications will run better than ever. Here are the specs:

Instance Name vCPUs
RAM
EBS Bandwidth Network Bandwidth
c5n.large 2 5.25 GiB Up to 3.5 Gbps Up to 25 Gbps
c5n.xlarge 4 10.5 GiB Up to 3.5 Gbps Up to 25 Gbps
c5n.2xlarge 8 21 GiB Up to 3.5 Gbps Up to 25 Gbps
c5n.4xlarge 16 42 GiB 3.5 Gbps Up to 25 Gbps
c5n.9xlarge 36 96 GiB 7 Gbps 50 Gbps
c5n.18xlarge 72 192 GiB 14 Gbps 100 Gbps

The Nitro Hypervisor allows the full range of C5n instances to deliver performance that is just about indistinguishable from bare metal. Other AWS Nitro System components, including the Nitro Security Chip, hardware EBS processing, and hardware support for the software defined network inside of each VPC also enhance performance.

Each vCPU is a hardware hyperthread on the Intel Xeon Platinum 8000 series processor. You get full control over the C-states on the two largest sizes, allowing you to run a single core at up to 3.5 Ghz using Intel Turbo Boost Technology.

The new instances also feature a higher amount of memory per core, putting them in the current “sweet spot” for HPC applications that work most efficiently when there’s at least 4 GiB of memory for each core. The instances also benefit from some internal improvements that boost memory access speed by up to 19% in comparison to the C5 and C5d instances.

It’s All About the Network
Now let’s get to the big news!

The C5n instances incorporate the fourth generation of our custom Nitro hardware, allowing the high-end instances to provide up to 100 Gbps of network throughput, along with a higher ceiling on packets per second. The Elastic Network Interface (ENI) on the C5n uses up to 32 queues (in comparison to 8 on the C5 and C5d), allowing the packet processing workload to be better distributed across all available vCPUs. The ability to push more packets per second will make these instances a great fit for network appliances such as firewalls, routers, and 5G cellular infrastructure.

In order to make the most of the available network bandwidth, you need to be using the latest Elastic Network Adapter (ENA) drivers (available in the latest Amazon Linux, Red Hat 7.6, and Ubuntu AMIs, and in the upstream Linux kernel) and you need to make use of multiple traffic flows. Flows within a Placement Group can reach 10 Gbps; the rest can reach 5 Gbps. When using multiple flows on the high-end instances, you can transfer 100 Gbps between EC2 instances in the same region (within or across AZs), S3 buckets, and AWS services such as Amazon Relational Database Service (RDS), Amazon ElastiCache, and Amazon EMR.

Available Now
C5n instances are available now in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and AWS GovCloud (US-West) Regions and you can launch one (or an entire cluster of them) today in On-Demand, Reserved Instance, Spot, Dedicated Host, or Dedicated Instance form.

Jeff;

New – EC2 Instances (A1) Powered by Arm-Based AWS Graviton Processors

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/

Earlier this year I told you about the AWS Nitro System and promised you that it would allow us to “deliver new instance types more quickly than ever in the months to come.” Since I made that promise we have launched memory-intensive R5 and R5d instances, high frequency z1d instances, burstable T3 instances, high memory instances with up to 12 TiB of memory, and AMD-powered M5a and R5a instances. The purpose-built hardware and the lightweight hypervisor that comprise the AWS Nitro System allow us to innovate more quickly while devoting virtually all of the power of the host hardware to the instances.

We acquired Annapurna Labs in 2015 after working with them on the first version of the AWS Nitro System. Since then we’ve worked with them to build and release two generations of ASICs (chips, not shoes) that now offload all EC2 system functions to Nitro, allowing 100% of the hardware to be devoted to customer instances. A few years ago the team started to think about building an Amazon-built custom CPU designed for cost-sensitive scale-out workloads.

AWS Graviton Processors
Today we are launching EC2 instances powered by Arm-based AWS Graviton Processors. Built around Arm cores and making extensive use of custom-built silicon, the A1 instances are optimized for performance and cost. They are a great fit for scale-out workloads where you can share the load across a group of smaller instances. This includes containerized microservices, web servers, development environments, and caching fleets.

The A1 instances are available in five sizes, all EBS-Optimized by default, at a significantly lower cost:

Instance Name vCPUs
RAM
EBS Bandwidth Network Bandwidth On-Demand Price/Hour
US East (N. Virginia)
a1.medium 1 2 GiB Up to 3.5 Gbps Up to 10 Gbps $0.0255
a1.large 2 4 GiB Up to 3.5 Gbps Up to 10 Gbps $0.0510
a1.xlarge 4 8 GiB Up to 3.5 Gbps Up to 10 Gbps $0.1020
a1.2xlarge 8 16 GiB Up to 3.5 Gbps Up to 10 Gbps $0.2040
a1.4xlarge 16 32 GiB 3.5 Gbps Up to 10 Gbps $0.4080

If your application is written in a scripting language, odds are that you can simply move it over to an A1 instance and run it as-is. If your application compiles down to native code, you will need to rebuild it on an A1 instance.

I was lucky enough to get early access to some A1 instances in order to put them through their paces. The console was still under construction so I launched my first instance from the command line:

$ aws ec2 run-instances --image-id ami-036237e941dccd50e \
  --instance-type a1.medium --count 1 --key-name keys-jbarr-us-east

The instance is up and running in seconds; I can run uname to check the architecture:

AMIs are available for Amazon Linux 2, RHEL, and Ubuntu today, with additional operating system support on the way.

Available Now
The A1 instances are available now in the US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland) Regions in On-Demand, Reserved Instance, Spot, Dedicated Instance, and Dedicated Host form and you can launch them today.

Jeff;

 

Building a tightly coupled molecular dynamics workflow with multi-node parallel jobs in AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/building-a-tightly-coupled-molecular-dynamics-workflow-with-multi-node-parallel-jobs-in-aws-batch/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services and Aswin Damodar, Senior Software Development Engineer, AWS Batch

At Supercomputing 2018 in Dallas, TX, AWS announced AWS Batch support for running tightly coupled workloads in a multi-node parallel jobs environment. This AWS Batch feature enables applications that require strong scaling for efficient computational workloads.

Some of the more popular workloads that can take advantage of this feature enhancement include computational fluid dynamics (CFD) codes such as OpenFoam, Fluent, and ANSYS. Other workloads include molecular dynamics (MD) applications such as AMBER, GROMACS, NAMD.

Running tightly coupled, distributed, deep learning frameworks is also now possible on AWS Batch. Applications that can take advantage include TensorFlow, MXNet, Pytorch, and Chainer. Essentially, any application scaling that benefits from tightly coupled–based scalability can now be integrated into AWS Batch.

In this post, we show you how to build a workflow executing an MD simulation using GROMACS running on GPUs, using the p3 instance family.

AWS Batch overview

AWS Batch is a service providing managed planning, scheduling, and execution of containerized workloads on AWS. Purpose-built for scalable compute workloads, AWS Batch is ideal for high throughput, distributed computing jobs such as video and image encoding, loosely coupled numerical calculations, and multistep computational workflows.

If you are new to AWS Batch, consider gaining familiarity with the service by following the tutorial in the Creating a Simple “Fetch & Run” AWS Batch Job post.

Prerequisites

You need an AWS account to go through this walkthrough. Other prerequisites include:

  • Launch an ECS instance, p3.2xlarge with a NVIDIA Tesla V100 backend. Use the Amazon Linux 2 AMIs for ECS.
  • In the ECS instance, install the latest CUDA 10 stack, which provides the toolchain and compilation libraries as well as the NVIDIA driver.
  • Install nvidia-docker2.
  • In your /etc/docker/daemon.json file, ensure that the default-runtime value is set to nvidia.
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
  • Finally, save the EC2 instance as an AMI in your account. Copy the AMI ID, as you need it later in the post.

Deploying the workload

In a production environment, it’s important to efficiently execute the compute workload with multi-node parallel jobs. Most of the optimization is on the application layer and how efficiently the Message Passing Interface (MPI) ranks (MPI and OpenMP threads) are distributed across nodes. Application-level optimization is out of scope for this post, but should be considered when running in production.

One of the key requirements for running on AWS Batch is a Dockerized image with the application, libraries, scripts, and code. For multi-node parallel jobs, you need an MPI stack for the tightly coupled communication layer and a wrapper script for the MPI orchestration. The running child Docker containers need to pass container IP address information to the master node to fill out the MPI host file.

The undifferentiated heavy lifting that AWS Batch provides is the Docker-to-Docker communication across nodes using Amazon ECS task networking. With multi-node parallel jobs, the ECS container receives environmental variables from the backend, which can be used to establish which running container is the master and which is the child.

  • AWS_BATCH_JOB_MAIN_NODE_INDEX—The designation of the master node in a multi-node parallel job. This is the main node in which the MPI job is launched.
  • AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS—The IPv4 address of the main node. This is presented in the environment for all children nodes.
  • AWS_BATCH_JOB_NODE_INDEX—The designation of the node index.
  • AWS_BATCH_JOB_NUM_NODES – The number of nodes launched as part of the node group for your multi-node parallel job.

If AWS_BATCH_JOB_MAIN_NODE_INDEX = AWS_BATCH_JOB_NODE_INDEX, then this is the main node. The following code block is an example MPI synchronization script that you can include as part of the CMD structure of the Docker container. Save the following code as mpi-run.sh.

#!/bin/bash

cd $JOB_DIR

PATH="$PATH:/opt/openmpi/bin/"
BASENAME="${0##*/}"
log () {
  echo "${BASENAME} - ${1}"
}
HOST_FILE_PATH="/tmp/hostfile"
AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"

aws s3 cp $S3_INPUT $SCRATCH_DIR
tar -xvf $SCRATCH_DIR/*.tar.gz -C $SCRATCH_DIR

sleep 2

usage () {
  if [ "${#@}" -ne 0 ]; then
    log "* ${*}"
    log
  fi
  cat <<ENDUSAGE
Usage:
export AWS_BATCH_JOB_NODE_INDEX=0
export AWS_BATCH_JOB_NUM_NODES=10
export AWS_BATCH_JOB_MAIN_NODE_INDEX=0
export AWS_BATCH_JOB_ID=string
./mpi-run.sh
ENDUSAGE

  error_exit
}

# Standard function to print an error and exit with a failing return code
error_exit () {
  log "${BASENAME} - ${1}" >&2
  log "${2:-1}" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
}

# Set child by default switch to main if on main node container
NODE_TYPE="child"
if [ "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" == 
"${AWS_BATCH_JOB_NODE_INDEX}" ]; then
  log "Running synchronize as the main node"
  NODE_TYPE="main"
fi


# wait for all nodes to report
wait_for_nodes () {
  log "Running as master node"

  touch $HOST_FILE_PATH
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
  
  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "master details -> $ip:$availablecores"
  echo "$ip slots=$availablecores" >> $HOST_FILE_PATH

  lines=$(uniq $HOST_FILE_PATH|wc -l)
  while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
  do
    log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, check again in 1 second"
    sleep 1
    lines=$(uniq $HOST_FILE_PATH|wc -l)
  done
  # Make the temporary file executable and run it with any given arguments
  log "All nodes successfully joined"
  
  # remove duplicates if there are any.
  awk '!a[$0]++' $HOST_FILE_PATH > ${HOST_FILE_PATH}-
deduped
  cat $HOST_FILE_PATH-deduped
  log "executing main MPIRUN workflow"

  cd $SCRATCH_DIR
  . /opt/gromacs/bin/GMXRC
  /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 \
                          -x PATH -x LD_LIBRARY_PATH -x 
GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
                          --allow-run-as-root --machinefile 
${HOST_FILE_PATH}-deduped \
                          $GMX_COMMAND
  sleep 2

  tar -czvf $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz 
$SCRATCH_DIR/*
  aws s3 cp $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz 
$S3_OUTPUT

  log "done! goodbye, writing exit code to 
$AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "0" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
  exit 0
}


# Fetch and run a script
report_to_master () {
  # get own ip and num cpus
  #
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)

  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "I am a child node -> $ip:$availablecores, reporting to the master node -> 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS}"
  until echo "$ip slots=$availablecores" | ssh 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS} "cat >> 
/$HOST_FILE_PATH"
  do
    echo "Sleeping 5 seconds and trying again"
  done
  log "done! goodbye"
  exit 0
  }


# Main - dispatch user request to appropriate function
log $NODE_TYPE
case $NODE_TYPE in
  main)
    wait_for_nodes "${@}"
    ;;

  child)
    report_to_master "${@}"
    ;;

  *)
    log $NODE_TYPE
    usage "Could not determine node type. Expected (main/child)"
    ;;
esac

The synchronization script supports downloading the assets from Amazon S3 as well as preparing the MPI host file based on GPU scheduling for GROMACS.

Furthermore, the mpirun stanza is captured in this script. This script can be a template for several multi-node parallel job applications by just changing a few lines.  These lines are essentially the GROMACS-specific steps:

. /opt/gromacs/bin/GMXRC
export OMP_NUM_THREADS=$OMP_THREADS
/opt/openmpi/bin/mpirun -np $MPI_THREADS --mca btl_tcp_if_include eth0 \
-x OMP_NUM_THREADS -x PATH -x LD_LIBRARY_PATH -x GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
--allow-run-as-root --machinefile ${HOST_FILE_PATH}-deduped \
$GMX_COMMAND

In your development environment for building Docker images, create a Dockerfile that prepares the software stack for running GROMACS. The key elements of the Dockerfile are:

  1. Set up a passwordless-ssh keygen.
  2. Download, and compile OpenMPI. In this Dockerfile, you are downloading the recently released OpenMPI 4.0.0 source and compiling on a NVIDIA Tesla V100 GPU-backed instance (p3.2xlarge).
  3. Download and compile GROMACS.
  4. Set up supervisor to run SSH at Docker container startup as well as processing the mpi-run.sh script as the CMD.

Save the following script as a Dockerfile:

FROM nvidia/cuda:latest

ENV USER root

# -------------------------------------------------------------------------------------
# install needed software -
# openssh
# mpi
# awscli
# supervisor
# -------------------------------------------------------------------------------------

RUN apt update
RUN DEBIAN_FRONTEND=noninteractive apt install -y iproute2 cmake openssh-server openssh-client python python-pip build-essential gfortran wget curl
RUN pip install supervisor awscli

RUN mkdir -p /var/run/sshd
ENV DEBIAN_FRONTEND noninteractive

ENV NOTVISIBLE "in users profile"

#####################################################
## SSH SETUP

RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed '[email protected]\s*required\s*[email protected] optional [email protected]' -i /etc/pam.d/sshd
RUN echo "export VISIBLE=now" >> /etc/profile

RUN echo "${USER} ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
ENV SSHDIR /root/.ssh
RUN mkdir -p ${SSHDIR}
RUN touch ${SSHDIR}/sshd_config
RUN ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N ''
RUN cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys
RUN cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa
RUN echo " IdentityFile ${SSHDIR}/id_rsa" >> /etc/ssh/ssh_config
RUN echo "Host *" >> /etc/ssh/ssh_config && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config
RUN chmod -R 600 ${SSHDIR}/* && \
chown -R ${USER}:${USER} ${SSHDIR}/
# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

##################################################
## S3 OPTIMIZATION

RUN aws configure set default.s3.max_concurrent_requests 30
RUN aws configure set default.s3.max_queue_size 10000
RUN aws configure set default.s3.multipart_threshold 64MB
RUN aws configure set default.s3.multipart_chunksize 16MB
RUN aws configure set default.s3.max_bandwidth 4096MB/s
RUN aws configure set default.s3.addressing_style path

##################################################
## CUDA MPI

RUN wget -O /tmp/openmpi.tar.gz https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz && \
tar -xvf /tmp/openmpi.tar.gz -C /tmp
RUN cd /tmp/openmpi* && ./configure --prefix=/opt/openmpi --with-cuda --enable-mpirun-prefix-by-default && \
make -j $(nproc) && make install
RUN echo "export PATH=$PATH:/opt/openmpi/bin" >> /etc/profile
RUN echo "export LD_LIBRARY_PATH=$LD_LIRBARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64" >> /etc/profile

###################################################
## GROMACS 2018 INSTALL

ENV PATH $PATH:/opt/openmpi/bin
ENV LD_LIBRARY_PATH $LD_LIRBARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64
RUN wget -O /tmp/gromacs.tar.gz http://ftp.gromacs.org/pub/gromacs/gromacs-2018.4.tar.gz && \
tar -xvf /tmp/gromacs.tar.gz -C /tmp
RUN cd /tmp/gromacs* && mkdir build
RUN cd /tmp/gromacs*/build && \
cmake .. -DGMX_MPI=on -DGMX_THREAD_MPI=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/opt/gromacs && \
make -j $(nproc) && make install
RUN echo "source /opt/gromacs/bin/GMXRC" >> /etc/profile

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 22
RUN export PATH="$PATH:/opt/openmpi/bin"
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN chmod 0755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

After the container is built, push the image to your Amazon ECR repository and note the container image URI for later steps.

Set up GROMACS

For the input files, use the Chalcone Synthase (1CGZ) example, from RCSB.org. For this post, just run a simple simulation following the Lysozyme in Water GROMACS tutorial.

Execute the production MD run before the analysis (that is, after the system is solvated, neutralized, and equilibrated), so you can show that the longest part of the simulation can be achieved in a containizered workflow.

It is possible from the tutorial to run the entire workflow from PDB preparation to solvation, and energy minimization and analysis in AWS Batch.

Set up the compute environment

For the purpose of running the MD simulation in this test case, use two p3.2xlarge instances. Each instance provides one NVIDIA Tesla V100 GPU for which GROMACS distributes the job. You don’t have to launch specific instance types. With the p3 family, the MPI-wrapper can concomitantly modify the MPI ranks to accommodate the current GPU and node topology.

When the job is executed, instantiate two MPI processes with two OpenMP threads per MPI process. For this post, launch EC2 OnDemand, using the Amazon Linux AMIs we can take advantage of per-second billing.

Under Create Compute Environment, choose a managed compute environment and provide a name, such as gromacs-gpu-ce. Attach two roles:

  • AWSBatchServiceRole—Allows AWS Batch to make EC2 calls on your behalf.
  • ecsInstanceRole—Allows the underlying instance to make AWS API calls.

In the next panel, specify the following field values:

  • Provisioning model: EC2
  • Allowed instance types: p3 family
  • Minimum vCPUs: 0
  • Desired vCPUs: 0
  • Maximum vCPUs: 128

For Enable user-specified Ami ID and enter the AMI that you created earlier.

Finally, for the compute environment, specify the network VPC and subnets for launching the instances, as well as a security group. We recommend specifying a placement group for tightly coupled workloads for better performance. You can also create EC2 tags for the launch instances. We used name=gromacs-gpu-processor.

Next, choose Job Queues and create a gromacs-queue queue coupled with the compute environment created earlier. Set the priority to 1 and select Enable job queue.

Set up the job definition

In the job definition setup, you create a two-node group, where each node pulls the gromacs_mpi image. Because you are using the p3.2xlarge instance providing one V100 GPU per instance, your vCPU slots = 8 for scheduling purposes.

{
    "jobDefinitionName": "gromacs-jobdef",
    "jobDefinitionArn": "arn:aws:batch:us-east-2:<accountid>:job-definition/gromacs-jobdef:1",
    "revision": 6,
    "status": "ACTIVE",
    "type": "multinode",
    "parameters": {},
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:1",
                "container": {
                    "image": "<accountid>.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest",
                    "vcpus": 8,
                    "memory": 24000,
                    "command": [],
                    "jobRoleArn": "arn:aws:iam::<accountid>:role/ecsTaskExecutionRole",
                    "volumes": [
                        {
                            "host": {
                                "sourcePath": "/scratch"
                            },
                            "name": "scratch"
                        },
                        {
                            "host": {
                                "sourcePath": "/efs"
                            },
                            "name": "efs"
                        }
                    ],
                    "environment": [
                        {
                            "name": "SCRATCH_DIR",
                            "value": "/scratch"
                        },
                        {
                            "name": "JOB_DIR",
                            "value": "/efs"
                        },
                        {
                            "name": "GMX_COMMAND",
                            "value": "gmx_mpi mdrun -deffnm md_0_1 -nb gpu -ntomp 1"
                        },
                        {
                            "name": "OMP_THREADS",
                            "value": "2"
                        },
                        {
                            “name”: “MPI_THREADS”,
                            “value”: “1”
                        },
                        {
                            "name": "S3_INPUT",
                            "value": "s3://ragab-md/1CGZ.tar.gz"
                        },
                        {
                            "name": "S3_OUTPUT",
                            "value": "s3://ragab-md"
                        }
                    ],
                    "mountPoints": [
                        {
                            "containerPath": "/scratch",
                            "sourceVolume": "scratch"
                        },
                        {
                            "containerPath": "/efs",
                            "sourceVolume": "efs"
                        }
                    ],
                    "ulimits": [],
                    "instanceType": "p3.2xlarge"
                }
            }
        ]
    }
}

Submit the GROMACS job

In the AWS Batch job submission portal, provide a job name and select the job definition created earlier as well as the job queue. Ensure that the vCPU value is set to 8 and the Memory (MiB) value is 24000.

Under Environmental Variables, within in each node group, ensure that the keys are set correctly as follows.

Key Value
SCRATCH_DIR /scratch
JOB_DIR /efs
OMP_THREADS 2
GMX_COMMAND gmx_mpi mdrun -deffnm md_0_1 -nb gpu
MPI_THREADS 2
S3_INPUT s3://<your input>
S3_OUTPUT s3://<your output>

Submit the job and wait for it to enter into the RUNNING state. After the job is in the RUNNING state, select the job ID and choose Nodes.

The containers listed each write to a separate Amazon CloudWatch log stream where you can monitor the progress.

After the job is completed the entire working directory is compressed and uploaded to S3, the trajectories (*.xtc) and input .gro files can be viewed in your favorite MD analysis package. For more information about preparing a desktop, see Deploying a 4x4K, GPU-backed Linux desktop instance on AWS.

You can view the trajectories in PyMOL as well as running any subsequent trajectory analysis.

Extending the solution

As we mentioned earlier, you can take this core workload and extend it as part of a job execution chain in a workflow. Native support for job dependencies exists in AWS Batch and alternatively in AWS Step Functions. With Step Functions, you can create a decision-based workflow tree to run the preparation, solvation, energy minimization, equilibration, production MD, and analysis.

Conclusion

In this post, we showed that tightly coupled, scalable MD simulations can be executed using the recently released multi-node parallel jobs feature for AWS Batch. You finally have an end-to-end solution for distributed and MPI-based workloads.

As we mentioned earlier, many other applications can also take advantage of this feature set. We invite you to try this out and let us know how it goes.

Want to discuss how your tightly coupled workloads can benefit on AWS Batch? Contact AWS.

New – Predictive Scaling for EC2, Powered by Machine Learning

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-predictive-scaling-for-ec2-powered-by-machine-learning/

When I look back on the history of AWS and think about the launches that truly signify the fundamentally dynamic, on-demand nature of the cloud, two stand out in my memory: the launch of Amazon EC2 in 2006 and the concurrent launch of CloudWatch Metrics, Auto Scaling, and Elastic Load Balancing in 2009. The first launch provided access to compute power; the second made it possible to use that access to rapidly respond to changes in demand. We have added a multitude of features to all of these services since then, but as far as I am concerned they are still central and fundamental!

New Predictive Scaling
Today we are making Auto Scaling even more powerful with the addition of predictive scaling. Using data collected from your actual EC2 usage and further informed by billions of data points drawn from our own observations, we use well-trained Machine Learning models to predict your expected traffic (and EC2 usage) including daily and weekly patterns. The model needs at least one day’s of historical data to start making predictions; it is re-evaluated every 24 hours to create a forecast for the next 48 hours.

We’ve done our best to make this really easy to use. You enable it with a single click, and then use a 3-step wizard to choose the resources that you want to observe and scale. You can configure some warm-up time for your EC2 instances, and you also get to see actual and predicted usage in a cool visualization! The prediction process produces a scaling plan that can drive one or more groups of Auto Scaled EC2 instances.

Once your new scaling plan is in action, you will be able to scale proactively, ahead of daily and weekly peaks. This will improve the overall user experience for your site or business, and it can also help you to avoid over-provisioning, which will reduce your EC2 costs.

Let’s take a look…

Predictive Scaling in Action
The first step is to open the Auto Scaling Console and click Get started:

I can select the resources to be observed and predictively scaled in three different ways:

I select an EC2 Auto Scaling group (not shown), then I assign my group a name, pick a scaling strategy, and leave both Enable predictive scaling and Enable dynamic scaling checked:

As you can see from the screen above, I can use predictive scaling, dynamic scaling, or both. Predictive scaling works by forecasting load and scheduling minimum capacity; dynamic scaling uses target tracking to adjust a designated CloudWatch metric to a specific target. The two models work well together because of the scheduled minimum capacity already set by predictive scaling.

I can also fine-tune the predictive scaling, but the default values will work well to get started:

I can forecast on one of three pre-chosen metrics (this is in the General settings):

Or on a custom metric:

I have the option to do predictive forecasting without actually scaling:

And I can set up a buffer time so that newly launched instances can warm up and be ready to handle traffic at the predicted time:

After a couple more clicks, the scaling plan is created and the learning/prediction process begins! I return to the console and I can see the forecasts for CPU Utilization (my chosen metric) and for the number of instances:

I can see the scaling actions that will implement the predictions:

I can also see the CloudWatch metrics for the Auto Scaling group:

And that’s all you need to do!

Here are a couple of things to keep in mind about predictive scaling:

Timing – Once the initial set of predictions have been made and the scaling plans are in place, the plans are updated daily and forecasts are made for the following 2 days.

Cost – You can use predictive scaling at no charge, and may even reduce your AWS charges.

Resources – We are launching with support for EC2 instances, and plan to support other AWS resource types over time.

Applicability – Predictive scaling is a great match for web sites and applications that undergo periodic traffic spikes. It is not designed to help in situations where spikes in load are not cyclic or predictable.

Long-Term Baseline – Predictive scaling maintains the minimum capacity based on historical demand; this ensures that any gaps in the metrics won’t cause an inadvertent scale-in.

Available Now
Predictive scaling is available now and you can starting using it today in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Singapore) Regions.

Jeff;

 

New – EC2 Auto Scaling Groups With Multiple Instance Types & Purchase Options

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/

Earlier this year I told you about EC2 Fleet, an AWS building block that makes it easy for you to create fleets that are built from a combination of EC2 On-Demand, Reserved, and Spot Instances that span multiple EC2 instance types. In that post I showed you how to create a fleet and walked through an example that created a genomics processing pipeline that used a mix of M4 and M5 instances. I also dropped a hint to let you know that we were working on integrating EC2 Fleet with Auto Scaling and other AWS services.

Auto Scaling Across Multiple Instance Types & Purchase Options
Today I am happy to let you know that you can now create Auto Scaling Groups that grow and shrink in response to changing conditions, while also making use of the most economical combination of EC2 instance types and pricing models. You have full control of the instance types that will be used to build your group, along with the ability to control the mix of On-Demand and Spot. You can also update your existing Auto Scaling Groups to take advantage of this new feature.

The Auto Scaling Groups that you create are optimized anew each time a scale-out or scale-in event takes place, always seeking the lowest overall cost while meeting the other requirements set by your configuration. You can modify the configuration as newer instance types become available, allowing you to create a group that evolves in step with EC2.

Creating an Auto Scaling Group
I can create an Auto Scaling Group from the EC2 Console, CLI, or API. The first step is to make sure that I have a suitable Launch Template (it should not specify the use of Spot Instances). Here’s mine:

Then I navigate to my Auto Scaling Groups and click Create Auto Scaling group:

I click Launch Template, select my ProdWebServer template, and click Next Step to proceed:

I name my group and select Combine purchase models and instances to unlock the new functionality:

Now I select the instance types that I want to use. The list is prioritized: instances at the top of the list will be used in preference to those lower down when On-Demand instances are launched. My app will run fine on M4 or M5 instances with 2 or more vCPUs:

I can accept the default settings for my group’s composition or I can set them myself by unchecking Use default:

Here’s what I can do:

Maximum Spot Price – Sets the maximum Spot price that I want to pay. The default setting caps this bid at the On-Demand price.

Spot Allocation Strategy – Control the amount of per-AZ diversity for the Spot Instances. A larger number adds some flexibility at times when a particular instance type is in high demand within an AZ.

Optional On-Demand Base  – Controls how much of the initial capacity is made up of On-Demand Instances. Keeping this set to 0 indicates that I prefer to launch On-Demand Instances as a percentage of the total group capacity that is running at any given time.

On-Demand Percentage Above Base – Controls the percentage of the add-on to the initial group that is made up of On-Demand Instances versus the percentage that is made up of Spot Instances.

As you can see, I have full control over how my group is built. I leave them all as-is, set my group to start with 4 instances, choose my VPC subnets, and click Next to set up my scaling policies, as usual:

I disable scale-in for demo purposes (you don’t need to do this for your group):

I click past the Configure Notifications, and indicate that I want to tag my group and the EC2 instances in it:

Then I review my settings and click Create Auto Scaling Group to move ahead:

My initial group of four instances is ready to go within minutes:

I can filter by tag in the EC2 Console and display the Lifecycle column to see the mix of On-Demand and Spot Instances:

I can modify my Auto Scaling Group, reducing the On-Demand Percentage to 20% and doubling the Desired Capacity (this is my demo-mode way of showing you what happens when the group scales out):

The changes take effect within minutes; new Spot Instances are launched, some of the existing On-Demand Instances are terminated, and the composition of my group reflects the new settings:

Here are a couple of things to keep in mind when you start to use this cool new feature:

Reserved Instances – We plan to add support for the preferential use of Reserved Instances in the near future. Today, if you own Reserved Instances, specify their instance types as early as possible in the list I showed you earlier. Your discounts will apply to any On-Demand instances that match available Reserved Instances.

Weight – All instance types have the same weight; we plan to give you the ability to specify weights in the near future. This will allow you to specify custom capacity units for each instance using either memory or vCPUs, and to specify the overall desired capacity in the same units.

Cost – The feature itself is available to you at no charge. If you switch part or all of your Auto Scaling Groups over to Spot Instances, you may be able to save up to 90% when compared to On-Demand Instances.

ECS and EKS – If you are running Amazon ECS or Amazon Elastic Container Service for Kubernetes on a cluster that makes use of an Auto Scaling Group, you can update the group to make use of multiple instance types and purchase options.

Available Now
This feature is available now and you can start using today in all commercial AWS regions!

Jeff;

New Lower-Cost, AMD-Powered M5a and R5a EC2 Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-lower-cost-amd-powered-ec2-instances/

From the start, AWS has focused on choice and economy. Driven by a never-ending torrent of customer requests that power our well-known Virtuous Cycle, I think we have delivered on both over the years:

Choice – AWS gives you choices in a wide range of dimensions including locations (18 operational geographic regions, 4 more in the works, and 1 local region), compute models (instances, containers, and serverless), EC2 instance types, relational and NoSQL database choices, development languages, and pricing/purchase models.

Economy – We have reduced prices 67 times so far, and work non-stop to drive down costs and to make AWS an increasingly better value over time. We study usage patterns, identify areas for innovation and improvement, and deploy updates across the entire AWS Cloud on a very regular and frequent basis.

Today I would like to tell you about our latest development, one that provides you with a choice of EC2 instances that are more economical than ever!

Powered by AMD
The newest EC2 instances are powered by custom AMD EPYC processors running at 2.5 GHz and are priced 10% lower than comparable instances. They are designed to be used for workloads that don’t use all of compute power available to them, and provide you with a new opportunity to optimize your instance mix based on cost and performance.

Here’s what we are launching:

General Purpose – M5a instances are designed for general purpose workloads: web servers, app servers, dev/test environments, and gaming. The M5a instances are available in 6 sizes.

Memory Optimized – R5a instances are designed for memory-intensive workloads: data mining, in-memory analytics, caching, and so forth. The R5a instances are available in 6 sizes, with lower per-GiB memory pricing in comparison to the R5 instances.

The new instances are built on the AWS Nitro System. They can make use of existing HVM AMIs (as is the case with all other recent EC2 instance types, the AMI must include the ENA and NVMe drivers), and can be used in Cluster Placement Groups.

These new instances should be a great fit for customers who are looking to further cost-optimize their Amazon EC2 compute environment. As always, we recommend that you measure performance and cost on your own workloads when choosing your instance types.

General Purpose Instances
Here are the specs for the M5a instances:

Instance Name vCPUs RAM EBS-Optimized Bandwidth Network Bandwidth
m5a.large
2 8 GiB Up to 2.120 Gbps Up to 10 Gbps
m5a.xlarge
4 16 GiB Up to 2.120 Gbps Up to 10 Gbps
m5a.2xlarge
8 32 GiB Up to 2.120 Gbps Up to 10 Gbps
m5a.4xlarge
16 64 GiB 2.120 Gbps Up to 10 Gbps
m5a.12xlarge
48 192 GiB 5 Gbps 10 Gbps
m5a.24xlarge
96 384 GiB 10 Gbps 20 Gbps

Memory Optimized Instances
Here are the specs for the R5a instances:

Instance Name vCPUs RAM EBS-Optimized Bandwidth Network Bandwidth
r5a.large
2 16 GiB Up to 2.120 Gbps Up to 10 Gbps
r5a.xlarge
4 32 GiB Up to 2.120 Gbps Up to 10 Gbps
r5a.2xlarge
8 64 GiB Up to 2.120 Gbps Up to 10 Gbps
r5a.4xlarge
16 128 GiB 2.120 Gbps Up to 10 Gbps
r5a.12xlarge
48 384 GiB 5 Gbps 10 Gbps
r5a.24xlarge
96 768 GiB 10 Gbps 20 Gbps

Available Now
These instances are available now and you can start using them today in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Singapore) Regions in On-Demand, Spot, and Reserved Instance form. Pricing, as I noted earlier, is 10% lower than the equivalent existing instances. To learn more, visit our new AMD Instances page.

Jeff;

PS – We are also working on T3a instances; stay tuned for more info!