Tag Archives: Amazon EC2

Halodoc: Building the Future of Tele-Health One Microservice at a Time

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/halodoc-building-the-future-of-tele-health-one-microservice-at-a-time/

Halodoc, a Jakarta-based healthtech platform, uses tele-health and artificial intelligence to connect patients, doctors, and pharmacies. Join builder Adrian De Luca for this special edition of This is My Architecture as he dives deep into the solutions architecture of this Indonesian healthtech platform that provides healthcare services in one of the most challenging traffic environments in the world.

Explore how the company evolved its monolithic backend into decoupled microservices with Amazon EC2 and Amazon Simple Queue Service (SQS), adopted serverless to cost-effectively support new user functionality with AWS Lambda, and manages the high volume and velocity of data with Amazon DynamoDB, Amazon Relational Database Service (RDS), and Amazon Redshift.

For more content like this, subscribe to our YouTube channels This is My Architecture, This is My Code, and This is My Model, or visit the This is My Architecture AWS website, which has search functionality and the ability to filter by industry, language, and service.

Running Simcenter STAR-CCM+ on AWS with AWS ParallelCluster, Elastic Fabric Adapter and Amazon FSx for Lustre

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/running-simcenter-star-ccm-on-aws/

This post is contributed by Anh Tran, Senior HPC Specialist Solution Architect, AWS

Introduction

AWS recently introduced many HPC services that boost the performance and scalability of Computational Fluid Dynamics (CFD) workloads on AWS. These services include: Amazon FSx for Lustre, Elastic Fabric Adapter (EFA), and AWS ParallelCluster 2.5.1. In this technical post, I walk through these three services. Additionally, I outline an example of using AWS ParallelCluster to set up an HPC system with EFA and Amazon FSx for Lustre to run a CFD workload. The CFD application that you will set up during this blog post is Simcenter STAR-CCM+, the predominant CFD application from Siemens.

Service and solution overview

Services

This blog primarily uses three services – Amazon FSx for Lustre, EFA, and AWS ParallelCluster. Let’s dig into each of these services before reviewing the solution.

Amazon FSx for Lustre

In December 2018, AWS released Amazon FSx for Lustre. This is a fully managed, high-performance file system, optimized for fast processing workloads like HPC. Amazon FSx for Lustre allows users to access and alter data from either Amazon S3 or on-premises storage seamlessly and exceptionally fast. For example, you can launch and run a file system that provides low-latency access to your data. Additionally, you can read and write data at speeds of up to hundreds of gigabytes per second of throughput, and millions of IOPS. This speed and low latency unleash innovation at an unparalleled pace. This blog post uses the latest version of Amazon FSx for Lustre, which recently added a new API for moving data in and out of Amazon S3. This API also includes POSIX support, which allows files to mount with the same user ID. Additionally, the latest version includes a new backup feature that allows you to back up your files to an S3 bucket. I go into more detail on how to take advantage of this at the end of the blog.
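AWS ParallelCluster creates this file system for you later in this post, but for reference, a standalone S3-linked file system can be created directly from the AWS CLI along these lines (a sketch only; the bucket name and subnet ID are placeholders):

# Sketch: create a 1.2-TiB Lustre file system linked to an S3 bucket.
# Replace the bucket name and subnet ID with your own values.
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids subnet-0123456789abcdef0 \
    --lustre-configuration ImportPath=s3://my-example-bucket,ExportPath=s3://my-example-bucket/export,ImportedFileChunkSize=1024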

Elastic Fabric Adapter

In April of 2019, AWS released EFA. This enables you to run applications requiring high levels of inter-node communications at scale on AWS.

AWS ParallelCluster 2.5.1.

AWS ParallelCluster is an open source cluster management tool that simplifies deploying and managing HPC clusters with Amazon FSx for Lustre, EFA, a variety of job schedulers, and the MPI library of your choice. AWS ParallelCluster simplifies cluster orchestration on AWS so that HPC environments become easy to use, even if you're new to the cloud. AWS recently released AWS ParallelCluster 2.5.1, which is the version we use in this blog.

These three AWS HPC components are optimal for CFD applications. Together, they provide simple deployment of HPC systems on AWS, low latency network communication for MPI workloads, and a fast, parallel filesystem. Now, let’s take a look at how these services come together and seamlessly run a real CFD application: Simcenter STAR-CCM+.

Solution

AWS has a long-standing collaboration with Siemens. AWS and Siemens are dedicated to enhancing Siemens’ customer experiences when they run Simcenter STAR-CCM+ apps on AWS. I am excited to walk you through the steps and the best practices for running Simcenter STAR-CCM+, the predominant CFD application from Siemens.

The Simcenter STAR-CCM+ application runs on an HPC system. This system is optimized with EFA and Amazon FSx for Lustre — all of which is managed by AWS ParallelCluster. AWS ParallelCluster simplifies the deployment process to such an extent that you can set up your HPC cluster with a high throughput parallel file system (Amazon FSx for Lustre), a high-throughput and low-latency network interface (EFA), and high-bandwidth network interconnects (100 Gbps using C5n instances) in less than 15 minutes.

Now that you have the services and solution overviews, we can get started. This blog post includes the following steps:

  1. Creating an HPC infrastructure stack on AWS, which will include:
    • How to set up AWS ParallelCluster for best performance
    • How to enable EFA
    • How to enable the Amazon FSx for Lustre file system and use basic Amazon FSx for Lustre features for STAR-CCM+
    • How to connect to a remote desktop session using NICE DCV
  2. Installing the Simcenter STAR-CCM+ application and submitting a Simcenter STAR-CCM+ job to the HPC cluster.

The following diagram outlines these steps.

Steps

Setting up the HPC infrastructure stack

Before you can run Simcenter STAR-CCM+, you need to build an HPC cluster first.  Some best practices that you should consider when setting up a cluster on AWS include:

  • Turn off Hyper-Threading (HT): AWS instances have HT turned on by default (a quick way to verify it is off on a node is shown right after this list).
  • Use EFA and a cluster placement group for your compute fleet to minimize the latency between nodes.
  • Select the right instance type for your compute fleet. Here, I use c5n.18xlarge because of its high-performance CPU, high-bandwidth networking, and EFA network interface capabilities.
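As a quick check of the first point, once a compute node is up you can confirm that hyper-threading is off with lscpu (a minimal sketch, assuming a standard Linux AMI): a value of 1 for "Thread(s) per core" means HT is disabled.

# On a compute node: "Thread(s) per core: 1" means hyper-threading is disabled.
lscpu | grep -E '^(Thread|Core|Socket)'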

With these HPC best practices in mind, you can set up your AWS ParallelCluster.

Set up AWS ParallelCluster:

$ aws s3 mb s3://benchmark-starccm
make_bucket: benchmark-starccm

*Note: As an example, I create an S3 bucket named benchmark-starccm; however, you should create an S3 bucket with a different name of your choice, because S3 bucket names must be globally unique.

Let’s download the STAR-CCM+ installation file and a case file, then upload them to the S3 bucket that we just created.

  • Download latest Simcenter STAR-CCM+ package from the Siemens portal. It will look something like this: STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
  • Download the Le Mans case file. The Le Mans case is a 104 million cell mesh, which even today is considered large for a CFD case. The file will look something like this: LeMans_104M.sim

After downloading the STAR-CCM+ software and the LeMans case file, upload them to the S3 bucket created above:

aws s3 cp STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip s3://benchmark-starccm
aws s3 cp LeMans_104M.sim s3://benchmark-starccm/

We will use this same S3 bucket to install the Simcenter STAR-CCM+ application later in this tutorial.

With all the groundwork done, we can now build our HPC cluster. For more detailed instructions, you can consult Getting Started with AWS ParallelCluster.

  • Install AWS ParallelCluster
pip install aws-parallelcluster
  • Configure AWS ParallelCluster with some basic network information such as AWS Region ID, VPC ID, Subnet ID

pcluster configure

Modify your ~/.parallelcluster/config file to include a cluster section that minimally includes the following:

[aws]
aws_region_name = us-east-2

[cluster default]
vpc_settings = public
key_name = <Key-Name>
initial_queue_size = 2
max_queue_size = 100
maintain_initial_size = true
placement_group = DYNAMIC
placement = cluster
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge
cluster_type = ondemand

base_os = centos7
tags = {"Name" : "STARCCM"}

enable_efa = compute
fsx_settings = fsxshared
disable_hyperthreading = true

dcv_settings = hpc-dcv

[vpc public]
master_subnet_id = subnet-<Subnet-ID>
vpc_id = vpc-<VPC-ID>

[global]
update_check = true
sanity_check = true
cluster_template = default

[dcv hpc-dcv]

enable = master

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200
imported_file_chunk_size = 1024
import_path = s3://benchmark-starccm

export_path = s3://benchmark-starccm

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

  • Now, create your first HPC cluster with the name starccm by running:

pcluster create starccm

Your HPC cluster should be ready in about 15 minutes.
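While you wait, you can check the provisioning state from the machine where you ran pcluster create (the status subcommand is part of ParallelCluster 2.x; the cluster name matches the one created above):

# Poll the CloudFormation status of the cluster created above.
pcluster status starccm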

While you’re waiting for your cluster to be ready, let’s take a deeper look at what some of the different parameters we used mean for our HPC cluster:

initial_queue_size: We will start with two compute instances after the HPC cluster is up.

max_queue_size: We will limit the maximum compute fleet to 100 instances. This allows us room to scale our jobs up to a large number of cores while putting a limit on the number of compute nodes to help control costs.

base_os: For this blog, we select CentOS 7 as the base OS. Currently, Amazon Linux (alinux), CentOS 7 (centos7), Ubuntu 16.04 (ubuntu1604), and Ubuntu 18.04 (ubuntu1804) are supported with EFA.

master_instance_type: This can be any instance type. Here we choose c5.xlarge because it is inexpensive and relatively fast for the head node.

compute_instance_type: We select c5n.18xlarge because it is optimized for compute-intensive workloads and supports EFA for better scaling of HPC workloads. Note that EFA is currently only available on c5n.18xlarge, c5n.metal, i3en.24xlarge, p3dn.24xlarge, inf1.24xlarge, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, and r5n.24xlarge. See the docs for currently supported instances.

placement_group: We use placement_group to ensure our instances are located as physically close to one another as possible to minimize the latency between compute nodes and take advantage of EFA’s low latency networking.

enable_efa: With just one configuration line, we can easily turn on EFA support for our HPC cluster.

dcv_settings = hpc-dcv: With AWS ParallelCluster 2.5.1 you can use NICE DCV to support your remote visualization needs.

disable_hyperthreading: This setting turns off hyper-threading on the cluster.

[fsx fsxshared]: This section contains the settings to define your FSx for Lustre parallel file system, including the location where the shared directory will be mounted, the storage capacity for the filesystem, the chunk size for files to be imported, and the location from which the data will be imported. You can read more about FSx for Lustre here.

[dcv hpc-dcv]: This section contains the settings to define your remote visualization setup. You can read more about DCV with AWS ParallelCluster here.

  •  After you set up your config file for AWS ParallelCluster, log in and verify that you can access the cluster’s head node
pcluster ssh starccm -i ~/path/to/ssh_key
  • Verify the compute nodes are up. We should see two c5n.18xlarge nodes.
$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
global - - - - - - - - - -
ip-172-31-14-220 lx-amd64 36 2 36 36 0.49 184.6G 11.7G 0.0 0.0
ip-172-31-2-137 lx-amd64 36 2 36 36 0.45 184.6G 11.7G 0.0 0.0
  • Verify the EFA driver has been loaded successfully.

To verify that EFA is installed correctly, SSH into one of the compute nodes and run:

$ which mpirun
/opt/amazon/openmpi/bin/mpirun

$ fi_info -p efa

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA

provider: efa
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::97:9fff:fe1e:4e78
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD

At this point, EFA is verified.

Install Simcenter STAR-CCM+ application

Now that the HPC cluster using AWS ParallelCluster is set up, it's time to install the Simcenter STAR-CCM+ application. In the prior steps, you uploaded a Simcenter STAR-CCM+ installation file and a case file to an S3 bucket and used that S3 bucket as a source for the Amazon FSx for Lustre /fsx storage. As soon as the cluster is created, the installation file and the case file will be available in /fsx.

$ cd /fsx
$ ls
LeMans_104M.sim  STAR-CCM-14.06.013_01_linux-x86_64.tar.gz  STAR-CCM+14.06.013_01_linux-x86_64.tar.gz

As you can see, Amazon FSx for Lustre has already downloaded the case file from the S3 bucket to the /fsx partition, so now you can start installing the Simcenter STAR-CCM+ software using the following steps.

  • Install Simcenter STAR-CCM+ on /fsx, the 1.2-TB Lustre file system that you configured in a previous step.
cd /fsx
sudo unzip STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.zip
cd STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1
./STAR-CCM+15.02.003_01_linux-x86_64-2.12_gnu7.1.sh
Select Installation Location : /fsx/Siemens

After following all the standard installation steps from Simcenter STAR-CCM+, the application should be installed at the following location:

/fsx/Siemens/15.02.003/STAR-CCM+15.02.003/

  • Test the installation
$ /fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+ -version
Simcenter STAR-CCM+ 2020.1 Build 15.02.003 (linux-x86_64-2.12/gnu7.1)

When you see the output above, Simcenter STAR-CCM+ is installed correctly, and you can now run the application.

Running Simcenter STAR-CCM+ on AWS ParallelCluster

Before moving on, let's recap what you've done so far.

  1. You set up an HPC cluster using AWS ParallelCluster with compute-optimized C5n instances, an Amazon FSx for Lustre filesystem, and EFA-enabled networking.
  2. You installed the Simcenter STAR-CCM+ application.

Now, let us create an SGE submission script and submit a job to the HPC cluster.

  • Create an SGE job submission script:
cd /fsx/

vi star-ccm.qsub

#!/bin/bash
#$ -N check
#$ -cwd
#$ -j Y
#$ -pe mpi 252

date

your_pod_key="your license key"
ccmp="/fsx/Siemens/15.02.003/STAR-CCM+15.02.003/star/bin/starccm+"
case="/fsx/LeMans_104M.sim"

fabric="OFI"
tag="fsx-${fabric}"

${ccmp} \
-power \
-podkey $your_pod_key \
-licpath [email protected] \
-mpi openmpi \
-bs sge \
-benchmark:"-preits 40 -nits 20 -nps ${NSLOTS} -tag ${tag}" \
${case}

date

mkdir ${NSLOTS}new
mv *xml ${NSLOTS}new
  • Submit your Simcenter STAR-CCM+ job to your HPC cluster
qsub -V star-ccm.qsub

Now you have submitted an HPC job that requests 252 cores; with hyper-threading disabled, that corresponds to seven c5n.18xlarge instances (36 physical cores each).

  • Check the status of your jobs by running
qstat -f

Simcenter STAR-CCM+ result analysis

Here are sample scaling results for the LeMans 104M-cell benchmark case. As you can see, Simcenter STAR-CCM+ shows exciting performance and scaling results on AWS: the performance is fast, and the simulation scales very well thanks to EFA-based networking and strong CPU performance.

Connect to a NICE DCV session

AWS ParallelCluster is now natively integrated with NICE DCV. You can configure a NICE DCV session to visualize your STAR-CCM+ result or connect to the application remotely.

As you will recall from when we configured AWS ParallelCluster in the previous section, I named the DCV settings section hpc-dcv. To create and connect to a NICE DCV session on the cluster, just run:

pcluster dcv connect starccm -k <Key-Name>

After you connect to the NICE DCV session, you will be able to access a Linux desktop to work with the STAR-CCM+ application.

Back up Simcenter STAR-CCM+ results to the S3 bucket

After you finish your STAR-CCM+ simulation, you can back up the data in /fsx to the S3 bucket that you created earlier, using Data Repository Tasks. Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and its linked S3 bucket. One such operation is exporting your changed file system contents back to the linked S3 bucket.

*Note: In order to use this new Amazon FSx for Lustre feature, you need AWS CLI version 1.16.309 or above.
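You can confirm the version you have installed with:

aws --version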

In this case, I choose to export the STAR-CCM+ application directory, Siemens, as an example.

  • Exit the HPC head node, and go back to your laptop or Cloud9 environment where you have configured your AWS CLI. Find your Amazon FSx for Lustre file system ID by running:
aws fsx describe-file-systems
  • After you find the Amazon FSx for Lustre ID, which looks similar to fs-0d72d520f620d765a, create a backup of the data by running create-data-repository-task:

aws fsx create-data-repository-task --file-system-id fs-0d72d520f620d765a --type EXPORT_TO_REPOSITORY --paths Siemens,testfsx --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://benchmark-starccm/

Explanation:

--file-system-id: your file system ID

--type EXPORT_TO_REPOSITORY: exports the data back to the S3 bucket

--paths Siemens,testfsx: the directories you want to export to the S3 bucket

Format=REPORT_CSV_20191124: note that this is the only report format Amazon FSx for Lustre currently supports, so keep it as is.

  • Check the status of the backup by running describe-data-repository-tasks:
aws fsx describe-data-repository-tasks

As you can see, I can now use FSx for Lustre to install the Simcenter STAR-CCM+ application and seamlessly move case data between on-premises storage, Amazon S3, and the AWS HPC system. I can create an FSx for Lustre file system linked to an S3 bucket and export data back to that bucket after running the CFD application.

Conclusion

Give it a try, set up an HPC environment, and let us know how it goes! If you need more information about running CFD and HPC cases on AWS you can find it on our HPC home page. Please feel free to contact us with questions that you might have.

 

Estimating Amazon EC2 instance needed when migrating ERP from IBM Power

Post Syndicated from Martin Yip original https://aws.amazon.com/blogs/compute/estimating-amazon-ec2-instance-needed-when-migrating-erp-from-ibm-power/

This post courtesy of CK Tan, AWS, Enterprise Migration Architect – APAC

Today, many enterprise customers are keen on migrating their mission-critical Enterprise Resource Planning (ERP) applications, such as Oracle E-Business Suite (Oracle EBS) or SAP, from IBM Power Systems running on-premises to the AWS cloud. They realize that running critical ERP applications on AWS will enable their business to be more agile, cost-effective, and secure than on premises. AWS provides cloud-native services that streamline your ability to adopt emerging technologies, enabling greater innovation and faster time-to-value.

However, it is essential to note that IBM Power Systems and Amazon Elastic Compute Cloud (EC2) use different CPU processor architectures. Amazon EC2 instances run on either x86 or ARM-based processors, while IBM Power Systems run on IBM POWER processors. As a result, there are no direct performance benchmarks mapping the two platforms, or sizing guidance for enterprise customers who intend to migrate applications running on IBM Power Systems to Amazon EC2.

The purpose of this blog is to share a methodology, along with an example, of how to estimate the sizing of mission-critical applications when moving them from IBM Power Systems to Amazon EC2 instances running on the x86 platform.

Methodology

In order to normalize the CPU performance between IBM Power and Intel x86 processors, we may refer to SAP Application Performance Standard (SAPS) for performance benchmark purposes. SAPS is a hardware-independent unit of measurement that describes the performance of a system configuration in the SAP environment. It is derived from the Sales and Distribution (SD) two-tier benchmark, where 100 SAPS is defined as 2,000 fully business processed order line items per hour.

In technical terms, this throughput is achieved by processing 6,000 dialog steps (screen changes), 2,000 postings per hour in the SD Benchmark, or 2,400 SAP transactions.

In the SD benchmark, fully business processed means the full business process of an order line item: creating the order, creating a delivery note for the order, displaying the order, changing the delivery, posting a goods issue, listing orders, and creating an invoice.

In this methodology, we compare performance sizing by using SAPS, a benchmark that is not published for other ERP software such as Oracle EBS. However, we believe this methodology is the most pertinent indirect performance benchmark from an ERP software perspective, as most enterprise businesses have a similar order-processing workflow.

Discussion by Example

We will walk you through a real example to demonstrate how you can translate and migrate an SAP ERP 6.0 or Oracle EBS application database that is running on an IBM Power 795 (IBM P795) with IBM AIX from on-premises to the AWS cloud environment by choosing the right size of Amazon EC2 instance.

Step 1: IBM P795 System Performance Analysis

Users can leverage the IBM nmon Analyzer, a free tool that produces AIX performance analysis reports for IBM Power Systems. The tool is designed to work with the latest version of nmon, but it is also tested with older versions for backwards compatibility. The tool is updated whenever nmon is updated, and at irregular intervals for new functionality. It allows users to:

  • View the data in spreadsheet form
  • Eliminate “bad” data
  • Automatically produce graphs for presentation to clients

Below is the performance analysis captured with the IBM nmon Analyzer for CPU, memory, disk throughput, disk input/output (I/O), and network I/O during the database's peak demand at month-end closing.

Figure 1: Actual Number of Physical Cores of CPU Utilization Analysis on IBM P795

 

Figure 2: Memory Usage Analysis on IBM P795

 

Figure 3: Total Disk Throughput Analysis on IBM P795

 

Figure 4: Total Disk I/O on IBM P795

 

Figure 5: Network I/O Analysis on IBM P795

Step 2: CPU Normalization

Users can find the SAPS results for different hardware manufacturers from this link. Table 1 below shows how to calculate the processing power for single CPU core with respect to different hardware manufacturers. We have chosen Amazon EC2 instances r5.metal and r4.16xlarge as potential candidates because they provide memory sizing which best fits the current demand of 408 GB – refer to Figure 2.

Instance Type   | SAPS (2-Tier) | Memory (GB) | Total CPU Cores | SAPS/Core
IBM P795        | 688,630       | 4,096       | 256             | 2,690
EC2 r5.metal    | 140,000       | 768         | 48              | 2,917
EC2 r4.16xlarge | 76,400        | 512         | 36              | 2,122

Table 1: Calculate the SAPS Processing Power for Single Core of Physical CPU

In order to calculate the Normalization Factor for any targeted Amazon EC2 instance with respect to source system of IBM Power system, we will apply Equation 1 as shown below:

Normalization Factor = [SAPS per Core of Amazon EC2 instance] / [SAPS per Core of IBM Power System]     (1)

Therefore, by inserting the SAPS/Core being calculated from Table 1 into Equation (1):

  • The normalization factor for one core of the Amazon EC2 r5.metal instance relative to the IBM P795 is 1.08
  • The normalization factor for one core of the Amazon EC2 r4.16xlarge instance relative to the IBM P795 is 0.79
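As a quick sanity check of these numbers, the arithmetic can be reproduced in a throwaway shell snippet (illustrative only; values come straight from Table 1, and rounding explains any small differences):

# Reproduce the SAPS/core values and the Equation (1) normalization factors.
awk 'BEGIN {
  p795 = 688630 / 256      # ~2,690 SAPS/core
  r5   = 140000 / 48       # ~2,917 SAPS/core
  r4   = 76400  / 36       # ~2,122 SAPS/core
  printf "r5.metal factor:    %.2f\n", r5 / p795   # ~1.08
  printf "r4.16xlarge factor: %.2f\n", r4 / p795   # ~0.79
}'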

Table 2 below shows how to calculate the normalized total number of CPU cores allocated to each system. We need to make sure that the normalized total number of CPU cores of the selected Amazon EC2 instance is greater than or equal to the assigned entitlement capacity of the IBM P795 in this case.

 

Instance Type   | Normalization Factor (A) | Total # CPU Cores (B) | Normalized Total # CPU Cores (A x B)
IBM P795        | 1.00                     | 32                    | 32
EC2 r5.metal    | 1.08                     | 48                    | 52
EC2 r4.16xlarge | 0.79                     | 36                    | 28

Table 2: Normalized Total Number of CPU Core

Figure 1 indicates that the peak processing power needed by the ERP database running on the IBM P795 is 29 physical CPU cores, compared to an assigned entitlement capacity of 32. As a result, the r5.metal instance should be the right candidate: it delivers sufficient processing power with enough buffer for future business growth in the next 12 to 24 months.

Step 3: Amazon EC2 Right Sizing

Next, we need to check that other specifications such as network I/O, disk throughput, and disk I/O also meet the current peak demand of the application. Table 3 shows the relevant specifications of r5.metal. It is clear from Table 3 that r5.metal meets all the demand requirements of the ERP database running on the IBM P795.

Specification                  | Amazon EC2 r5.metal
Physical CPU cores             | 48 > 29 (current CPU utilization)
RAM (GiB)                      | 768 > 408 (current memory usage)
Network performance (GBps)     | 3 > 0.3 (current network I/O throughput)
Dedicated EBS bandwidth (GBps) | ~2.4 > 2.0 (current disk throughput)
Maximum IOPS (16 KiB I/O)      | 80,000 > 25,000 (current disk I/O consumption)

Table 3: EC2 r5.metal Specifications
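If you want to double-check these figures yourself, the EC2 DescribeInstanceTypes API exposes most of them. A sketch of the query (field names follow the standard API response; trim the query if a field is not reported for a given instance type):

# Inspect r5.metal specifications straight from the EC2 API.
aws ec2 describe-instance-types \
    --instance-types r5.metal \
    --query 'InstanceTypes[0].{Cores:VCpuInfo.DefaultCores,MemoryMiB:MemoryInfo.SizeInMiB,Network:NetworkInfo.NetworkPerformance,EbsBaselineMBps:EbsInfo.EbsOptimizedInfo.BaselineThroughputInMBps}' \
    --output table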

Conclusion

The methodology and example discussed in this blog give a pragmatic estimate for selecting the right type and size of Amazon EC2 instance when you plan to migrate a mission-critical ERP application from an IBM Power System in your on-premises data center to the AWS cloud. This methodology has been applied to many enterprise cloud transformation projects and delivered more predictable performance with significant TCO savings. Additionally, this methodology can be adopted for capacity planning and helps enterprises establish strong business justifications for heterogeneous platform migration.

EC2 Price Reduction in the São Paulo Region (R5 and I3)

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/ec2-price-reduction-in-the-sao-paulo-region-r5-and-i3/

I’ve got good news for AWS customers using our South America (São Paulo) Region!

Effective February 1, 2020 we are reducing prices for On-Demand, Reserved and Dedicated Instances as follows:

  • All R5 families (R5, R5a, R5d, R5ad) – Up to 25%.
  • All I3 families (I3, I3en) – 13%.

The pricing pages have been updated.

Questions?
If you need assistance or have feedback, please reach out to your usual AWS support contacts, or post a message in the AWS Forum for Amazon EC2.

– Julien

Update on Amazon Linux AMI end-of-life

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/update-on-amazon-linux-ami-end-of-life/

Launched in September 2010, the Amazon Linux AMI has helped numerous customers build Linux-based applications on Amazon Elastic Compute Cloud (EC2). In order to bring them even more security, stability, and productivity, we introduced Amazon Linux 2 in 2017. Adding many modern features, Amazon Linux 2 is backed by long-term support, and we strongly encourage you to use it for your new applications.

As stated in the FAQ, we documented that the last version of the Amazon Linux AMI (2018.03) would be end-of-life on June 30, 2020. Based on customer feedback, we are extending the end-of-life date, and we’re also announcing a maintenance support period.

End-of-life Extension
The end-of-life for Amazon Linux AMI is now extended to December 31, 2020: until then, we will continue to provide security updates and refreshed versions of packages as needed.

Maintenance Support
Beyond December 31, 2020, the Amazon Linux AMI will enter a new maintenance support period that extends to June 30, 2023.

During this maintenance support period:

  • The Amazon Linux AMI will only receive critical and important security updates for a reduced set of packages.
  • It will no longer be guaranteed to support new EC2 platform capabilities, or new AWS features.

Supported packages will include:

  • The Linux kernel,
  • Low-level system libraries such as glibc and openssl,
  • Popular packages that are still in a supported state in their upstream sources, such as MySQL and PHP.

We will provide a detailed list of supported and unsupported packages in future posts.

Questions?
If you need assistance or have feedback, please reach out to your usual AWS support contacts, or post a message in the AWS Forum for Amazon Linux. Thank you for using Amazon Linux AMI!

– Julien

 

New – T3 Instances on Dedicated Single-Tenant Hardware

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-t3-instances-on-dedicated-single-tenant-hardware/

T3 instances use a burst pricing model that allows you to host general purpose workloads at low cost, with access to sustainable, full-core performance when needed. You can choose from seven different sizes and receive an assured baseline amount of processing power, courtesy of custom high frequency Intel® Xeon® Scalable Processors.

Our customers use them to host many different types of production and development workloads including microservices, small and medium databases, and virtual desktops. Some of our customers launch large fleets of T3 instances and use them to test applications in a wide range of conditions, environments, and configurations.

We launched the first EC2 Dedicated Instances way back in 2011. Dedicated Instances run on single-tenant hardware, providing physical isolation from instances that belong to other AWS accounts. Our customers use Dedicated Instances to further their compliance goals (PCI, SOX, FISMA, and so forth), and also use them to run software that is subject to license or tenancy restrictions.

Dedicated T3
Today I am pleased to announce that we are now making all seven sizes (t3.nano through t3.2xlarge) of T3 instances available in dedicated form, in 14 regions. You can now save money by using T3 instances to run workloads that require the use of dedicated hardware, while benefiting from access to the AVX-512 instructions and other advanced features of the latest generation of Intel® Xeon® Scalable Processors.

Just like the existing T3 instances, the dedicated T3 instances are powered by the Nitro system, and launch with Unlimited bursting enabled. They use ENA networking and offer up to 5 Gbps of network bandwidth.

You can launch dedicated T3 instances using the EC2 API, the AWS Management Console, or the AWS Command Line Interface (CLI):

$ aws ec2 run-instances --placement Tenancy=dedicated ...

or via a CloudFormation template (set tenancy to dedicated in your Launch Template).
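For example, here is a minimal sketch of creating such a Launch Template from the CLI (the template name, instance size, and AMI ID are placeholders):

# Create a Launch Template whose instances run on dedicated, single-tenant hardware.
aws ec2 create-launch-template \
    --launch-template-name t3-dedicated-example \
    --launch-template-data '{"InstanceType":"t3.large","ImageId":"ami-0123456789abcdef0","Placement":{"Tenancy":"dedicated"}}'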

Now Available
Dedicated T3 instances are available in the US East (N. Virginia), US East (Ohio), US West (N. California), South America (São Paulo), Canada (Central), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Europe (London), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Seoul) Regions.

You can purchase the instances in On-Demand or Reserved Instance form. There is an additional fee of $2 per hour when at least one Dedicated Instance of any type is running in a region, and $0.05 per hour when you burst above the baseline performance for an extended period of time.

Jeff;

TMA Special: Connecting Taza Chocolate’s Legacy Equipment to the Cloud

Post Syndicated from Todd Escalona original https://aws.amazon.com/blogs/architecture/tma-special-connecting-taza-chocolates-legacy-equipment-to-the-cloud/

As a “bean to bar” chocolate manufacturer, Taza Chocolate uses traditional stone ground mills for the production of its famous chocolate discs. The analog, mid-century machines that the company imported from Central America were never built to connect to the cloud.

Along comes Tulip Interfaces, an AWS Industrial Software Competency Partner that makes the human and machine interaction easier by replacing paper processes with digital automation. Tulip retrofitted Taza’s legacy equipment with Internet of Things (IoT) sensors and connected it back to the AWS cloud.

Taza's AWS cloud integration begins with Tulip's own physical gateway that connects systems and machinery on the plant floor. Tulip then deploys IoT sensors to the machinery and passes outputs to the AWS cloud using an encrypted web socket, where Tulip's Kubernetes workers, managed by Kops, automatically schedule services across highly available instances and process requests.

All job completion data is then fed to an Amazon RDS Multi-AZ PostgreSQL database that allows Taza to run visualizations and analytics for more insight using Prometheus and Grafana. In addition, all of the application definition metadata is contained in a MongoDB database service running on Amazon Elastic Compute Cloud (EC2) instances, which in turn are VPC-peered with the Kubernetes clusters. On top of this backend, Tulip uses a player application to stream metrics in near real time that are displayed on the dashboard down on the shop floor and can be easily examined to help guide operations and foster continuous improvement efforts in manufacturing.

Taza has realized many benefits from monitoring machine availability, performance, ambient conditions as well as overall process enhancements.

In this special, on-site This is My Architecture video, AWS Solutions Architect Evangelist Todd Escalona takes us on his journey through the Taza Chocolate factory where he meets with Taza’s Director of Manufacturing, Rich Moran, and Tulip’s DevOps lead, John Defreitas, to further explore how Tulip enables Taza Chocolate’s legacy equipment for cloud-based plant automation.

*Check out more of the This Is My Architecture video series.

Alejandra’s Top 5 Favorite re:Invent🎉 Launches of 2019

Post Syndicated from Alejandra Quetzalli original https://aws.amazon.com/blogs/aws/alejandras-top-5-favorite-reinvent%F0%9F%8E%89-launches-of-2019/

favorite re:Invent launches of 2019

While re:Invent 2019 may feel well over, I'm still feeling elated and curious about several of the launches that were announced that week. Is it just me, or did some of the new feature announcements seem to bring us closer to the sci-fi worlds (AWS Wavelength, anyone? And don't get me started on Amazon Braket) of the future we envisioned as kids?

The future might very well be here. Can you handle it?

If you can, then I’m pumped to tell you why the following 5 launches of re:Invent 2019 got me the most excited.

[CAVEAT: Out of consideration for your sanity, dear reader, we try to keep these posts to a maximum word length. After all, I wouldn’t want you to fall asleep at your keyboard during work hours! Sadly, this also means I limited myself to only sharing a set number of the cool, new launches that happened. If you’re curious to read about ALL OF THEM, you can find them here: 2019 re:Invent Announcement Summary Page.]

 

1. Amazon Braket: explore Quantum Computing

Backstory of why I picked this one…

First of all, let’s address the 🐘elephant in the room🐘 and admit that 99.9% of us don’t really know what Quantum Computing is. But we want to! Because it sounds so cool and futuristic. So let’s give it a shot…

According to The Internet, a quantum computer is any computational device that uses the quantum mechanical phenomena of superposition and entanglement to perform data operations. The basic principle of quantum computation is that quantum properties can be used to represent data and perform operations on it. Also, fun fact… in a "normal" computer —like your laptop— that data…that information… is stored in something called bits. But in a quantum computer, it is stored as qubits (quantum bits).

Quantum Computing is still in its infancy. Are you wondering where it will go?

What got launched?

Amazon Braket is a new service that makes it easy for scientists, researchers, and developers to build, test, and run quantum computing algorithms.

Sounds cool, but what does that actually mean?

The way it works is that Amazon Braket provides a development environment that enables you to design your own quantum algorithms from scratch or choose from a set of pre-built algorithms. Once you've picked your algorithm of choice, Amazon Braket provides a simulation service that helps you troubleshoot and verify your implementation. Once you're ready, you can also choose to run your algorithm on a real quantum computer from one of our quantum hardware providers (e.g., D-Wave, IonQ, and Rigetti).

So what are you waiting for? Go explore the future of quantum computing with Amazon Braket!

👉🏽Don’t forget to check out the docs: aws.amazon.com/braket
⚠Sign up to get notified when it’s released.

 

2. AWS Wavelength: ultra-low latency apps for 5G

Backstory of why I picked this one…

When I was a kid in the 80s, we were still on the beginning stages of the first wireless technology.

1G.

It had a lot of similarities to an old AM/FM radio. And just like with radio stations, cell phone calls ended up receiving interference from other callers ALL THE TIME. Sometimes, the calls became staticky if you were too far away from cell phone towers.

But it’s no longer the 80s, my dear readers. It’s 2019 and we’re all the way up to 5G now.

[note: When talking about 1, 2, 3, 4, or 5G, the G stands for generation.]

What got launched?

AWS Wavelength combines high bandwidth and single-digit millisecond latency of 5G networks with AWS compute and storage services to enable developers to build new kinds of apps.

Phew, that was quite the brain🧠dump🗑, wasn’t it?

Sounds cool, but what does that actually mean?

Every generation of wireless technology has been defined by the speed of data transmission. So just how fast are we hoping 5G will be? Well, to give you a baseline…our fastest current 4G mobile networks offer about 45Mbps (megabits per second). But Qualcomm believes 5G could achieve browsing and download speeds about 10 to 20 times faster!

What makes this speed improvement possible is that 5G technology makes better use of the radio spectrum. It enables more devices to access the mobile internet at the same time. Thus, it's much better at handling thousands of devices simultaneously, without the congestion experienced in previous wireless generations.

At this speed, access to low latency services is really important. Why? Low latency is optimized to process a high volume of data messages with minimal delay (latency). This is exactly what you want if your business requires near real-time access to rapidly changing data.

Enter AWS Wavelength.

AWS Wavelength brings AWS services to the edge of the 5G network. It allows you to build the next generation of ultra-low latency apps using familiar AWS services, APIs, and tools. To deploy your app to 5G, simply extend your Amazon Virtual Private Cloud (VPC) to a Wavelength Zone and then create AWS resources like Amazon Elastic Compute Cloud (EC2) instances and Amazon Elastic Block Store (EBS) volumes.

The other neat news is that AWS Wavelength will launch in partnership with Verizon starting in 2020, and AWS is also working with other carriers like Vodafone, SK Telecom, and KDDI to expand Wavelength Zones to more locations by the end of 2020.

👉🏽Don’t forget to check out the docs: aws.amazon.com/wavelength
⚠Sign up to get notified when it’s released.

 

3. AWS DeepComposer: learn Machine Learning with a piano keyboard!

Backstory of why I picked this one…

I do not have a Machine Learning (ML) background. At all.

But I do have a piano and musical background. 🎹🎶I learnt how to play the piano at 4, and I first got into composing when I was about 12 years old. Not having a super fancy piano instructor at the time, I remember wondering how an average person could learn how to compose, regardless of your musical background.

What got launched?

AWS DeepComposer is a machine learning-enabled keyboard for developers that also uses AI (Artificial Intelligence) to create original songs and melodies.

Sounds cool, but what does that actually mean?

AWS DeepComposer includes tutorials, sample code, and training data that can be used to get started building generative models, all without having to write a single line of code! This is great, because it helps encourage people new to ML to still give it a whirl.

Now the other neat thing about AWS DeepComposer is that it opens the door for you to learn about Generative AI — one of the biggest advancements in AI technology. You'll learn about Generative Adversarial Networks (GANs), a Generative AI technique that pits two different neural networks against each other to produce new and original digital works based on sample inputs. With AWS DeepComposer, you are training and optimizing GAN models to create original music. 🎶

Is that awesome, or what?

👉🏽Don’t forget to check out the docs: aws.amazon.com/deepcomposer
⚠Sign up to get notified when it’s released.

 

4. Amplify: now it’s ready for iOS and Android devs too!

Backstory of why I picked this one…

I used to be a CSS developer. Joining the Back-End world was an accident for me, since I first assumed I’d always be a Front-End developer.

Amplify makes it easy for developers to build and deploy Full-Stack apps that leverage the cloud. It’s a service that really helps bridge the gap between Front and Back-End development. Seeing Amplify now offer SDKs and libraries for iOS and Android devs sounds even more inclusive and exciting!

What got launched?

The Amplify Framework (an open source project for building cloud-enabled mobile and web apps) is now ready for iOS and Android developers! There are now — in preview — Amplify iOS and Amplify Android libraries for building scalable and secure cloud-powered serverless apps.

Sounds cool, but what does that actually mean?

Developers can now add capabilities of Analytics, AI/ML, API (GraphQL and REST), DataStore, and Storage to their mobile apps with these new iOS and Android Amplify libraries.

This release also included support for the Predictions category in Amplify iOS that allows developers to easily add and configure AI/ML use cases with very few lines of code. (And no machine learning experience required!) This allows developers to then accomplish other use cases of text translation, speech to text generation, image recognition, text to speech, insights from text, etc. You can even hook it up to services such as Amazon Rekognition, Amazon Translate, Amazon Polly, Amazon Transcribe, Amazon Comprehend, and Amazon Textract.

👉🏽Don’t forget to check out the docs…
📳Android: aws-amplify.github.io/docs/android/start
📱iOS: aws-amplify.github.io/docs/ios/start

 

5. EC2 Image Builder

Backstory of why I picked this one…

In my 1st year at AWS as a Developer Advocate, I got really into robotics and IoT. I’m not giving that up anytime soon, but for 2020, I’m also excited to serve more customers that are new to core AWS services. You know, things like storage, compute, containers, databases, etc.

Thus, it came as no surprise to me when this new launch caught my eye… 👀

What got launched?

EC2 Image Builder is a service that makes it easier and faster to build and maintain secure server images. It greatly simplifies the creation, patching, testing, distribution, and sharing of Linux or Windows Server images.

Sounds cool, but what does that actually mean?

In the past, creating custom OS images felt way too complex and time consuming. Most dev teams had to manually update VMs or build automation scripts to maintain these images.

Can you imagine?

Today, Amazon’s Image Builder service simplifies this process by allowing you to create custom OS images via an AWS GUI environment. You can also use it to build an automated pipeline that customizes, tests, and distributes your images in addition to keeping them secure and up-to-date. Sounds like a win-win to me. 🏆

👉🏽Don’t forget to check out the docs: aws.amazon.com/image-builder

 

¡Gracias por tu tiempo!
~Alejandra 💁🏻‍♀️ & Canela 🐾

Running ANSYS Fluent on Amazon EC2 C5n with Elastic Fabric Adapter (EFA)

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-ansys-fluent-on-amazon-ec2-c5n-with-elastic-fabric-adapter-efa/

Written by: Nicola Venuti, HPC Specialist Solutions Architect 

In July 2019 I published "Best Practices for Running Ansys Fluent Using AWS ParallelCluster." That first post demonstrated how to launch ANSYS Fluent on AWS using AWS ParallelCluster. In this blog, I discuss a new AWS service: the Elastic Fabric Adapter (EFA). I also walk you through an example of using EFA for tightly coupled workloads. Finally, I demonstrate how you can accelerate your tightly coupled (MPI) workloads with EFA, which lowers your cost per job.

 

EFA is a new network interface for Amazon EC2 instances designed to accelerate tightly coupled (MPI) workloads on AWS. AWS announced EFA at re:Invent in 2018. If you want to learn more about EFA, you can read Jeff Barr’s blog post and watch this technical video led by our Principal Engineer, Brian Barrett.

 

After reading our first blog post, readers asked for benchmark results and the cost associated with a job. So, in addition to a step-by-step guide on how to run your first ANSYS Fluent job with EFA, this blog post also shows the results (in terms of rating and scaling curve) up to 5,000 cores for a common ANSYS Fluent benchmark, the Formula-1 Race Car (140M-cell mesh), and a cost-per-job comparison among the most suitable Amazon EC2 instance types.

 

Create your HPC Cluster

In this part of the blog, I walk you through the following: the setup of the AWS ParallelCluster configuration file, the setup of the post-install script, and the deployment of your HPC cluster.

 

Set up AWS ParallelCluster
I use AWS ParallelCluster in this example because it simplifies the deployment of HPC clusters on AWS. This AWS-supported, open source tool manages and deploys HPC clusters in the cloud. Additionally, AWS ParallelCluster is already integrated with EFA, which eliminates extra effort to run your preferred HPC applications.

 

The latest release (2.5.1) of AWS ParallelCluster simplifies cluster deployment in three main ways. First, the updates remove the need for custom AMIs. Second, important components (particularly NICE DCV) now run on AWS ParallelCluster. Finally, hyper-threading can be shut down using a new parameter in the configuration file.

 

Note: If you need additional instructions on how to install AWS ParallelCluster and get started, read this blog post, and/or the AWS ParallelCluster documentation.

 

The first few steps of this blog post differ from the previous post’s because of these updates. This means that the AWS ParallelCluster configuration file is different. In particular, here are the additions:

  1. enable_efa = compute in the [cluster] section
  2. the new [dcv] section and the dcv_settings parameter in the [cluster] section
  3. the new parameter disable_hyperthreading in the [cluster] section

These additions to the configuration file enable automatic functionalities that previously needed to be enabled manually.

 

Next, paste the following configuration into your preferred text editor:

[aws]
aws_region_name = <your-preferred-region>

[global]
sanity_check = true
cluster_template = fluentEFA
update_check = true

[vpc my-vpc-1]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<Subnet-ID>

[cluster fluentEFA]
key_name = <Key-Name>
vpc_settings = my-vpc-1
compute_instance_type=c5n.18xlarge
master_instance_type=c5n.2xlarge
initial_queue_size = 0
max_queue_size = 100
maintain_initial_size = true
scheduler=slurm
cluster_type = ondemand
s3_read_write_resource=arn:aws:s3:::<Your-S3-Bucket>*
post_install = s3://<Your-S3-Bucket>/fluent-efa-post-install.sh
placement_group = DYNAMIC
placement = compute
base_os = centos7
tags = {"Name" : "fluentEFA"}
disable_hyperthreading = true
fsx_settings = parallel-fs
enable_efa = compute
dcv_settings = my-dcv

[dcv my-dcv]
enable = master

[fsx parallel-fs]
shared_dir = /fsx
storage_capacity = 1200
import_path = s3://<Your-S3-Bucket>
imported_file_chunk_size = 1024
export_path = s3://<Your-S3-Bucket>/export

 

Now that AWS ParallelCluster is set up, you are ready for the second step: the post-install script.

 

Edit the Post-Install Script

Below is an example of a post-install script. Make sure it is saved in the S3 bucket defined by the parameter post_install = s3://<Your-S3-Bucket>/fluent-efa-post-install.sh in the configuration file above.

 

To upload the post-install script into your S3 bucket, run the following command:

aws s3 cp fluent-efa-post-install.sh s3://<Your-S3-Bucket>/fluent-efa-post-install.sh

#!/bin/bash

#this will disable the ssh host key checking
#usually not needed, but Fluent might require this setting.
cat <<\EOF >> /etc/ssh/ssh_config
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
EOF

# set higher ulimits,
# useful when running Fluent (and HPC applications in general) on multiple instances via MPI
cat <<\EOF >> /etc/security/limits.conf
* hard memlock unlimited
* soft memlock unlimited
* hard stack 1024000
* soft stack 1024000
* hard nofile 1024000
* soft nofile 1024000
EOF

#stop and disable the firewall
systemctl disable firewalld
systemctl stop firewalld

Now you have all the settings of AWS ParallelCluster in place, and you are ready to deploy your HPC cluster.

 

Deploy HPC Cluster

Run the following command to create your HPC cluster that is EFA enabled:

pcluster create -c fluent.config fluentEFA -t fluentEFA -r <your-preferred-region>

Note: The "*" at the end of the s3_read_write_resource parameter line is needed in order to let AWS ParallelCluster access your S3 bucket correctly. So, for example, if your S3 bucket is called "ansys-download," it would look like:

s3_read_write_resource=arn:aws:s3:::ansys-download*

 

You should have your HPC cluster up and running after following the three main steps in this section. Now you can install ANSYS Fluent.

Install ANSYS Fluent

The previous section of this post should take about 10 minutes to produce the following output:

Status: parallelcluster-fluentEFA - CREATE_COMPLETE
MasterPublicIP: 3.212.243.33
ClusterUser: centos
MasterPrivateIP: 10.6.1.153

Once you receive that successful output, you can move on to install ANSYS Fluent. Enter one of the following commands to connect to the master node of your new cluster via SSH or DCV:

  1. via SSH: pcluster ssh fluentEFA -i ~/my_key.pem
  2. via DCV: pcluster dcv connect fluentEFA --key-path ~/my_key.pem

 

Once you are logged in, become root (sudo su - or sudo -i), and install the ANSYS suite under the /fsx directory. You can install it manually, or you can use the sample script below.

 

Note: I defined the import_path = s3://<Your-S3-Bucket> in the Amazon FSx section of the configuration file. This tells Amazon FSx to preload all the data from <Your-S3-Bucket>. I recommend copying the ANSYS installation files, and any other file or package you need, to S3 in advance. This step ensures that your files are available under the /fsx directory of your cluster.

 

The example below uses the ANSYS iso installation files. You can use either the tar or the iso file. You can download both from the ANSYS Customer Portal under Download → Current Release.

 

Run this sample script to install ANSYS:

#!/bin/bash

#check the installation directory
if [ ! -d "${1}" -o -z "${1}" ]; then
echo "Error: please check the install dir"
exit -1
fi

ansysDir="${1}/ansys_inc"
installDir="${1}/"

ansysDisk1="ANSYS2019R3_LINX64_Disk1.iso"
ansysDisk2="ANSYS2019R3_LINX64_Disk2.iso"

# mount the Disks
disk1="${installDir}/AnsysDisk1"
disk2="${installDir}/AnsysDisk2"
mkdir -p "${disk1}"
mkdir -p "${disk2}"

echo "Mounting ${ansysDisk1} ..."
mount -o loop "${installDir}/${ansysDisk1}" "${disk1}"

echo "Mounting ${ansysDisk2} ..."
mount -o loop "${installDir}/${ansysDisk2}" "${disk2}"

# INSTALL Ansys WB
echo "Installing Ansys ${ansysver}"
"${disk1}/INSTALL" -silent -install_dir "${ansysDir}" -media_dir2 "${disk2}"

echo "Ansys installed"

umount -l "${disk1}"
echo "${ansysDisk1} unmounted..."

umount -l "${disk2}"
echo "${ansysDisk2} unmounted..."

echo "Cleaning up temporary install directory"
rm -rf "${disk1}"
rm -rf "${disk2}"

echo "Installation process completed!"

 

Congrats, now you have successfully installed ANSYS Workbench!

 

Adapt the ANSYS Fluent mpi_wrapper

Now that your HPC cluster is running and ANSYS Workbench is installed, you can patch ANSYS Fluent. ANSYS Fluent does not currently support EFA out of the box, so you need to make a few modifications to get your app running properly.

 

Complete the following steps to make the proper modifications:

 

Open mpirun.fl (an MPI wrapper script) with your preferred text editor:

vim /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl 

 

Comment out line 465:

# For better performance, suggested by Intel
FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv I_MPI_ADJUST_REDUCE 2 -genv I_MPI_ADJUST_ALLREDUCE 2 -genv I_MPI_ADJUST_BCAST 1"

 

In addition to that, line 548:

FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv LD_PRELOAD $INTEL_ROOT/lib/libmpi_mt.so"

should be modified as follows:

FS_MPIRUN_FLAGS="$FS_MPIRUN_FLAGS -genv LD_PRELOAD $INTEL_ROOT/lib/release_mt/libmpi.so"

The library file location and name changed for Intel 2019 Update 5. Fixing this will remove the following error message:

ERROR: ld.so: object '/opt/intel/parallel_studio_xe_2019/compilers_and_libraries_2019/linux/mpi/intel64//lib/libmpi_mt.so' from LD_PRELOAD cannot be preloaded: ignored

 

I recommend backing up the MPI wrapper script before any modification:

cp /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl /fsx/ansys_inc/v195/fluent/fluent19.5.0/multiport/mpi_wrapper/bin/mpirun.fl.ORIG

Once these steps are completed, your ANSYS Fluent installation is properly modified to support EFA.

 

Run your first ANSYS Fluent job using EFA

You are almost ready to run your first ANSYS Fluent job using EFA. You can use the same submission script used previously.  Export INTELMPI_ROOT or OPENMPI_ROOT in order to specify the custom MPI library to use.

 

The following script demonstrates this step:

#!/bin/bash

#SBATCH -J Fluent
#SBATCH -o Fluent."%j".out

module load intelmpi
export INTELMPI_ROOT=/opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/

export [email protected]<your-license-server>
export [email protected]<your-license-server>

basedir="/fsx"
workdir="${basedir}/$(date "+%d-%m-%Y-%H-%M")-${SLURM_NPROCS}-$RANDOM"
mkdir "${workdir}"
cd "${workdir}"
cp "${basedir}/f1_racecar_140m.tar.gz" .
tar xzvf f1_racecar_140m.tar.gz
rm -f f1_racecar_140m.tar.gz
cd bench/fluent/v6/f1_racecar_140m/cas_dat

srun -l /bin/hostname | sort -n | awk '{print $2}' > hostfile
${basedir}/ansys_inc/v195/fluent/bin/fluentbench.pl f1_racecar_140m -t${SLURM_NPROCS} -cnf=hostfile -part=1 -nosyslog -noloadchk -ssh -mpi=intel -cflush

Save this snippet as fluent-run-efa.sh under /fsx and run it as follows:

sbatch -n 2304 /fsx/fluent-run-efa.sh

 

Note 1: The number, 2,304 cores, is an example; this command tells AWS ParallelCluster to spin up 64 c5n.18xlarge instances. Feel free to change it and run it as you wish.

Note 2: You may want to copy the benchmark file f1_racecar_140m.tar.gz, or any other dataset you want to use, to S3 so that it's preloaded on Amazon FSx and ready for you to use.
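For example (assuming the archive is in your current directory and the bucket matches the import_path configured earlier):

# Upload the benchmark dataset so Amazon FSx for Lustre can preload it.
aws s3 cp f1_racecar_140m.tar.gz s3://<Your-S3-Bucket>/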

 

Performance and cost considerations

Now I will show benchmark results (in terms of rating and scaling efficiency) and cost per job (only EC2 instance costs are considered). The following graph shows the scaling curve of C5n.18xlarge + EFA vs. C5.18xlarge vs. the ideal scalability.

 

The Formula-1 Race Car used for this benchmark is a 140-M cell mesh. The range of 70k-100k cells per core optimizes cost for performance. Improvement in turnaround time continues up to 40,000 cells per core with an acceptable cost for the performance. C5n.18xlarge + EFA shows ~89% scaling efficiency at 3,024 cores. This is a great improvement compared to the C5.18xlarge scaling (48% at 3,024 cores). In both cases, I ran with hyper-threading disabled, up to 84 instances in total.

Scaling vs number of cores graph

ANSYS has published some results of this benchmark here. The plot below shows the "Rating" of a Cray XC50 and C5n.18xlarge + EFA. In ANSYS' own words, the rating is defined as: "the primary metric used to report performance results of the Fluent Benchmarks. It is defined as the number of benchmarks that can be run on a given machine (in sequence) in a 24 hour period. It is computed by dividing the number of seconds in a day (86,400 seconds) by the number of seconds required to run the benchmark. A higher rating means faster performance."

 

The plot below shows that C5n.18xlarge + EFA achieves a higher rating than the XC50 up to ~2,400 cores, and is on par with it up to ~3,800 cores.

In addition to the turnaround time improvements, EFA brings another primary advantage: cost reduction. At the moment, C5n.18xlarge costs 27% more than C5.18xlarge (EFA is available at no additional cost). This price difference reflects the 4x higher network performance (100 Gbps vs. 25 Gbps) and the 33% larger memory footprint (192 GiB vs. 144 GiB). The following chart shows the cost comparison between C5.18xlarge and C5n.18xlarge + EFA as I scale out the ANSYS benchmark run.

 

cost per run vs number of cores

Please note that the chart above shows the cost per job using the On-Demand (OD) price in US East (N. Virginia). For short jobs that last minutes or even a few hours, you may want to consider the EC2 Spot price instead. Spot Instances offer spare EC2 capacity at steep discounts; at the time of writing, the C5n.18xlarge Spot price in N. Virginia is 70% lower than the On-Demand price, a significant reduction.
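
If you want to check the current Spot price before submitting a large run, a quick query such as the following works; the region and instance type are just examples:

# Show the five most recent Linux Spot prices for c5n.18xlarge in N. Virginia
aws ec2 describe-spot-price-history --region us-east-1 --instance-types c5n.18xlarge --product-descriptions "Linux/UNIX" --max-items 5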

 

Conclusions

This blog post reviewed best practices for running ANSYS Fluent with EFA, and walked through the performance and cost benefits of EFA running a 140-million-cell ANSYS Fluent benchmark. Computational Fluid Dynamics and tightly coupled workloads involve an iterative process of tuning, refining, testing, and benchmarking. Many variables can affect the performance of these workloads, so the AWS HPC team is committed to documenting best practices for MPI workloads on AWS.

I would also love to hear from you. Please let us know about other applications you would like us to test and features you would like to request.

re:Invent 2019: Introducing the Amazon Builders’ Library (Part I)

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/reinvent-2019-introducing-the-amazon-builders-library-part-i/

Today, I’m going to tell you about a new site we launched at re:Invent, the Amazon Builders’ Library, a collection of living articles covering topics across architecture, software delivery, and operations. You get to peek under the hood of how Amazon architects, releases, and operates the software underpinning Amazon.com and AWS.

Want to know how Amazon.com does what it does? This is for you. In this two-part series (the next one coming December 23), I’ll highlight some of the best architecture articles written by Amazon’s senior technical leaders and engineers.

Avoiding insurmountable queue backlogs


In queueing theory, the behavior of queues when they are short is relatively uninteresting. After all, when a queue is short, everyone is happy. It’s only when the queue is backlogged, when the line to an event goes out the door and around the corner, that people start thinking about throughput and prioritization.

In this article, I discuss strategies we use at Amazon to deal with queue backlog scenarios – design approaches we take to drain queues quickly and to prioritize workloads. Most importantly, I describe how to prevent queue backlogs from building up in the first place. In the first half, I describe scenarios that lead to backlogs, and in the second half, I describe many approaches used at Amazon to avoid backlogs or deal with them gracefully.

Read the full article by David Yanacek – Principal Engineer

Timeouts, retries, and backoff with jitter


Whenever one service or system calls another, failures can happen. These failures can come from a variety of factors: servers, networks, load balancers, software, operating systems, or even mistakes from system operators. We design our systems to reduce the probability of failure, but it is impossible to build systems that never fail. So at Amazon, we design our systems to tolerate and reduce the probability of failure, and to avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff.

Read the full article by Marc Brooker, Senior Principal Engineer

Challenges with distributed systems


The moment we added our second server, distributed systems became the way of life at Amazon. When I started at Amazon in 1999, we had so few servers that we could give some of them recognizable names like “fishy” or “online-01”. However, even in 1999, distributed computing was not easy. Then as now, challenges with distributed systems involved latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos. As the systems quickly grew larger and more distributed, what had been theoretical edge cases turned into regular occurrences.

Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s not always possible to know whether something failed.

Read the full article by Jacob Gabrielson, Senior Principal Engineer

Static stability using Availability Zones


At Amazon, the services we build must meet extremely high availability targets. This means that we need to think carefully about the dependencies that our systems take. We design our systems to stay resilient even when those dependencies are impaired. In this article, we’ll define a pattern that we use called static stability to achieve this level of resilience. We’ll show you how we apply this concept to Availability Zones, a key infrastructure building block in AWS and therefore a bedrock dependency on which all of our services are built.

Read the full article by Becky Weiss, Senior Principal Engineer, and Mike Furr, Principal Engineer

Check back in two weeks to read about some other architecture-based expert articles that let you in on how Amazon does what it does.

Amazon EC2 Update – Inf1 Instances with AWS Inferentia Chips for High Performance Cost-Effective Inferencing

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/

Our customers are taking to Machine Learning in a big way. They are running many different types of workloads, including object detection, speech recognition, natural language processing, personalization, and fraud detection. When running on large-scale production workloads, it is essential that they can perform inferencing as quickly and as cost-effectively as possible. According to what they have told us, inferencing can account for up to 90% of the cost of their machine learning work.

New Inf1 Instances
Today we are launching Inf1 instances in four sizes. These instances are powered by AWS Inferentia chips, and are designed to provide you with fast, low-latency inferencing.

AWS Inferentia chips are designed to accelerate the inferencing process. Each chip can deliver the following performance:

  • 64 teraOPS on 16-bit floating point (FP16 and BF16) and mixed-precision data.
  • 128 teraOPS on 8-bit integer (INT8) data.

The chips also include a high-speed interconnect, and lots of memory. With 16 chips on the largest instance, your new and existing TensorFlow, PyTorch, and MxNet inferencing workloads can benefit from over 2 petaOPS of inferencing power. When compared to the G4 instances, the Inf1 instances offer up to 3x the inferencing throughput, and up to 40% lower cost per inference.

Here are the sizes and specs:

Instance Name | Inferentia Chips | vCPUs | RAM | EBS Bandwidth | Network Bandwidth
inf1.xlarge | 1 | 4 | 8 GiB | Up to 3.5 Gbps | Up to 25 Gbps
inf1.2xlarge | 1 | 8 | 16 GiB | Up to 3.5 Gbps | Up to 25 Gbps
inf1.6xlarge | 4 | 24 | 48 GiB | 3.5 Gbps | 25 Gbps
inf1.24xlarge | 16 | 96 | 192 GiB | 14 Gbps | 100 Gbps

The instances make use of custom Second Generation Intel® Xeon® Scalable (Cascade Lake) processors, and are available in On-Demand, Spot, and Reserved Instance form, or as part of a Savings Plan, in the US East (N. Virginia) and US West (Oregon) Regions. You can launch the instances directly, and they will also be available soon through Amazon SageMaker, Amazon ECS, and Amazon Elastic Kubernetes Service.

Using Inf1 Instances
Amazon Deep Learning AMIs have been updated and contain versions of TensorFlow and MxNet that have been optimized for use in Inf1 instances, with PyTorch coming very soon. The AMIs contain the new AWS Neuron SDK, which contains commands to compile, optimize, and execute your ML models on the Inferentia chip. You can also include the SDK in your own AMIs and images.

You can build and train your model on a GPU instance such as a P3 or P3dn, and then move it to an Inf1 instance for production use. You can use a model natively trained in FP16, or you can use models that have been trained to 32 bits of precision and have AWS Neuron automatically convert them to BF16 form. Large models, such as those for language translation or natural language processing, can be split across multiple Inferentia chips in order to reduce latency.

The AWS Neuron SDK also allows you to assign models to Neuron Compute Groups, and to run them in parallel. This allows you to maximize hardware utilization and to use multiple models as part of Neuron Core Pipeline mode, taking advantage of the large on-chip cache on each Inferentia chip. Be sure to read the AWS Neuron SDK Tutorials to learn more!

Jeff;

 

AWS Now Available from a Local Zone in Los Angeles

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-now-available-from-a-local-zone-in-los-angeles/

AWS customers are always asking for more features, more bandwidth, more compute power, and more memory, while also asking for lower latency and lower prices. We do our best to meet these competing demands: we launch new EC2 instance types, EBS volume types, and S3 storage classes at a rapid pace, and we also reduce prices regularly.

AWS in Los Angeles
Today we are launching a Local Zone in Los Angeles, California. The Local Zone is a new type of AWS infrastructure deployment that brings select AWS services very close to a particular geographic area. This Local Zone is designed to provide very low latency (single-digit milliseconds) to applications that are accessed from Los Angeles and other locations in Southern California. It will be of particular interest to highly demanding, latency-sensitive applications. This includes:

Media & Entertainment – Gaming, 3D modeling & rendering, video processing (including real-time color correction), video streaming, and media production pipelines.

Electronic Design Automation – Interactive design & layout, simulation, and verification.

Ad-Tech – Rapid decision making & ad serving.

Machine Learning – Fast, continuous model training; high-performance low-latency inferencing.

All About Local Zones
The new Local Zone in Los Angeles is a logical part of the US West (Oregon) Region (which I will refer to as the parent region), and has some unique and interesting characteristics:

Naming – The Local Zone can be accessed programmatically as us-west-2-lax-1a. All API, CLI, and Console access takes place through the us-west-2 API endpoint and the US West (Oregon) Console.

Opt-In – You will need to opt in to the Local Zone in order to use it. After opting in, you can create a new VPC subnet in the Local Zone, taking advantage of all relevant VPC features including Security Groups, Network ACLs, and Route Tables. You can target the Local Zone when you launch EC2 instances and other resources, or you can create a default subnet in the VPC and have it happen automatically.

Networking – The Local Zone in Los Angeles is connected to US West (Oregon) over Amazon’s private backbone network. Connections to the public internet take place across an Internet Gateway, giving you local ingress and egress to reduce latency. Elastic IP Addresses can be shared by a group of Local Zones in a particular geographic location, but they do not move between a Local Zone and the parent region. The Local Zone also supports AWS Direct Connect, giving you the opportunity to route your traffic over a private network connection.

Services – We are launching with support for seven EC2 instance types (T3, C5, M5, R5, R5d, I3en, and G4), two EBS volume types (io1 and gp2), Amazon FSx for Windows File Server, Amazon FSx for Lustre, Application Load Balancer, and Amazon Virtual Private Cloud. Single-Zone RDS is on the near-term roadmap, and other services will come later based on customer demand. Applications running in a Local Zone can also make use of services in the parent region.

Parent Region – As I mentioned earlier, the new Local Zone is a logical extension of the US West (Oregon) region, and is managed by the “control plane” in the region. API calls, CLI commands, and the AWS Management Console should use “us-west-2” or US West (Oregon).

AWS – Other parts of AWS will continue to work as expected after you start to use this Local Zone. Your IAM resources, CloudFormation templates, and Organizations are still relevant and applicable, as are your tools and (perhaps most important) your investment in AWS training.

Pricing & Billing – Instances and other AWS resources in Local Zones will have different prices than in the parent region. Billing reports will include a prefix that is specific to a group of Local Zones that share a physical location. EC2 instances are available in On Demand & Spot form, and you can also purchase Savings Plans.

Using a Local Zone
The first Local Zone is available today, and you can request access here:

In early 2020, you will be able to opt in using the console, CLI, or API.

After opting in, I can list my AZs and see that the Local Zone is included:
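
Once CLI opt-in is available, a sketch like the following should show the Local Zone alongside the regular Availability Zones; the zone group name below is an assumption based on the zone naming described in this post, and it requires a recent AWS CLI version:

# Opt in to the Local Zone group (group name assumed from the zone naming convention)
aws ec2 modify-availability-zone-group --region us-west-2 --group-name us-west-2-lax-1 --opt-in-status opted-in

# List all zones, including Local Zones, in the parent region
aws ec2 describe-availability-zones --region us-west-2 --all-availability-zones --query "AvailabilityZones[].ZoneName"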

Then I create a new VPC subnet for the Local Zone. This gives me transparent, seamless connectivity between the parent zone in Oregon and the Local Zone in Los Angeles, all within the VPC:
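
A minimal CLI sketch of that subnet creation, assuming an existing VPC in us-west-2 and a free CIDR block (both values are placeholders):

# Create a subnet in the Los Angeles Local Zone within an existing VPC
aws ec2 create-subnet --region us-west-2 --vpc-id <your-vpc-id> --cidr-block 10.0.128.0/20 --availability-zone us-west-2-lax-1a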

I can create EBS volumes:

They are, as usual, ready within seconds:

I can also see and use the Local Zone from within the AWS Management Console:

I can also use the AWS APIs, CloudFormation templates, and so forth.

Thinking Ahead
Local Zones give you even more architectural flexibility. You can think big, and you can think different! You now have the components, tools, and services at your fingertips to build applications that make use of any conceivable combination of legacy on-premises resources, modern on-premises cloud resources via AWS Outposts, resources in a Local Zone, and resources in one or more AWS regions.

In the fullness of time (as Andy Jassy often says), there could very well be more than one Local Zone in any given geographic area. In 2020, we will open a second one in Los Angeles (us-west-2-lax-1b), and are giving consideration to other locations. We would love to get your advice on locations, so feel free to leave me a comment or two!

Now Available
The Local Zone in Los Angeles is available now and you can start using it today. Learn more about Local Zones.

Jeff;

 

Coming Soon – Graviton2-Powered General Purpose, Compute-Optimized, & Memory-Optimized EC2 Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/coming-soon-graviton2-powered-general-purpose-compute-optimized-memory-optimized-ec2-instances/

We launched the first generation (A1) of Arm-based, Graviton-powered EC2 instances at re:Invent 2018. Since that launch, thousands of our customers have used them to run many different types of scale-out workloads including containerized microservices, web servers, and data/log processing.

The Operating System Vendors (OSV) and Independent Software Vendor (ISV) communities have been quick to embrace the Arm architecture and the A1 instances. You have your pick of multiple Linux & Unix distributions including Amazon Linux 2, Ubuntu, Red Hat, SUSE, Fedora, Debian, and FreeBSD:

You can also choose between three container services (Docker, Amazon ECS, and Amazon Elastic Kubernetes Service), multiple system agents, and lots of developer tools (AWS Developer Tools, Jenkins, and more).

The feedback on these instances has been strong and positive, and our customers have told us that they are ready to use Arm-based servers on their more demanding compute-heavy and memory-intensive workloads.

Graviton2
Today I would like to give you a sneak peek at the next generation of Arm-based EC2 instances. These instances are built on the AWS Nitro System and will be powered by the new Graviton2 processor. This is a custom AWS design that is built using a 7 nm (nanometer) manufacturing process. It is based on 64-bit Arm Neoverse cores, and can deliver up to 7x the performance of the A1 instances, including twice the floating point performance. Additional memory channels and double-sized per-core caches speed memory access by up to 5x.

All of these performance enhancements come together to give these new instances a significant performance benefit over the 5th generation (M5, C5, R5) of EC2 instances. Our initial benchmarks show the following per-vCPU performance improvements over the M5 instances:

  • SPECjvm® 2008: +43% (estimated)
  • SPEC CPU® 2017 integer: +44% (estimated)
  • SPEC CPU 2017 floating point: +24% (estimated)
  • HTTPS load balancing with Nginx: +24%
  • Memcached: +43% performance, at lower latency
  • X.264 video encoding: +26%
  • EDA simulation with Cadence Xcellium: +54%

Based on these results, we are planning to use these instances to power Amazon EMR, Elastic Load Balancing, Amazon ElastiCache, and other AWS services.

The new instances raise the already-high bar on AWS security. Building on the existing capabilities of the AWS Nitro System, memory on the instances is encrypted with 256-bit keys that are generated at boot time, and which never leave the server.

We are working on three types of Graviton2-powered EC2 instances (the d suffix indicates NVMe local storage):

General Purpose (M6g and M6gd) – 1-64 vCPUs and up to 256 GiB of memory.

Compute-Optimized (C6g and C6gd) – 1-64 vCPUs and up to 128 GiB of memory.

Memory-Optimized (R6g and R6gd) – 1-64 vCPUs and up to 512 GiB of memory.

The instances will have up to 25 Gbps of network bandwidth, 18 Gbps of EBS-Optimized bandwidth, and will also be available in bare metal form. I will have more information to share with you in 2020.

M6g Preview
We are now running a preview of the M6g instances for testing on non-production workloads; if you are interested, please contact us.

Jeff;

AWS Load Balancer Update – Lots of New Features for You!

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-load-balancer-update-lots-of-new-features-for-you/

The AWS Application Load Balancer (ALB) and Network Load Balancer (NLB) are important parts of any highly available and scalable system. Today I am happy to share a healthy list of new features for ALB and NLB, all driven by customer requests.

Here’s what I have:

  • Weighted Target Groups for ALB
  • Least Outstanding Requests for ALB
  • Subnet Expansion for NLB
  • Private IP Address Selection for Internal NLB
  • Shared VPC Support for NLB

All of these features are available now and you can start using them today!

It’s time for a closer look…

Weighted Target Groups for ALB
You can now use traffic weights for your ALB target groups; this will be very helpful for blue/green deployments, canary deployments, and hybrid migration/burst scenarios. You can register multiple target groups with any of the forward actions in your ALB routing rules, and associate a weight (0-999) with each one. Here’s a simple last-chance rule that sends 99% of my traffic to tg1 and the remaining 1% to tg2:
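
The same weighted forward action can also be configured from the CLI as a listener's default (last-chance) action; the following is a sketch, and the ARNs are placeholders:

# Send 99% of traffic to tg1 and 1% to tg2 via a weighted forward action
aws elbv2 modify-listener --listener-arn <listener-arn> --default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"<tg1-arn>","Weight":99},{"TargetGroupArn":"<tg2-arn>","Weight":1}]}}]'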

You can use this feature in conjunction with group-level target stickiness in order to maintain a consistent customer experience for a specified duration:

To learn more, read about Listeners for Your Load Balancers.

Least Outstanding Requests for ALB
You can now balance requests across targets based on the target with the lowest number of outstanding requests. This is especially useful for workloads with varied request sizes, target groups with containers & other targets that change frequently, and targets with varied levels of processing power, including those with a mix of instance types in a single auto scaling group. You can enable this new load balancing option by editing the attributes of an existing target group:

Enabling this option will disable any slow start; to learn more, read about ALB Routing Algorithms.
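
From the CLI, the same attribute change looks roughly like this; the target group ARN is a placeholder:

# Switch the target group's routing algorithm to least outstanding requests
aws elbv2 modify-target-group-attributes --target-group-arn <tg-arn> --attributes Key=load_balancing.algorithm.type,Value=least_outstanding_requests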

Subnet Expansion Support for NLB
You now have the flexibility to add additional subnets to an existing Network Load Balancer. This gives you more scaling options, and allows you to expand into newly opened Availability Zones while maintaining high availability. Select the NLB, and click Edit subnets in the Actions menu:

Then choose one or more subnets to add:

This is a good time to talk about multiple availability zones and redundancy. Since you are adding a new subnet, you want to make sure that you either have targets in it, or have cross-zone load balancing enabled.
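
The console flow above has a CLI equivalent; note that set-subnets replaces the enabled subnet set, so pass the existing subnets plus the new one (the subnet IDs and load balancer ARN below are placeholders):

# Enable an additional subnet on an existing NLB by specifying the full subnet list
aws elbv2 set-subnets --load-balancer-arn <nlb-arn> --subnets subnet-existing1 subnet-existing2 subnet-new3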

Private IP Address Selection for Internal NLB
You can now select the private IPv4 address that is used for your internal-facing Network Load Balancer, on a per-subnet basis. This gives you additional control over network addressing, and removes the need to manually ascertain addresses and configure them into clients that do not support DNS-based routing:

You can also choose your own private IP addresses when you add additional subnets to an existing NLB.
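
As a sketch, creating an internal NLB with specific private IPv4 addresses per subnet looks like the following; all IDs and addresses are placeholders:

# Create an internal NLB and pin the private IPv4 address used in each subnet
aws elbv2 create-load-balancer --name my-internal-nlb --type network --scheme internal --subnet-mappings SubnetId=subnet-aaaa1111,PrivateIPv4Address=10.0.1.25 SubnetId=subnet-bbbb2222,PrivateIPv4Address=10.0.2.25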

Shared VPC Support for NLB
You can now create NLBs in shared VPCs. Using NLBs with VPC sharing, you can route traffic across subnets in VPCs owned by a centrally managed account in the same AWS Organization. You can also use NLBs to create an AWS PrivateLink service, which will enable users to privately access your services in the shared subnets from other VPCs or on-premises networks, without using public IPs or requiring the traffic to traverse the internet.

Jeff;

 

Running Cost-effective queue workers with Amazon SQS and Amazon EC2 Spot Instances

Post Syndicated from peven original https://aws.amazon.com/blogs/compute/running-cost-effective-queue-workers-with-amazon-sqs-and-amazon-ec2-spot-instances/

This post is contributed by Ran Sheinberg | Sr. Solutions Architect, EC2 Spot & Chad Schmutzer | Principal Developer Advocate, EC2 Spot | Twitter: @schmutze

Introduction

Amazon Simple Queue Service (SQS) is used by customers to run decoupled workloads in the AWS Cloud as a best practice, in order to increase their applications’ resilience. You can use a worker tier to do background processing of images, audio, documents and so on, as well as offload long-running processes from the web tier. This blog post covers the benefits of pairing Amazon SQS and Spot Instances to maximize cost savings in the worker tier, and a customer success story.

Solution Overview

Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a common best practice to use Amazon SQS with decoupled applications. Amazon SQS increases application resilience by decoupling the direct communication between the frontend application and the worker tier that does data processing. If a worker node fails, the jobs that were running on that node return to the Amazon SQS queue for a different node to pick up.

Both the frontend and worker tier can run on Spot Instances, which offer spare compute capacity at steep discounts compared to On-Demand Instances. Spot Instances optimize your costs on the AWS Cloud and scale your application’s throughput up to 10 times for the same budget. Spot Instances can be interrupted with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. These can include analytics, containerized workloads, high performance computing (HPC), stateless web servers, rendering, CI/CD, and queue worker nodes—which is the focus of this post.

Worker tiers of a decoupled application are typically fault-tolerant, so they are prime candidates for running on interruptible capacity. Pairing Amazon SQS with Spot Instances allows for more robust, cost-optimized applications.

By using EC2 Auto Scaling groups with multiple instance types that you configured as suitable for your application (for example, m4.xlarge, m5.xlarge, c5.xlarge, and c4.xlarge, in multiple Availability Zones), you can spread the worker tier’s compute capacity across many Spot capacity pools (a combination of instance type and Availability Zone). This increases the chance of achieving the scale that’s required for the worker tier to ingest messages from the queue, and of keeping that scale when Spot Instance interruptions occur, while selecting the lowest-priced Spot Instances in each availability zone.

You can also choose the capacity-optimized allocation strategy for the Spot Instances in your Auto Scaling group. This strategy automatically selects instances that have a lower chance of interruption, which decreases the chances of restarting jobs due to Spot interruptions. When Spot Instances are interrupted, your Auto Scaling group automatically replenishes the capacity from a different Spot capacity pool in order to achieve your desired capacity. Read the blog post “Introducing the capacity-optimized allocation strategy for Amazon EC2 Spot Instances” for more details on how to choose the suitable allocation strategy.

We focus on three main points in this blog:

  1. Best practices for using Spot Instances with Amazon SQS
  2. A customer example that uses these components
  3. Example solution that can help you get started quickly

Application of Amazon SQS with Spot Instances

Amazon SQS eliminates the complexity of managing and operating message-oriented middleware. Using Amazon SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available. Amazon SQS is a fully managed service which allows you to set up a queue in seconds. It also allows you to use your preferred SDK to start writing and reading to and from the queue within minutes.

In the following example, we describe an AWS architecture that brings together the Amazon SQS queue and an EC2 Auto Scaling group running Spot Instances. The architecture is used for decoupling the worker tier from the web tier by using Amazon SQS. The example uses the Protect feature (which we will explain later in this post) to ensure that an instance currently processing a job does not get terminated by the Auto Scaling group when it detects that a scale-in activity is required due to a Dynamic Scaling Policy.

AWS reference architecture used for decoupling the worker tier from the web tier by using Amazon SQS

Customer Example: How Trax Retail uses Auto Scaling groups with Spot Instances in their Amazon SQS application

Trax decided to run its queue worker tier exclusively on Spot Instances due to the fault-tolerant nature of its architecture and for cost-optimization purposes. The company digitizes the physical world of retail using computer vision. Its ‘Trax Factory’ transforms individual shelves into data and insights about retail store conditions.

Built using asynchronous event-driven architecture, Trax Factory is a cluster of microservices in which the completion of one service triggers the activation of another service. The worker tier uses Auto Scaling groups with dynamic scaling policies to increase and decrease the number of worker nodes in the worker tier.

You can create a Dynamic Scaling Policy by doing the following:

  1. Observe an Amazon CloudWatch metric. Watch the metric for the current number of messages in the Amazon SQS queue (ApproximateNumberOfMessagesVisible).
  2. Create a CloudWatch alarm. This alarm should be based on that metric you created in the prior step.
  3. Use your CloudWatch alarm in a Dynamic Scaling Policy. Use this policy to increase and decrease the number of EC2 instances in the Auto Scaling group, as shown in the sketch after this list.
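
The following is a minimal CLI sketch of these steps, assuming an existing Auto Scaling group and queue; the names, threshold, and policy ARN are placeholders:

# 1) A simple scaling policy that adds two workers when triggered
aws autoscaling put-scaling-policy --auto-scaling-group-name worker-asg --policy-name scale-out-on-queue-depth --adjustment-type ChangeInCapacity --scaling-adjustment 2

# 2) An alarm on queue depth that invokes the policy (use the PolicyARN returned by the previous command)
aws cloudwatch put-metric-alarm --alarm-name worker-queue-backlog --namespace AWS/SQS --metric-name ApproximateNumberOfMessagesVisible --dimensions Name=QueueName,Value=my-worker-queue --statistic Average --period 300 --evaluation-periods 1 --threshold 100 --comparison-operator GreaterThanThreshold --alarm-actions <policy-arn>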

In Trax’s case, because the number of messages in the queue is highly variable, they enhanced this approach to minimize the time it takes to scale: they built a service that calls the SQS API to check the current number of messages in the queue more frequently, instead of waiting for the 5-minute metric refresh interval in CloudWatch.

Trax ensures that its applications are always scaled to meet demand by leveraging the inherent elasticity of Amazon EC2 instances. This elasticity ensures that end users are never affected and that service-level agreements (SLAs) are never violated.

With a Dynamic Scaling Policy, the Auto Scaling group can detect when the number of messages in the queue has decreased, so that it can initiate a scale-in activity. The Auto Scaling group uses its configured termination policy for selecting the instances to be terminated. However, this policy poses the risk that the Auto Scaling group might select an instance for termination while that instance is currently processing an image. That instance’s work would be lost (although the image would eventually be processed by reappearing in the queue and getting picked up by another worker node).

To decrease this risk, you can use Auto Scaling groups instance protection. This means that every time an instance fetches a job from the queue, it also sends an API call to EC2 to protect itself from scale-in. The Auto Scaling group does not select the protected, working instance for termination until the instance finishes processing the job and calls the API to remove the protection.

Handling Spot Instance interruptions

This instance-protection solution ensures that no work is lost during scale-in activities. However, protecting from scale-in does not work when an instance is marked for termination due to Spot Instance interruptions. These interruptions occur when there’s increased demand for On-Demand Instances in the same capacity pool (a combination of an instance type in an Availability Zone).

Applications can minimize the impact of a Spot Instance interruption. To do so, an application catches the two-minute interruption notification (available in the instance’s metadata), and instructs itself to stop fetching jobs from the queue. If there’s an image still being processed when the two minutes expire and the instance is terminated, the application does not delete the message from the queue after finishing the process. Instead, the message simply becomes visible again for another instance to pick up and process after the Amazon SQS visibility timeout expires.

Alternatively, you can release any ongoing job back to the queue upon receiving a Spot Instance interruption notification by setting the visibility timeout of the specific message to 0. This timeout potentially decreases the total time it takes to process the message.
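
A minimal worker-side sketch of both techniques follows, assuming the queue URL and the receipt handle of the in-flight message are already held in environment variables; QUEUE_URL and RECEIPT_HANDLE are assumptions for illustration, not part of any SDK:

# Poll the instance metadata for a Spot interruption notice; HTTP 404 means none is pending
if curl -s -o /dev/null -w "%{http_code}" http://169.254.169.254/latest/meta-data/spot/instance-action | grep -q 200; then
  # Stop fetching new jobs, then make the in-flight message immediately visible to other workers
  aws sqs change-message-visibility --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT_HANDLE" --visibility-timeout 0
fi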

Testing the solution

If you’re not currently using Spot Instances in your queue worker tier, we suggest testing the approach described in this post.

For that purpose, we built a simple solution to demonstrate the capabilities mentioned in this post, using an AWS CloudFormation template. The stack includes an Amazon Simple Storage Service (S3) bucket with a CloudWatch trigger to push notifications to an SQS queue after an image is uploaded to the Amazon S3 bucket. Once the message is in the queue, it is picked up by the application running on the EC2 instances in the Auto Scaling group. Then, the image is converted to PDF, and the instance is protected from scale-in for as long as it has an active processing job.

To see the solution in action, deploy the CloudFormation template. Then upload an image to the Amazon S3 bucket. In the Auto Scaling Groups console, check the instance protection status on the Instances tab. The protection status is shown in the following screenshot.

instance protection status in console

You can also see the application logs using CloudWatch Logs:

/usr/local/bin/convert-worker.sh: Found 1 messages in https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB

/usr/local/bin/convert-worker.sh: Found work to convert. Details: INPUT=Capture1.PNG, FNAME=capture1, FEXT=png

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --protected-from-scale-in

/usr/local/bin/convert-worker.sh: Convert done. Copying to S3 and cleaning up

/usr/local/bin/convert-worker.sh: Running: aws s3 cp /tmp/capture1.pdf s3://qtest-s3bucket-18fdpm2j17wxx

/usr/local/bin/convert-worker.sh: Running: aws sqs --output=json delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/qtest-sqsQueue-1CL0NYLMX64OB --receipt-handle

/usr/local/bin/convert-worker.sh: Running: aws autoscaling set-instance-protection --instance-ids i-0a184c5ae289b2990 --auto-scaling-group-name qtest-autoScalingGroup-QTGZX5N70POL --no-protected-from-scale-in

Conclusion

This post helps you architect fault tolerant worker tiers in a cost optimized way. If your queue worker tiers are fault tolerant and use the built-in Amazon SQS features, you can increase your application’s resilience and take advantage of Spot Instances to save up to 90% on compute costs.

In this post, we emphasized several best practices to help get you started saving money using Amazon SQS and Spot Instances. The main best practices are:

  • Diversifying your Spot Instances using Auto Scaling groups, and selecting the right Spot allocation strategy
  • Protecting instances from scale-in activities while they process jobs
  • Using the Spot interruption notification so that the application stops polling the queue for new jobs before the instance is terminated

We hope you found this post useful. If you’re not using Spot Instances in your queue worker tier, we suggest testing the approach described here. Finally, we would like to thank the Trax team for sharing its architecture and best practices. If you want to learn more, watch the “This is my architecture” video featuring Trax and their solution.

We’d love your feedback—please comment and let me know what you think.


About the authors

 

Ran Sheinberg is a specialist solutions architect for EC2 Spot Instances with Amazon Web Services. He works with AWS customers on cost optimizing their compute spend by utilizing Spot Instances across different types of workloads: stateless web applications, queue workers, containerized workloads, analytics, HPC and others.

 

 

 

 

As a Principal Developer Advocate for EC2 Spot at AWS, Chad’s job is to make sure our customers are saving at scale by using EC2 Spot Instances to take advantage of the most cost-effective way to purchase compute capacity. Follow him on Twitter here! @schmutze

 

New – Amazon EBS Fast Snapshot Restore (FSR)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-ebs-fast-snapshot-restore-fsr/

Amazon Elastic Block Store (EBS) has been around for more than a decade and is a fundamental AWS building block. You can use it to create persistent storage volumes that can store up to 16 TiB and supply up to 64,000 IOPS (Input/Output Operations per Second). You can choose between four types of volumes, making the choice that best addresses your data transfer throughput, IOPS, and pricing requirements. If your requirements change, you can modify the type of a volume, expand it, or change the performance while the volume remains online and active. EBS snapshots allow you to capture the state of a volume for backup, disaster recovery, and other purposes. Once created, a snapshot can be used to create a fresh EBS volume. Snapshots are stored in Amazon Simple Storage Service (S3) for high durability.

Our ever-creative customers are using EBS snapshots in many interesting ways. In addition to the backup and disaster recovery use cases that I just mentioned, they are using snapshots to quickly create analytical or test environments using data drawn from production, and to support Virtual Desktop Infrastructure (VDI) environments. As you probably know, the AMIs (Amazon Machine Images) that you use to launch EC2 instances are also stored as one or more snapshots.

Fast Snapshot Restore
Today we are launching Fast Snapshot Restore (FSR) for EBS. You can enable it for new and existing snapshots on a per-AZ (Availability Zone) basis, and then create new EBS volumes that deliver their maximum performance and do not need to be initialized.

This performance enhancement will allow you to build AWS-based systems that are even faster and more responsive than before. Faster boot times will speed up your VDI environments and allow your Auto Scaling Groups to come online and start processing traffic more quickly, even if you use large and/or custom AMIs. I am sure that you will soon dream up new applications that can take advantage of this new level of speed and predictability.

Fast Snapshot Restore can be enabled on a snapshot even while the snapshot is being created. If you create nightly backup snapshots, enabling them for FSR will allow you to do fast restores the following day regardless of the size of the volume or the snapshot.

Enabling & Using Fast Snapshot Restore
I can get started in minutes! I open the EC2 Console and find the first snapshot that I want to set up for fast restore:

I select the snapshot and choose Manage Fast Snapshot Restore from the Actions menu:

Then I select the Availability Zones where I plan to create EBS volumes, and click Save:

After the settings are saved, I receive a confirmation:

The console shows me that my snapshot is being enabled for Fast Snapshot Restore:

The status progresses from enabling to optimizing, and then to enabled. Behind the scenes and with no extra effort on my part, the optimization process provisions extra resources to deliver the fast restores, proceeding at a rate of one TiB per hour. By contrast, non-optimized volumes retrieve data directly from the S3-stored snapshot on an incremental, on-demand basis.

Once the optimization is complete, I can create volumes from the snapshot in the usual way, confident that they will be ready in seconds and pre-initialized for full performance! Each FSR-enabled snapshot supports creation of up to 10 initialized volumes per hour per Availability Zone; additional volume creations will be non-initialized. As my needs change, I can enable Fast Snapshot Restore in additional Availability Zones and I can disable it in Zones where I had previously enabled it.

When Fast Snapshot Restore is enabled for a snapshot in a particular Availability Zone, a bucket-based credit system governs the acceleration process. Creating a volume consumes a credit; the credits refill over time, and the maximum number of credits is a function of the FSR-enabled snapshot size. Here are some guidelines:

  • A 100 GiB FSR-enabled snapshot will have a maximum credit balance of 10, and a fill rate of 10 credits per hour.
  • A 4 TiB FSR-enabled snapshot will have a maximum credit balance of 1, and a fill rate of 1 credit every 4 hours.

In other words, you can do 1 TiB of restores per hour for a given FSR-enabled snapshot within an AZ.

Things to Know
Here are some things to know about Fast Snapshot Restore:

Regions & AZs – Fast Snapshot Restore is available in all Availability Zones of the US East (N. Virginia), US West (Oregon), US West (N. California), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Sydney), and Asia Pacific (Tokyo) Regions.

Pricing – You pay $0.75 for each hour that Fast Snapshot Restore is enabled for a snapshot in a particular Availability Zone, pro-rated and with a minimum of one hour.

Monitoring – You can use the following per-minute CloudWatch metrics to track the state of the credit bucket for each FSR-enabled snapshot:

  • FastSnapshotRestoreCreditsBalance – The number of volume creation credits that are available.
  • FastSnapshotRestoreCreditsBucketSize – The maximum number of volume creation credits that can be accumulated.

CLI & Programmatic Access – You can use the enable-fast-snapshot-restores, describe-fast-snapshot-restores, and disable-fast-snapshot-restores commands to create and manage your accelerated snapshots from the command line. You can also use the EnableFastSnapshotRestores, DescribeFastSnapshotRestores, and DisableFastSnapshotRestores API functions from your application code.
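
As a sketch, enabling and checking FSR from the CLI looks roughly like this; the snapshot ID and Availability Zones are placeholders:

# Enable Fast Snapshot Restore for a snapshot in two Availability Zones
aws ec2 enable-fast-snapshot-restores --source-snapshot-ids snap-0123456789abcdef0 --availability-zones us-east-1a us-east-1b

# Check the current state (enabling, optimizing, enabled, ...)
aws ec2 describe-fast-snapshot-restores --filters Name=snapshot-id,Values=snap-0123456789abcdef0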

CloudWatch Events – You can use the EBS Fast Snapshot Restore State-change Notification event type to invoke Lambda functions or other targets when the state of a snapshot/AZ pair changes. Events are emitted on successful and unsuccessful transitions to the enabling, optimizing, enabled, disabling, and disabled states.

Data Lifecycle Manager – You can enable FSR on snapshots created by your DLM lifecycle policies, specify AZs, and specify the number of snapshots to be FSR-enabled. You can use an existing CloudFormation template to integrate FSR into your DLM policies (read about the AWS::DLM::LifecyclePolicy to learn more).

In the Works
We are launching with support for snapshots that you own. Over time, we intend to expand coverage and allow you to enable Fast Snapshot Restore for snapshots that you have been granted access to.

Available Now
Fast Snapshot Restore is available now and you can start using it today!

Jeff;

 

Add defense in depth against open firewalls, reverse proxies, and SSRF vulnerabilities with enhancements to the EC2 Instance Metadata Service

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/

Since it first launched over 10 years ago, the Amazon EC2 Instance Metadata Service (IMDS) has helped customers build secure and scalable applications. The IMDS solved a big security headache for cloud users by providing access to temporary, frequently rotated credentials, removing the need to hardcode or distribute sensitive credentials to instances manually or programmatically. Attached locally to every EC2 instance, the IMDS runs on a special “link local” IP address of 169.254.169.254, which means that only software running on the instance can access it. For applications with access to the IMDS, it makes available metadata about the instance, its network, and its storage. The IMDS also makes the AWS credentials available for any IAM role that is attached to the instance.

When you run applications in the cloud, application security is as critical as instance security; if the applications running on an instance have vulnerabilities or misconfigurations, there can be serious consequences. While application security plays an important role in a layered defense, AWS also constantly evaluates where to add layers, even within the instance, to minimize the damage that can result when these situations arise.

Today, AWS is making v2 of the EC2 Instance Metadata Service (IMDSv2) available. The existing instance metadata service (IMDSv1) is fully secure, and AWS will continue to support it. But IMDSv2 adds new “belt and suspenders” protections for four types of vulnerabilities that could be used to try to access the IMDS. These new protections go well beyond other types of mitigations, while working seamlessly with existing mitigations such as restricting IAM roles and using local firewall rules to restrict access to the IMDS. AWS is also making new versions of the AWS SDKs and CLIs available that support IMDSv2.

What’s new in IMDSv2

With IMDSv2, every request is now protected by session authentication. A session defines the beginning and end of a series of requests that software running on an EC2 instance uses to access the locally-stored EC2 instance metadata and credentials. The software starts a session with a simple HTTP PUT request to IMDSv2. IMDSv2 returns a secret token to the software running on the EC2 instance, which will use the token as a password to make requests to IMDSv2 for metadata and credentials. Unlike traditional passwords, you don’t need to worry about getting the token to the software, because the software gets it for itself with the PUT request. The token is never stored by IMDSv2 and can never be retrieved by subsequent calls, so a session and its token are effectively destroyed when the process using the token terminates. There’s no limit on the number of requests within a single session, and there’s no limit on the number of IMDSv2 sessions. Sessions can last up to six hours and, for added security, a session token can only be used directly from the EC2 instance where that session began.

For example, this curl recipe retrieves a session token that’s valid for the full six hours (21600 seconds) and then uses that token to access the EC2 instance’s profile metadata:


TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`

curl http://169.254.169.254/latest/meta-data/profile -H "X-aws-ec2-metadata-token: $TOKEN"

If you need to write code against the IMDSv2 directly, you can get more detail on the new scheme in the EC2 User Guide.

How these changes add defense in depth

IMDSv2’s changes are easy to use, and you’ll start using it automatically if you’re using the updated AWS SDKs and CLIs. These changes go beyond other types of mitigations to protect against misconfigured-open Web Application Firewalls, misconfigured-open reverse proxies, unpatched SSRF vulnerabilities, and misconfigured-open layer-3 firewalls and network address translation.

Protecting against open Web Application Firewalls

Some Web Application Firewall (WAF) services, such as AWS WAF, can’t be configured to act as open WAFs. However, some third-party WAFs can be misconfigured to allow attackers unauthorized access to the network behind the WAF, including the EC2 IMDS.

Many WAFs are designed to act invisibly, so that they can protect websites and applications without administrators having to change or reconfigure the applications that are behind the WAF. To be transparent, WAFs usually pass on all of the headers that come with a request, and do not add their own headers, such as the standard X-Forwarded-For header that other kinds of proxies add. In other words, applications behind a WAF get requests just as the requester sent them.

The AWS approach is to block open WAFs by using a type of request that open WAFs very rarely support, HTTP PUT requests. Although web services such as Amazon S3 use PUT requests for object storage, they’re an uncommon type of request for websites and browsers to use. Our analysis of third-party WAF products and open WAF misconfigurations found that the vast majority do not permit HTTP PUT requests. We’re using this PUT request to provide a new layer of defense that goes beyond any existing capabilities – we’ve architected the IMDSv2 service to require a PUT request at the beginning of a session, which will prevent open WAFs from being abused to access the IMDS in the vast majority of cases.

Protecting against open reverse proxies

As it happens, it’s also very rare for open reverse proxies to allow PUT requests, but IMDSv2 has another layer of defense against open reverse proxies. Reverse proxies, such as Apache httpd or Squid, can also be misconfigured to allow external requests that reach internal resources, but it’s still normal for these proxies to send an X-Forwarded-For HTTP header. That header itself is used to pass on the IP address of the original caller. IMDSv2 will also not issue session tokens to any caller with an X-Forwarded-For header, which is effective at blocking unauthorized access due to misconfigurations like an open reverse proxy.

Protecting against SSRF vulnerabilities

SSRF vulnerabilities allow attackers to make unauthorized requests from web applications. Since these requests come from the application itself, they can be used to access internal resources that the application has access to but that were not intended to be accessible to outsiders. SSRF vulnerabilities vary in their severity, and some are immune to other types of mitigations. For instance, blocking SSRFs through static headers in instance metadata requests is effective only when the vulnerability merely allows the attacker to control the URL that is being requested; however, AWS analysis found many SSRF vulnerabilities that allow attackers to set arbitrary headers because the SSRF vulnerability impacts the application’s own header processing.

IMDSv2’s combination of beginning a session with a PUT request, and then requiring the secret session token in other requests, is always strictly more effective than requiring only a static header. AWS analysis of real-world vulnerabilities found that this combination protects against the vast majority of SSRF vulnerabilities.

Protecting against open layer 3 firewalls and NATs

Last, there is a final layer of defense in IMDSv2 that is designed to protect EC2 instances that have been misconfigured as open routers, layer 3 firewalls, VPNs, tunnels, or NAT devices. With IMDSv2, the PUT response containing the secret token will, by default, not be able to travel outside the instance. This is accomplished by having the default Time To Live (TTL) on the low-level IP packets containing the secret token set to “1,” much lower than a typical value, such as “64.” Hardware and software that handle packets, including EC2 instances, subtract 1 from each packet’s TTL field whenever they pass it on. If the TTL gets to 0, the packet is discarded, and an error message is sent back to the sender. A packet with a TTL of “64” can therefore make sixty-four “hops” in a network before giving up, while a packet with a TTL of “1” can exist in just one. This feature allows legitimate traffic to get to an intended destination, but is designed to stop packets from endlessly running around in circles if there’s a loop in a network.

With IMDSv2, setting the TTL value to “1” means that requests from the EC2 instance itself will work because they’re returned to the caller (on the instance) before the subtraction occurs. But if the EC2 instance has been misconfigured as an open router, layer 3 firewall, VPN, tunnel, or NAT device, the response containing the token will have its TTL reduced to zero before leaving the instance, and the packet containing the response will be discarded on its way out of the instance, preventing transport to the attacker. The information simply won’t make it further than the EC2 instance itself, which means that an attacker won’t get the response back with the token, and with it the ability to access instance metadata, even if they’ve been successful at getting past all other defenses.

Making the transition

Both IMDSv1 and IMDSv2 will be available and enabled by default, and customers can choose which they will use. The IMDS can now be restricted to v2 only, or IMDS (v1 and v2) can also be disabled entirely. AWS recommends adopting v2 and restricting access to v2 only for added security. IMDSv1 remains available for customers who have tools and scripts using v1, and who are comfortable with the existing security posture of their instances.

A number of tools are available to make transitioning to v2 and disabling v1 seamless. Starting today, a new CloudWatch metric is available that provides visibility into the number of v1 calls that are being made on any given instance. Customers can use this metric to monitor how often v1 is still being accessed as Amazon Machine Images, the AWS SDKs, CLIs, cloud-init, and other software accessing the IMDS are updated, released, and upgraded. When you can see that an instance can be launched, activated, and used in service while the metric stays at zero, it is safe to require v2 of the IMDS and disable v1. For more information on transitioning to IMDSv2, see the user guide.
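
For example, once the metric stays at zero, requiring IMDSv2 on an instance is a single call; the instance ID below is a placeholder:

# Require session tokens (IMDSv2) and keep the metadata endpoint enabled
aws ec2 modify-instance-metadata-options --instance-id i-0123456789abcdef0 --http-tokens required --http-endpoint enabled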

Security can also be further enhanced while this transition is happening. AWS credentials provided by the IMDS now include an ec2:RoleDelivery IAM context key. Credentials provided by the older IMDSv1 have an ec2:RoleDelivery value of “1.0,” and credentials using the new scheme will have an ec2:RoleDelivery value of “2.0.” This context key makes it easy to enforce use of the new scheme on a service-by-service or resource-by-resource basis by using those context keys as conditions in IAM policies, resource policies, or AWS Organizations service control policies. For example, if all of the software accessing an S3 bucket has been upgraded to use IMDSv2, then that S3 bucket can be safely restricted to only allow access to role-account credentials that have the “2.0” value (or greater) for the context key. The effect is that credentials retrieved using IMDSv1 will be prevented from accessing the bucket. AWS CloudTrail is also being updated to record the new ec2:RoleDelivery parameters.
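
As a hedged sketch of that bucket example, a deny statement keyed on ec2:RoleDelivery might look like the following; the bucket name is a placeholder, and you should adapt the actions and resources to your own policy:

# Deny access when the role credentials were delivered by IMDSv1 (ec2:RoleDelivery < 2.0)
aws s3api put-bucket-policy --bucket <your-bucket> --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "RequireIMDSv2Credentials",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/*"],
    "Condition": {"NumericLessThan": {"ec2:RoleDelivery": "2.0"}}
  }]
}'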

Hear about IMDSv2 at re:Invent

Mark Ryland will be talking in more detail about IMDSv2, and the transition to it, at AWS re:Invent in December. We’ll update this post soon with a link to the session in the re:Invent catalog.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

In The Works – New AMD-Powered, Compute-Optimized EC2 Instances (C5a/C5ad)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/in-the-works-new-amd-powered-compute-optimized-ec2-instances-c5a-c5ad/

We’re getting ready to give you even more power and even more choices when it comes to EC2 instances.

We will soon launch C5a and C5ad instances powered by custom second-generation AMD EPYC “Rome” processors running at frequencies as high as 3.3 GHz. You will be able to use these compute-optimized instances to run your batch processing, distributed analytics, web applications and other compute-intensive workloads. Like the existing AMD-powered instances in the M, R and T families, the C5a and C5ad instances are built on the AWS Nitro System and give you an opportunity to balance your instance mix based on cost and performance.

The instances will be available in eight sizes and also in bare metal form, with up to 192 vCPUs and 384 GiB of memory. The C5ad instances will include up to 7.6 TiB of fast, local NVMe storage, making them perfect for video encoding, image manipulation, and other media processing workloads.

The bare metal instances (c5an.metal and c5adn.metal) will offer twice as much memory and double the vCPU count of comparable instances, making them some of the largest and most powerful compute-optimized instances yet. The bare metal variants will have access to 100 Gbps of network bandwidth and will be compatible with Elastic Fabric Adapter — perfect for your most demanding HPC workloads!

I’ll have more information soon, so stay tuned!

Jeff;

New – Savings Plans for AWS Compute Services

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-savings-plans-for-aws-compute-services/

I first wrote about EC2 Reserved Instances a decade ago! Since I wrote that post, our customers have saved billions of dollars by using Reserved Instances to commit to usage of a specific instance type and operating system within an AWS region.

Over the years we have enhanced the Reserved Instance model to make it easier for you to take advantage of the RI discount. This includes:

Regional Benefit – This enhancement gave you the ability to apply RIs across all Availability Zones in a region.

Convertible RIs – This enhancement allowed you to change the operating system or instance type at any time.

Instance Size Flexibility – This enhancement allowed your Regional RIs to apply to any instance size within a particular instance family.

The model, as it stands today, gives you discounts of up to 72%, but it does require you to coordinate your RI purchases and exchanges in order to ensure that you have an optimal mix that covers usage that might change over time.

New Savings Plans
Today we are launching Savings Plans, a new and flexible discount model that provides you with the same discounts as Reserved Instances, in exchange for a commitment to use a specific amount (measured in dollars per hour) of compute power over a one or three year period.

Every type of compute usage has an On Demand price and a (lower) Savings Plan price. After you commit to a specific amount of compute usage per hour, all usage up to that amount will be covered by the Savings Plan, and anything past it will be billed at the On Demand rate.
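As a rough illustration of that split, here is a simplified sketch (not AWS billing logic; it assumes a single uniform discount rate, whereas real Savings Plan rates vary by instance family, region, operating system, and so forth):

```python
# Toy model of one hour of compute billing under a Savings Plan commitment.
# Assumes one flat discount rate; real Savings Plan rates vary per usage type.
def hourly_charge(usage_on_demand, commitment, discount):
    """usage_on_demand: this hour's usage at On Demand prices (dollars).
    commitment: hourly Savings Plan commitment (dollars, at Savings Plan prices).
    discount: the Savings Plan discount, e.g. 0.30 for 30% off On Demand."""
    usage_at_sp_price = usage_on_demand * (1 - discount)
    covered = min(usage_at_sp_price, commitment)                # billed at Savings Plan prices
    overflow = (usage_at_sp_price - covered) / (1 - discount)   # billed at On Demand prices
    return covered + overflow

# $2.00/hour commitment, 30% discount, $3.50/hour of On Demand usage:
print(round(hourly_charge(3.50, 2.00, 0.30), 2))  # ~2.64 instead of 3.50
```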

If you own Reserved Instances, the Savings Plan applies to any On Demand usage that is not covered by the RIs. We will continue to sell RIs, but Savings Plans are more flexible and I think many of you will prefer them!

Savings Plans are available in two flavors:

Compute Savings Plans provide the most flexibility and help to reduce your costs by up to 66% (just like Convertible RIs). The plans automatically apply to any EC2 instance regardless of region, instance family, operating system, or tenancy, including those that are part of EMR, ECS, or EKS clusters, or launched by Fargate. For example, you can shift from C4 to C5 instances, move a workload from Dublin to London, or migrate from EC2 to Fargate, benefiting from Savings Plan prices along the way, without having to do anything.

EC2 Instance Savings Plans apply to a specific instance family within a region and provide the largest discount (up to 72%, just like Standard RIs). Just like with RIs, your Savings Plan covers usage of different sizes of the same instance type (such as a c5.4xlarge or c5.large) throughout a region. You can even switch from Windows to Linux while continuing to benefit, without having to make any changes to your Savings Plan.

Purchasing a Savings Plan
AWS Cost Explorer will help you to choose a Savings Plan, and will guide you through the purchase process. Since my own EC2 usage is fairly low, I used a test account that had more usage. I open AWS Cost Explorer, then click Recommendations within Savings Plans:

I choose my Recommendation options, and review the recommendations:

Cost Explorer recommends that I purchase $2.40 of hourly Savings Plan commitment, and projects that I will save 40% (nearly $1,200) per month, in comparison to On-Demand. The recommendation accounts for variable usage and temporary spikes, so it targets the steady-state capacity that is worth covering with a Savings Plan. In my case, the variable usage averages out to $0.04 per hour, which the recommendation suggests I leave as On-Demand.

I can see the recommended Savings Plans at the bottom of the page, select those that I want to purchase, and Add them to my cart:

When I am ready to proceed, I click View cart, review my purchases, and click Submit order to finalize them:

My Savings Plans become active right away. I can use the Cost Explorer’s Performance & Coverage reports to review my actual savings, and to verify that I own sufficient Savings Plans to deliver the desired amount of coverage.
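If you prefer to script this instead of clicking through the console, the Cost Explorer API exposes the same recommendations. The following is only a sketch: it assumes the GetSavingsPlansPurchaseRecommendation operation and the summary field names shown in the comments, and the parameter values are purely illustrative:

```python
# Sketch: fetch a Compute Savings Plan purchase recommendation via the
# Cost Explorer API. The summary field names are assumptions based on the
# API's documented response shape; adjust them if your response differs.
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer endpoint

resp = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",       # or "EC2_INSTANCE_SP"
    TermInYears="ONE_YEAR",              # or "THREE_YEARS"
    PaymentOption="NO_UPFRONT",          # or "PARTIAL_UPFRONT" / "ALL_UPFRONT"
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = resp["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Recommended hourly commitment:", summary.get("HourlyCommitmentToPurchase"))
print("Estimated monthly savings:", summary.get("EstimatedMonthlySavingsAmount"))
```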

Available Now
As you can see, Savings Plans are easy to use! You can access compute power at discounts of up to 72%, while gaining the flexibility to change compute services, instance types, operating systems, regions, and so forth.

Savings Plans are available in all AWS regions outside of China, and you can start to purchase (and benefit) from them today!

Jeff;

 

Now Available: New C5d Instance Sizes and Bare Metal Instances

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/now-available-new-c5d-instance-sizes-and-bare-metal-instances/

Amazon EC2 C5 instances are very popular for running compute-heavy workloads like batch processing, distributed analytics, high-performance computing, machine/deep learning inference, ad serving, highly scalable multiplayer gaming, and video encoding.

In 2018, we added blazing fast local NVMe storage, and named these new instances C5d. They are a great fit for applications that need access to high-speed, low latency local storage like video encoding, image manipulation and other forms of media processing. They will also benefit applications that need temporary storage of data, such as batch and log processing and applications that need caches and scratch files.

Just a few weeks ago, we launched new instance sizes and a bare metal option for C5 instances. Today, we are happy to add the same capabilities to the C5d family: 12xlarge, 24xlarge, and a bare metal option.

The new C5d instance sizes run on Intel’s Second Generation Xeon Scalable processors (code-named Cascade Lake) with a sustained all-core turbo frequency of 3.6 GHz and a maximum single-core turbo frequency of 3.9 GHz.

The new processors also enable a new feature called Intel Deep Learning Boost, a capability based on the AVX-512 instruction set. Thanks to the new Vector Neural Network Instructions (AVX-512 VNNI), deep learning frameworks will speed up typical machine learning operations like convolution, and automatically improve inference performance over a wide range of workloads.

These instances are based on the AWS Nitro System, with dedicated hardware accelerators for EBS processing (including crypto operations), the software-defined network inside of each Virtual Private Cloud (VPC), and ENA networking.

New Instance Sizes for C5d: 12xlarge and 24xlarge
Here are the specs:

Instance Name | Logical Processors | Memory  | Local Storage       | EBS-Optimized Bandwidth | Network Bandwidth
c5d.12xlarge  | 48                 | 96 GiB  | 2 x 900 GB NVMe SSD | 7 Gbps                  | 12 Gbps
c5d.24xlarge  | 96                 | 192 GiB | 4 x 900 GB NVMe SSD | 14 Gbps                 | 25 Gbps

Previously, the largest C5d instance available was c5d.18xlarge, with 72 logical processors, 144 GiB of memory, and 1.8 TB of storage. As you can see, the new 24xlarge size increases available resources by 33%, in order to help you crunch those super heavy workloads. Last but not least, customers also get 50% more NVMe storage per logical processor on both 12xlarge and 24xlarge, with up to 3.6 TB of local storage!
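If you want to try one of the new sizes, launching it works the same as any other C5 instance. Here is a minimal sketch using boto3; the AMI ID, key pair, and subnet are hypothetical placeholders, not real values:

```python
# Minimal launch sketch for the new c5d.24xlarge size.
# The AMI ID, key pair, and subnet below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder AMI
    InstanceType="c5d.24xlarge",
    KeyName="my-key-pair",                # placeholder key pair
    SubnetId="subnet-0123456789abcdef0",  # placeholder subnet
    MinCount=1,
    MaxCount=1,
)
print("Launched", resp["Instances"][0]["InstanceId"])
```

Keep in mind that the 4 x 900 GB NVMe drives are instance store volumes: you format and mount them yourself, and their contents do not survive a stop or terminate.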

Bare Metal C5d
As is the case with the existing bare metal instances (M5, M5d, R5, R5d, z1d, and so forth), your operating system runs on the underlying hardware and has direct access to the processor and other hardware.

Bare metal instances can be used to run software with specific requirements, e.g. applications that are exclusively licensed for use on physical, non-virtualized hardware. These instances can also be used to run tools and applications that require access to low-level processor features such as performance counters.

Here are the specs:

Instance Name | Logical Processors | Memory  | Local Storage       | EBS-Optimized Bandwidth | Network Bandwidth
c5d.metal     | 96                 | 192 GiB | 4 x 900 GB NVMe SSD | 14 Gbps                 | 25 Gbps

Bare metal instances can also take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, and other AWS services.

Now Available!
You can start using these new instances today in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Europe (London), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), South America (São Paulo), and AWS GovCloud (US-West).

Please send us feedback, either on the AWS forum for Amazon EC2, or through your usual AWS support contacts.

Julien;