Tag Archives: EC2 Spot Instances

AWS Online Tech Talks – June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2018/

AWS Online Tech Talks – June 2018

Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

 

Analytics & Big Data

June 18, 2018 | 11:00 AM – 11:45 AM PTGet Started with Real-Time Streaming Data in Under 5 Minutes – Learn how to use Amazon Kinesis to capture, store, and analyze streaming data in real-time including IoT device data, VPC flow logs, and clickstream data.
June 20, 2018 | 11:00 AM – 11:45 AM PT – Insights For Everyone – Deploying Data across your Organization – Learn how to deploy data at scale using AWS Analytics and QuickSight’s new reader role and usage based pricing.

 

AWS re:Invent
June 13, 2018 | 05:00 PM – 05:30 PM PTEpisode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.
Compute

June 25, 2018 | 01:00 PM – 01:45 PM PTAccelerating Containerized Workloads with Amazon EC2 Spot Instances – Learn how to efficiently deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances.

June 26, 2018 | 01:00 PM – 01:45 PM PTEnsuring Your Windows Server Workloads Are Well-Architected – Get the benefits, best practices and tools on running your Microsoft Workloads on AWS leveraging a well-architected approach.

 

Containers
June 25, 2018 | 09:00 AM – 09:45 AM PTRunning Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS including how setup masters, networking, security, and add auto-scaling to your cluster.

 

Databases

June 18, 2018 | 01:00 PM – 01:45 PM PTOracle to Amazon Aurora Migration, Step by Step – Learn how to migrate your Oracle database to Amazon Aurora.
DevOps

June 20, 2018 | 09:00 AM – 09:45 AM PTSet Up a CI/CD Pipeline for Deploying Containers Using the AWS Developer Tools – Learn how to set up a CI/CD pipeline for deploying containers using the AWS Developer Tools.

 

Enterprise & Hybrid
June 18, 2018 | 09:00 AM – 09:45 AM PTDe-risking Enterprise Migration with AWS Managed Services – Learn how enterprise customers are de-risking cloud adoption with AWS Managed Services.

June 19, 2018 | 11:00 AM – 11:45 AM PTLaunch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the set up of best practice baselines when setting up new

 

AWS Environments

June 21, 2018 | 11:00 AM – 11:45 AM PTLeading Your Team Through a Cloud Transformation – Learn how you can help lead your organization through a cloud transformation.

June 21, 2018 | 01:00 PM – 01:45 PM PTEnabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.

June 28, 2018 | 01:00 PM – 01:45 PM PTFireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.
IoT

June 27, 2018 | 11:00 AM – 11:45 AM PTAWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.

 

Machine Learning

June 19, 2018 | 09:00 AM – 09:45 AM PTIntegrating Amazon SageMaker into your Enterprise – Learn how to integrate Amazon SageMaker and other AWS Services within an Enterprise environment.

June 21, 2018 | 09:00 AM – 09:45 AM PTBuilding Text Analytics Applications on AWS using Amazon Comprehend – Learn how you can unlock the value of your unstructured data with NLP-based text analytics.

 

Management Tools

June 20, 2018 | 01:00 PM – 01:45 PM PTOptimizing Application Performance and Costs with Auto Scaling – Learn how selecting the right scaling option can help optimize application performance and costs.

 

Mobile
June 25, 2018 | 11:00 AM – 11:45 AM PTDrive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.

 

Security, Identity & Compliance

June 26, 2018 | 09:00 AM – 09:45 AM PTUnderstanding AWS Secrets Manager – Learn how AWS Secrets Manager helps you rotate and manage access to secrets centrally.
June 28, 2018 | 09:00 AM – 09:45 AM PTUsing Amazon Inspector to Discover Potential Security Issues – See how Amazon Inspector can be used to discover security issues of your instances.

 

Serverless

June 19, 2018 | 01:00 PM – 01:45 PM PTProductionize Serverless Application Building and Deployments with AWS SAM – Learn expert tips and techniques for building and deploying serverless applications at scale with AWS SAM.

 

Storage

June 26, 2018 | 11:00 AM – 11:45 AM PTDeep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connecting your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PTChanging the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PTBig Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.

Creating a 1.3 Million vCPU Grid on AWS using EC2 Spot Instances and TIBCO GridServer

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/creating-a-1-3-million-vcpu-grid-on-aws-using-ec2-spot-instances-and-tibco-gridserver/

Many of my colleagues are fortunate to be able to spend a good part of their day sitting down with and listening to our customers, doing their best to understand ways that we can better meet their business and technology needs. This information is treated with extreme care and is used to drive the roadmap for new services and new features.

AWS customers in the financial services industry (often abbreviated as FSI) are looking ahead to the Fundamental Review of Trading Book (FRTB) regulations that will come in to effect between 2019 and 2021. Among other things, these regulations mandate a new approach to the “value at risk” calculations that each financial institution must perform in the four hour time window after trading ends in New York and begins in Tokyo. Today, our customers report this mission-critical calculation consumes on the order of 200,000 vCPUs, growing to between 400K and 800K vCPUs in order to meet the FRTB regulations. While there’s still some debate about the magnitude and frequency with which they’ll need to run this expanded calculation, the overall direction is clear.

Building a Big Grid
In order to make sure that we are ready to help our FSI customers meet these new regulations, we worked with TIBCO to set up and run a proof of concept grid in the AWS Cloud. The periodic nature of the calculation, along with the amount of processing power and storage needed to run it to completion within four hours, make it a great fit for an environment where a vast amount of cost-effective compute power is available on an on-demand basis.

Our customers are already using the TIBCO GridServer on-premises and want to use it in the cloud. This product is designed to run grids at enterprise scale. It runs apps in a virtualized fashion, and accepts requests for resources, dynamically provisioning them on an as-needed basis. The cloud version supports Amazon Linux as well as the PostgreSQL-compatible edition of Amazon Aurora.

Working together with TIBCO, we set out to create a grid that was substantially larger than the current high-end prediction of 800K vCPUs, adding a 50% safety factor and then rounding up to reach 1.3 million vCPUs (5x the size of the largest on-premises grid). With that target in mind, the account limits were raised as follows:

  • Spot Instance Limit – 120,000
  • EBS Volume Limit – 120,000
  • EBS Capacity Limit – 2 PB

If you plan to create a grid of this size, you should also bring your friendly local AWS Solutions Architect into the loop as early as possible. They will review your plans, provide you with architecture guidance, and help you to schedule your run.

Running the Grid
We hit the Go button and launched the grid, watching as it bid for and obtained Spot Instances, each of which booted, initialized, and joined the grid within two minutes. The test workload used the Strata open source analytics & market risk library from OpenGamma and was set up with their assistance.

The grid grew to 61,299 Spot Instances (1.3 million vCPUs drawn from 34 instance types spanning 3 generations of EC2 hardware) as planned, with just 1,937 instances reclaimed and automatically replaced during the run, and cost $30,000 per hour to run, at an average hourly cost of $0.078 per vCPU. If the same instances had been used in On-Demand form, the hourly cost to run the grid would have been approximately $93,000.

Despite the scale of the grid, prices for the EC2 instances did not move during the bidding process. This is due to the overall size of the AWS Cloud and the smooth price change model that we launched late last year.

To give you a sense of the compute power, we computed that this grid would have taken the #1 position on the TOP 500 supercomputer list in November 2007 by a considerable margin, and the #2 position in June 2008. Today, it would occupy position #360 on the list.

I hope that you enjoyed this AWS success story, and that it gives you an idea of the scale that you can achieve in the cloud!

Jeff;

AWS Online Tech Talks – April & Early May 2018

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-april-early-may-2018/

We have several upcoming tech talks in the month of April and early May. Come join us to learn about AWS services and solution offerings. We’ll have AWS experts online to help answer questions in real-time. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

April & early May — 2018 Schedule

Compute

April 30, 2018 | 01:00 PM – 01:45 PM PTBest Practices for Running Amazon EC2 Spot Instances with Amazon EMR (300) – Learn about the best practices for scaling big data workloads as well as process, store, and analyze big data securely and cost effectively with Amazon EMR and Amazon EC2 Spot Instances.

May 1, 2018 | 01:00 PM – 01:45 PM PTHow to Bring Microsoft Apps to AWS (300) – Learn more about how to save significant money by bringing your Microsoft workloads to AWS.

May 2, 2018 | 01:00 PM – 01:45 PM PTDeep Dive on Amazon EC2 Accelerated Computing (300) – Get a technical deep dive on how AWS’ GPU and FGPA-based compute services can help you to optimize and accelerate your ML/DL and HPC workloads in the cloud.

Containers

April 23, 2018 | 11:00 AM – 11:45 AM PTNew Features for Building Powerful Containerized Microservices on AWS (300) – Learn about how this new feature works and how you can start using it to build and run modern, containerized applications on AWS.

Databases

April 23, 2018 | 01:00 PM – 01:45 PM PTElastiCache: Deep Dive Best Practices and Usage Patterns (200) – Learn about Redis-compatible in-memory data store and cache with Amazon ElastiCache.

April 25, 2018 | 01:00 PM – 01:45 PM PTIntro to Open Source Databases on AWS (200) – Learn how to tap the benefits of open source databases on AWS without the administrative hassle.

DevOps

April 25, 2018 | 09:00 AM – 09:45 AM PTDebug your Container and Serverless Applications with AWS X-Ray in 5 Minutes (300) – Learn how AWS X-Ray makes debugging your Container and Serverless applications fun.

Enterprise & Hybrid

April 23, 2018 | 09:00 AM – 09:45 AM PTAn Overview of Best Practices of Large-Scale Migrations (300) – Learn about the tools and best practices on how to migrate to AWS at scale.

April 24, 2018 | 11:00 AM – 11:45 AM PTDeploy your Desktops and Apps on AWS (300) – Learn how to deploy your desktops and apps on AWS with Amazon WorkSpaces and Amazon AppStream 2.0

IoT

May 2, 2018 | 11:00 AM – 11:45 AM PTHow to Easily and Securely Connect Devices to AWS IoT (200) – Learn how to easily and securely connect devices to the cloud and reliably scale to billions of devices and trillions of messages with AWS IoT.

Machine Learning

April 24, 2018 | 09:00 AM – 09:45 AM PT Automate for Efficiency with Amazon Transcribe and Amazon Translate (200) – Learn how you can increase the efficiency and reach your operations with Amazon Translate and Amazon Transcribe.

April 26, 2018 | 09:00 AM – 09:45 AM PT Perform Machine Learning at the IoT Edge using AWS Greengrass and Amazon Sagemaker (200) – Learn more about developing machine learning applications for the IoT edge.

Mobile

April 30, 2018 | 11:00 AM – 11:45 AM PTOffline GraphQL Apps with AWS AppSync (300) – Come learn how to enable real-time and offline data in your applications with GraphQL using AWS AppSync.

Networking

May 2, 2018 | 09:00 AM – 09:45 AM PT Taking Serverless to the Edge (300) – Learn how to run your code closer to your end users in a serverless fashion. Also, David Von Lehman from Aerobatic will discuss how they used [email protected] to reduce latency and cloud costs for their customer’s websites.

Security, Identity & Compliance

April 30, 2018 | 09:00 AM – 09:45 AM PTAmazon GuardDuty – Let’s Attack My Account! (300) – Amazon GuardDuty Test Drive – Practical steps on generating test findings.

May 3, 2018 | 09:00 AM – 09:45 AM PTProtect Your Game Servers from DDoS Attacks (200) – Learn how to use the new AWS Shield Advanced for EC2 to protect your internet-facing game servers against network layer DDoS attacks and application layer attacks of all kinds.

Serverless

April 24, 2018 | 01:00 PM – 01:45 PM PTTips and Tricks for Building and Deploying Serverless Apps In Minutes (200) – Learn how to build and deploy apps in minutes.

Storage

May 1, 2018 | 11:00 AM – 11:45 AM PTBuilding Data Lakes That Cost Less and Deliver Results Faster (300) – Learn how Amazon S3 Select And Amazon Glacier Select increase application performance by up to 400% and reduce total cost of ownership by extending your data lake into cost-effective archive storage.

May 3, 2018 | 11:00 AM – 11:45 AM PTIntegrating On-Premises Vendors with AWS for Backup (300) – Learn how to work with AWS and technology partners to build backup & restore solutions for your on-premises, hybrid, and cloud native environments.

New Amazon EC2 Spot pricing model: Simplified purchasing without bidding and fewer interruptions

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/

Contributed by Deepthi Chelupati and Roshni Pary

Amazon EC2 Spot Instances offer spare compute capacity in the AWS Cloud at steep discounts. Customers—including Yelp, NASA JPL, FINRA, and Autodesk—use Spot Instances to reduce costs and get faster results. Spot Instances provide acceleration, scale, and deep cost savings to big data workloads, containerized applications such as web services, test/dev, and many types of HPC and batch jobs.

At re:Invent 2017, we launched a new pricing model that simplified the Spot purchasing experience. The new model gives you predictable prices that adjust slowly over days and weeks, with typical savings of 70-90% over On-Demand. With the previous pricing model, some of you had to invest time and effort to analyze historical prices to determine your bidding strategy and maximum bid price. Not anymore.

How does the new pricing model work?

You don’t have to bid for Spot Instances in the new pricing model, and you just pay the Spot price that’s in effect for the current hour for the instances that you launch. It’s that simple. Now you can request Spot capacity just like you would request On-Demand capacity, without having to spend time analyzing market prices or setting a maximum bid price.

Previously, Spot Instances were terminated in ascending order of bids, and the Spot price was set to the highest unfulfilled bid. The market prices fluctuated frequently because of this. In the new model, the Spot prices are more predictable, updated less frequently, and are determined by supply and demand for Amazon EC2 spare capacity, not bid prices. You can find the price that’s in effect for the current hour in the EC2 console.

As you can see from the above Spot Instance Pricing History graph (available in the EC2 console under Spot Requests), Spot prices were volatile before the pricing model update. However, after the pricing model update, prices are more predictable and change less frequently.

In the new model, you still have the option to further control costs by submitting a “maximum price” that you are willing to pay in the console when you request Spot Instances:

You can also set your maximum price in EC2 RunInstances or RequestSpotFleet API calls, or in command line requests:

$ aws ec2 run-instances --instance-market-options 
'{"MarketType":"Spot", "SpotOptions": {"SpotPrice": "0.12"}}' \
    --image-id ami-1a2b3c4d --count 1 --instance-type c4.2xlarge

The default maximum price is the On-Demand price and you can continue to set a maximum Spot price of up to 10x the On-Demand price. That means, if you have been running applications on Spot Instances and use the RequestSpotInstances or RequestSpotFleet operations, you can continue to do so. The new Spot pricing model is backward compatible and you do not need to make any changes to your existing applications.

Fewer interruptions

Spot Instances receive a two-minute interruption notice when these instances are about to be reclaimed by EC2, because EC2 needs the capacity back. We have significantly reduced the interruptions with the new pricing model. Now instances are not interrupted because of higher competing bids, and you can enjoy longer workload runtimes. The typical frequency of interruption for Spot Instances in the last 30 days was less than 5% on average.

To reduce the impact of interruptions and optimize Spot Instances, diversify and run your application across multiple capacity pools. Each instance family, each instance size, in each Availability Zone, in every Region is a separate Spot pool. You can use the RequestSpotFleet API operation to launch thousands of Spot Instances and diversify resources automatically. To further reduce the impact of interruptions, you can also set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping or hibernating rather than terminating instances when capacity is no longer available.

Spot Instances are now available in 18 Regions and 51 Availability Zones, and offer 100s of instance options. We have eliminated bidding, simplified the pricing model, and have made it easy to get started with Amazon EC2 Spot Instances for you to take advantage of the largest pool of cost-effective compute capacity in the world. See the Spot Instances detail page for more information and create your Spot Instance here.

AWS Online Tech Talks – January 2018

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-january-2018/

Happy New Year! Kick of 2018 right by expanding your AWS knowledge with a great batch of new Tech Talks. We’re covering some of the biggest launches from re:Invent including Amazon Neptune, Amazon Rekognition Video, AWS Fargate, AWS Cloud9, Amazon Kinesis Video Streams, AWS PrivateLink, AWS Single-Sign On and more!

January 2018– Schedule

Noted below are the upcoming scheduled live, online technical sessions being held during the month of January. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.

Webinars featured this month are:

Monday January 22

Analytics & Big Data
11:00 AM – 11:45 AM PT Analyze your Data Lake, Fast @ Any Scale  Lvl 300

Database
01:00 PM – 01:45 PM PT Deep Dive on Amazon Neptune Lvl 200

Tuesday, January 23

Artificial Intelligence
9:00 AM – 09:45 AM PT  How to get the most out of Amazon Rekognition Video, a deep learning based video analysis service Lvl 300

Containers

11:00 AM – 11:45 AM Introducing AWS Fargate Lvl 200

Serverless
01:00 PM – 02:00 PM PT Overview of Serverless Application Deployment Patterns Lvl 400

Wednesday, January 24

DevOps
09:00 AM – 09:45 AM PT Introducing AWS Cloud9  Lvl 200

Analytics & Big Data
11:00 AM – 11:45 AM PT Deep Dive: Amazon Kinesis Video Streams
Lvl 300
Database
01:00 PM – 01:45 PM PT Introducing Amazon Aurora with PostgreSQL Compatibility Lvl 200

Thursday, January 25

Artificial Intelligence
09:00 AM – 09:45 AM PT Introducing Amazon SageMaker Lvl 200

Mobile
11:00 AM – 11:45 AM PT Ionic and React Hybrid Web/Native Mobile Applications with Mobile Hub Lvl 200

IoT
01:00 PM – 01:45 PM PT Connected Product Development: Secure Cloud & Local Connectivity for Microcontroller-based Devices Lvl 200

Monday, January 29

Enterprise
11:00 AM – 11:45 AM PT Enterprise Solutions Best Practices 100 Achieving Business Value with AWS Lvl 100

Compute
01:00 PM – 01:45 PM PT Introduction to Amazon Lightsail Lvl 200

Tuesday, January 30

Security, Identity & Compliance
09:00 AM – 09:45 AM PT Introducing Managed Rules for AWS WAF Lvl 200

Storage
11:00 AM – 11:45 AM PT  Improving Backup & DR – AWS Storage Gateway Lvl 300

Compute
01:00 PM – 01:45 PM PT  Introducing the New Simplified Access Model for EC2 Spot Instances Lvl 200

Wednesday, January 31

Networking
09:00 AM – 09:45 AM PT  Deep Dive on AWS PrivateLink Lvl 300

Enterprise
11:00 AM – 11:45 AM PT Preparing Your Team for a Cloud Transformation Lvl 200

Compute
01:00 PM – 01:45 PM PT  The Nitro Project: Next-Generation EC2 Infrastructure Lvl 300

Thursday, February 1

Security, Identity & Compliance
09:00 AM – 09:45 AM PT  Deep Dive on AWS Single Sign-On Lvl 300

Storage
11:00 AM – 11:45 AM PT How to Build a Data Lake in Amazon S3 & Amazon Glacier Lvl 300

Amazon EC2 Update – Streamlined Access to Spot Capacity, Smooth Price Changes, Instance Hibernation

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-ec2-update-streamlined-access-to-spot-capacity-smooth-price-changes-instance-hibernation/

EC2 Spot Instances give you access to spare compute capacity in the AWS Cloud. Our customers use fleets of Spot Instances to power their CI/CD environments & traffic generators, host web servers & microservices, render movies, and to run many types of analytics jobs, all at prices that offer significant savings in comparison to On-Demand Instances.

New Streamlined Access
Today we are introducing a new, streamlined access model for Spot Instances. You simply indicate your desire to use Spot capacity when you launch an instance via the RunInstances function, the run-instances command, or the AWS Management Console to submit a request that will be fulfilled as long as the capacity is available. With no extra effort on your part you’ll save up to 90% off of the On-Demand price for the instance type, allowing you to boost your overall application throughput by up to 10x for the same budget. The instances that you launch in this way will continue to run until you terminate them or if EC2 needs to reclaim them for On-Demand usage. At that point the instance will be given the usual 2-minute warning and then reclaimed, making this a great fit for applications that are fault-tolerant.

Unlike the old model which required an understanding of Spot markets, bidding, and calls to a standalone asynchronous API, the new model is synchronous and as easy to use as On-Demand. Your code or your script receives an Instance ID immediately and need not check back to see if the request has been processed and accepted.

We’ve made this as clean and as simple as possible, with the expectation that it will be easy to modify many current scripts and applications to request and make use of Spot capacity. If you want to exercise additional control over your Spot instance budget, you have the option to specify a maximum price when you make a request for capacity. If you use Spot capacity to power your Amazon EMR, Amazon ECS, or AWS Batch clusters, or if you launch Spot instances by way of a AWS CloudFormation template or Auto Scaling Group, you will benefit from this new model without having to make any changes.

Applications that are built around RequestSpotInstances or RequestSpotFleet will continue to work just fine with no changes. However, you now have the option to make requests that do not include the SpotPrice parameter.

Smooth Price Changes
As part of today’s launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand. As I mentioned earlier, you will continue to save an average of 70-90% off the On-Demand price, and you will continue to pay the Spot price that’s in effect for the time period your instances are running. Applications built around our Spot Fleet feature will continue to automatically diversify placement of their Spot Instances across the most cost-effective pools based on the configuration you specified when you created the fleet.

Spot in Action
To launch a Spot Instance from the command line; simply specify the Spot market:

$ aws ec2 run-instances –-market Spot --image-id ami-1a2b3c4d --count 1 --instance-type c3.large 

Instance Hibernation
If you run workloads that keep a lot of state in memory, you will love this new feature!

You can arrange for instances to save their in-memory state when they are reclaimed, allowing the instances and the applications on them to pick up where they left off when capacity is once again available, just like closing and then opening your laptop. This feature works on C3, C4, and certain sizes of R3, R4, and M4 instances running Amazon Linux, Ubuntu, or Windows Server, and is supported by the EC2 Hibernation Agent.

The in-memory state is written to the root EBS volume of the instance using space that is set-aside when the instance launches. The private IP address and any Elastic IP Addresses are also preserved across a stop/start cycle.

Jeff;

Creating a Cost-Efficient Amazon ECS Cluster for Scheduled Tasks

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or other tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances.  Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose these logs to be delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

  • Amazon ECR repository for storing the Docker image to be used in the task definition
  • S3 bucket that holds the transferred logs
  • Task definition, with image name and S3 bucket as environment variables provided via input parameter
  • CloudWatch Events rule
  • Amazon ECS cluster
  • Amazon ECS container instances using Spot Fleet
  • IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

  1. Clone the code from GitHub that performs RDS API calls to retrieve the log files.
    git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
  2. Build and tag the image.
    cd aws-ecs-scheduled-tasks/container-code/src && ls

    Dockerfile		rdslogsshipper.py	requirements.txt

    docker build -t rdslogsshipper .

    Sending build context to Docker daemon 9.728 kB
    Step 1 : FROM python:3
     ---> 41397f4f2887
    Step 2 : WORKDIR /usr/src/app
     ---> Using cache
     ---> 59299c020e7e
    Step 3 : COPY requirements.txt ./
     ---> 8c017e931c3b
    Removing intermediate container df09e1bed9f2
    Step 4 : COPY rdslogsshipper.py /usr/src/app
     ---> 099a49ca4325
    Removing intermediate container 1b1da24a6699
    Step 5 : RUN pip install --no-cache-dir -r requirements.txt
     ---> Running in 3ed98b30901d
    Collecting boto3 (from -r requirements.txt (line 1))
      Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB)
    Collecting botocore (from -r requirements.txt (line 2))
      Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB)
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1))
      Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB)
    Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1))
      Downloading jmespath-0.9.3-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2))
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2))
      Downloading docutils-0.14-py3-none-any.whl (543kB)
    Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2))
      Downloading six-1.10.0-py2.py3-none-any.whl
    Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3
    Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0
     ---> f892d3cb7383
    Removing intermediate container 3ed98b30901d
    Step 6 : COPY . .
     ---> ea7550c04fea
    Removing intermediate container b558b3ebd406
    Successfully built ea7550c04fea
  3. Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
  4. Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
  5. Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
    aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule  --schedule-expression "rate(15 minutes)"

    {
        "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule"
    }
  6. CloudWatch requires IAM permissions to place a task on the Amazon ECS cluster when the CloudWatch event rule is executed, in addition to an IAM role that can be assumed by CloudWatch Events. This is done in three steps:
    1. Create the IAM role to be assumed by CloudWatch.
      aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json

      {
          "Role": {
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17", 
                  "Statement": [
                      {
                          "Action": "sts:AssumeRole", 
                          "Effect": "Allow", 
                          "Principal": {
                              "Service": "events.amazonaws.com"
                          }
                      }
                  ]
              }, 
              "RoleId": "AROAIRYYLDCVZCUACT7FS", 
              "CreateDate": "2017-07-14T22:44:52.627Z", 
              "RoleName": "Test-Role", 
              "Path": "/", 
              "Arn": "arn:aws:iam::12345678901:role/Test-Role"
          }
      }

      The following is an example of the event-role.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                    "Service": "events.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
      aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json

      {
          "Policy": {
              "PolicyName": "test-policy", 
              "CreateDate": "2017-07-14T22:51:20.293Z", 
              "AttachmentCount": 0, 
              "IsAttachable": true, 
              "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", 
              "DefaultVersionId": "v1", 
              "Path": "/", 
              "Arn": "arn:aws:iam::123455678901:policy/test-policy", 
              "UpdateDate": "2017-07-14T22:51:20.293Z"
          }
      }

      The following is an example of the event-policy.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecs:RunTask"
                ],
                "Resource": [
                    "arn:aws:ecs:*::task-definition/"
                ],
                "Condition": {
                    "ArnLike": {
                        "ecs:cluster": "arn:aws:ecs:*::cluster/"
                    }
                }
            }
          ]
      }
    3. Attach the IAM policy to the role.
      aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
  7. Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
    aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"

    {
        "FailedEntries": [], 
        "FailedEntryCount": 0
    }

That’s it. The logs now run based on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run outside of Lambda execution time limits or be not cost-effective for an EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.

Natural Language Processing at Clemson University – 1.1 Million vCPUs & EC2 Spot Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/

My colleague Sanjay Padhi shared the guest post below in order to recognize an important milestone in the use of EC2 Spot Instances.

Jeff;


A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing, breaking the record for creating the largest high-performance cluster by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS region. The researchers conducted nearly half a million topic modeling experiments to study how human language is processed by computers. Topic modeling helps in discovering the underlying themes that are present across a collection of documents. Topic models are important because they are used to forecast business trends and help in making policy or funding decisions. These topic models can be run with many different parameters and the goal of the experiments is to explore how these parameters affect the model outputs.

The Experiment
Professor Amy Apon, Co-Director of the Complex Systems, Analytics and Visualization Institute at Clemson University with Professor Alexander Herzog and graduate students Brandon Posey and Christopher Gropp in collaboration with members of the AWS team as well as AWS Partner Omnibond performed the experiments.  They used software infrastructure based on CloudyCluster that provisions high performance computing clusters on dynamically allocated AWS resources using Amazon EC2 Spot Fleet. Spot Fleet is a collection of biddable spot instances in EC2 responsible for maintaining a target capacity specified during the request. The SLURM scheduler was used as an overlay virtual workload manager for the data analytics workflows. The team developed additional provisioning and workflow automation software as shown below for the design and orchestration of the experiments. This setup allowed them to evaluate various topic models on different data sets with massively parallel parameter sweeps on dynamically allocated AWS resources. This framework can easily be used beyond the current study for other scientific applications that use parallel computing.

Ramping to 1.1 Million vCPUs
The figure below shows elastic, automatic expansion of resources as a function of time, in the US East (Northern Virginia) Region. At just after 21:40 (GMT-1) on Aug. 26, 2017, the number of vCPUs utilized was 1,119,196. Clemson researchers also took advantage of the new per-second billing for the EC2 instances that they launched. The vCPU count usage is comparable to the core count on the largest supercomputers in the world.

Here’s the breakdown of the EC2 instance types that they used:

Campus resources at Clemson funded by the National Science Foundation were used to determine an effective configuration for the AWS experiments as compared to campus resources, and the AWS cloud resources complement the campus resources for large-scale experiments.

Meet the Team
Here’s the team that ran the experiment (Professor Alexander Herzog, graduate students Christopher Gropp and Brandon Posey, and Professor Amy Apon):

Professor Apon said about the experiment:

I am absolutely thrilled with the outcome of this experiment. The graduate students on the project are amazing. They used resources from AWS and Omnibond and developed a new software infrastructure to perform research at a scale and time-to-completion not possible with only campus resources. Per-second billing was a key enabler of these experiments.

Boyd Wilson (CEO, Omnibond, member of the AWS Partner Network) told me:

Participating in this project was exciting, seeing how the Clemson team developed a provisioning and workflow automation tool that tied into CloudyCluster to build a huge Spot Fleet supercomputer in a single region in AWS was outstanding.

About the Experiment
The experiments test parameter combinations on a range of topics and other parameters used in the topic model. The topic model outputs are stored in Amazon S3 and are currently being analyzed. The models have been applied to 17 years of computer science journal abstracts (533,560 documents and 32,551,540 words) and full text papers from the NIPS (Neural Information Processing Systems) Conference (2,484 documents and 3,280,697 words). This study allows the research team to systematically measure and analyze the impact of parameters and model selection on model convergence, topic composition and quality.

Looking Forward
This study constitutes an interaction between computer science, artificial intelligence, and high performance computing. Papers describing the full study are being submitted for peer-reviewed publication. I hope that you enjoyed this brief insight into the ways in which AWS is helping to break the boundaries in the frontiers of natural language processing!

Sanjay Padhi, Ph.D, AWS Research and Technical Computing

 

New – Stop & Resume Workloads on EC2 Spot Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-stop-resume-workloads-on-ec2-spot-instances/

EC2 Spot Instances give you access to spare EC2 compute capacity at up to 90% off of the On-Demand rates. Starting with the ability to request a specific number of instances of a particular size, we made Spot Instances even more useful and flexible with support for Spot Fleets and Auto Scaling Spot Fleets, allowing you to maintain any desired level of compute capacity.

EC2 users have long had the ability to stop running instances while leaving EBS volumes attached, opening the door to applications that automatically pick up where they left off when the instance starts running again.

Stop and Resume Spot Workloads
Today we are blending these two important features, allowing you to set up Spot bids and Spot Fleets that respond by stopping (rather than terminating) instances when capacity is no longer available at or below your bid price. EBS volumes attached to stopped instances remain intact, as does the EBS-backed root volume. When capacity becomes available, the instances are started and can keep on going without having to spend time provisioning applications, setting up EBS volumes, downloading data, joining network domains, and so forth.

Many AWS customers have enhanced their applications to create and make use of checkpoints, adding some resilience and gaining the ability to take advantage of EC2’s start/stop feature in the process. These customers will now be able to run these applications on Spot Instances, with savings that average 70-90%.

While the instances are stopped, you can modify the EBS Optimization, User data, Ramdisk ID, and Delete on Termination attributes. Stopped Spot Instances do not incur any charges for compute time; space for attached EBS volumes is charged at the usual rates.

Here’s how you create a Spot bid or Spot Fleet and specify the use of stop/start:

Things to Know
This feature is available now and you can start using it today in all AWS Regions where Spot Instances are available. It is designed to work well in conjunction with the new per-second billing for EC2 instances and EBS volumes, with the potential for another dimension of cost savings over and above that provided by Spot Instances.

EBS volumes always exist within a particular Availability Zone (AZ). As a result, Spot and Spot Fleet requests that specify a particular AZ will always restart in that AZ.

Take care when using this feature in conjunction with Spot Fleets that have the potential to span a wide variety of instance types. Because the composition of the fleet can change over time, you need to pay attention to your account’s limits for IP addresses and EBS volumes.

I’m looking forward to hearing about the new and creative uses that you’ll come up with for this feature. If you thought that your application was not a good fit for Spot Instances, or if the overhead needed to handle interruptions was too high, it is time to take another look!

Jeff;

 

Deadline 10 – Launch a Rendering Fleet in AWS

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/deadline-10-launch-a-rendering-fleet-in-aws/

Graphical rendering is a compute-intensive task that is, as they say, embarrassingly parallel. Looked at another way, this means that there’s a more or less linear relationship between the number of processors that are working on the problem and the overall wall-clock time that it takes to complete the task. In a creative endeavor such as movie-making, getting the results faster spurs creativity, improves the feedback loop, gives you time to make more iterations and trials, and leads to a better result. Even if you have a render farm in-house, you may still want to turn to the cloud in order to gain access to more compute power at peak times. Once you do this, the next challenge is to manage the combination of in-house resources, cloud resources, and the digital assets in a unified fashion.

Deadline 10
Earlier this week we launched Deadline 10, a powerful render management system. Building on technology that we brought on board with the acquisition of Thinkbox Software, Deadline 10 is designed to extend existing on-premises rendering into the AWS Cloud, giving you elasticity and flexibility while remaining simple and easy to use. You can set up and manage large-scale distributed jobs that span multiple AWS regions and benefit from elastic, usage-based AWS licensing for popular applications like Deadline for Autodesk 3ds Max, Maya, Arnold, and dozens more, all available from the Thinkbox Marketplace. You can purchase software licenses from the marketplace, use your existing licenses, or use them together.

Deadline 10 obtains cloud-based compute resources by managing bids for EC2 Spot Instances, providing you with access to enough low-cost compute capacity to let your imagination run wild! It uses your existing AWS account, tags EC2 instances for tracking, and synchronizes your local assets to the cloud before rendering begins.

A Quick Tour
Let’s take a quick tour of Deadline 10 and see how it makes use of AWS. The AWS Portal is available from the View menu:

The first step is to log in to my AWS account:

Then I configure the connection server, license server, and the S3 bucket that will be used to store rendering assets:

Next, I set up my Spot fleet, establishing a maximum price per hour for each EC2 instance, setting target capacity, and choosing the desired rendering application:

I can also choose any desired combination of EC2 instance types:

When I am ready to render I click on Start Spot Fleet:

This will initiate the process of bidding for and managing Spot Instances. The running instances are visible from the Portal:

I can monitor the progress of my rendering pipeline:

I can stop my Spot fleet when I no longer need it:

Deadline 10 is now available for usage based license customers; a new license is needed for traditional floating license users. Pricing for yearly Deadline licenses has been reduced to $48 annually. If you are already using an earlier version of Deadline, feel free to contact us to learn more about licensing options.

Jeff;

AWS Online Tech Talks – May 2017

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-2017/

Spring has officially sprung. As you enjoy the blossoming of May flowers, it may be worthy to also note some of the great tech talks blossoming online during the month of May. This month’s AWS Online Tech Talks features sessions on topics like AI, DevOps, Data, and Serverless just to name a few.

May 2017 – Schedule

Below is the upcoming schedule for the live, online technical sessions scheduled for the month of May. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts. All schedule times for the online tech talks are shown in the Pacific Time (PDT) time zone.

Webinars featured this month are:

Monday, May 15

Artificial Intelligence

9:00 AM – 10:00 AM: Integrate Your Amazon Lex Chatbot with Any Messaging Service

 

Tuesday, May 16

Compute

10:30 AM – 11:30 AM: Deep Dive on Amazon EC2 F1 Instance

IoT

12:00 Noon – 1:00 PM: How to Connect Your Own Creations with AWS IoT

Wednesday, May 17

Management Tools

9:00 AM – 10:00 AM: OpsWorks for Chef Automate – Automation Made Easy!

Serverless

10:30 AM – 11:30 AM: Serverless Orchestration with AWS Step Functions

Enterprise & Hybrid

12:00 Noon – 1:00 PM: Moving to the AWS Cloud: An Overview of the AWS Cloud Adoption Framework

 

Thursday, May 18

Compute

9:00 AM – 10:00 AM: Scaling Up Tenfold with Amazon EC2 Spot Instances

Big Data

10:30 AM – 11:30 AM: Building Analytics Pipelines for Games on AWS

12:00 Noon – 1:00 PM: Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight

 

Monday, May 22

Artificial Intelligence

9:00 AM – 10:00 AM: What’s New with Amazon Rekognition

Serverless

10:30 AM – 11:30 AM: Building Serverless Web Applications

 

Tuesday, May 23

Hands-On Lab

8:30 – 10:00 AM: Hands On Lab: Windows Workloads on AWS

Big Data

10:30 AM – 11:30 AM: Streaming ETL for Data Lakes using Amazon Kinesis Firehose

DevOps

12:00 Noon – 1:00 PM: Deep Dive: Continuous Delivery for AI Applications with ECS

 

Wednesday, May 24

Storage

9:00 – 10:00 AM: Moving Data into the Cloud with AWS Transfer Services

Containers

12:00 Noon – 1:00 PM: Building a CICD Pipeline for Container Deployment to Amazon ECS

 

Thursday, May 25

Mobile

9:00 – 10:00 AM: Test Your Android App with Espresso and AWS Device Farm

Security & Identity

10:30 AM – 11:30 AM: Advanced Techniques for Federation of the AWS Management Console and Command Line Interface (CLI)

 

Tuesday, May 30

Databases

9:00 – 10:00 AM: DynamoDB: Architectural Patterns and Best Practices for Infinitely Scalable Applications

Compute

10:30 AM – 11:30 AM: Deep Dive on Amazon EC2 Elastic GPUs

Security & Identity

12:00 Noon – 1:00 PM: Securing Your AWS Infrastructure with Edge Services

 

Wednesday, May 31

Hands-On Lab

8:30 – 10:00 AM: Hands On Lab: Introduction to Microsoft SQL Server in AWS

Enterprise & Hybrid

10:30 AM – 11:30 AM: Best Practices in Planning a Large-Scale Migration to AWS

Databases

12:00 Noon – 1:00 PM: Convert and Migrate Your NoSQL Database or Data Warehouse to AWS

 

The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations & customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.

Tara

Creating a Simple “Fetch & Run” AWS Batch Job

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/creating-a-simple-fetch-and-run-aws-batch-job/

Dougal Ballantyne
Dougal Ballantyne, Principal Product Manager – AWS Batch

Docker enables you to create highly customized images that are used to execute your jobs. These images allow you to easily share complex applications between teams and even organizations. However, sometimes you might just need to run a script!

This post details the steps to create and run a simple “fetch & run” job in AWS Batch. AWS Batch executes jobs as Docker containers using Amazon ECS. You build a simple Docker image containing a helper application that can download your script or even a zip file from Amazon S3. AWS Batch then launches an instance of your container image to retrieve your script and run your job.

AWS Batch overview

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.

With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 Spot Instances.

“Fetch & run” walkthrough

The following steps get everything working:

  • Build a Docker image with the fetch & run script
  • Create an Amazon ECR repository for the image
  • Push the built image to ECR
  • Create a simple job script and upload it to S3
  • Create an IAM role to be used by jobs to access S3
  • Create a job definition that uses the built image
  • Submit and run a job that execute the job script from S3

Prerequisites

Before you get started, there a few things to prepare. If this is the first time you have used AWS Batch, you should follow the Getting Started Guide and ensure you have a valid job queue and compute environment.

After you are up and running with AWS Batch, the next thing is to have an environment to build and register the Docker image to be used. For this post, register this image in an ECR repository. This is a private repository by default and can easily be used by AWS Batch jobs

You also need a working Docker environment to complete the walkthrough. For the examples, I used Docker for Mac. Alternatively, you could easily launch an EC2 instance running Amazon Linux and install Docker.

You need the AWS CLI installed. For more information, see Installing the AWS Command Line Interface.

Building the fetch & run Docker image

The fetch & run Docker image is based on Amazon Linux. It includes a simple script that reads some environment variables and then uses the AWS CLI to download the job script (or zip file) to be executed.

To get started, download the source code from the aws-batch-helpers GitHub repository. The following link pulls the latest version: https://github.com/awslabs/aws-batch-helpers/archive/master.zip. Unzip the downloaded file and navigate to the “fetch-and-run” folder. Inside this folder are two files:

  • Dockerfile
  • fetchandrun.sh

Dockerfile is used by Docker to build an image. Look at the contents; you should see something like the following:

FROM amazonlinux:latest

RUN yum -y install unzip aws-cli
ADD fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
WORKDIR /tmp
USER nobody

ENTRYPOINT ["/usr/local/bin/fetch_and_run.sh"]
  • The FROM line instructs Docker to pull the base image from the amazonlinux repository, using the latest tag.
  • The RUN line executes a shell command as part of the image build process.
  • The ADD line, copies the fetchandrun.sh script into the /usr/local/bin directory inside the image.
  • The WORKDIR line, sets the default directory to /tmp when the image is used to start a container.
  • The USER line sets the default user that the container executes as.
  • Finally, the ENTRYPOINT line instructs Docker to call the /usr/local/bin/fetchandrun.sh script when it starts the container. When running as an AWS Batch job, it is passed the contents of the command parameter.

Now, build the Docker image! Assuming that the docker command is in your PATH and you don’t need sudo to access it, you can build the image with the following command (note the dot at the end of the command):

docker build -t awsbatch/fetch_and_run .   

This command should produce an output similar to the following:

Sending build context to Docker daemon 373.8 kB

Step 1/6 : FROM amazonlinux:latest
latest: Pulling from library/amazonlinux
c9141092a50d: Pull complete
Digest: sha256:2010c88ac1e7c118d61793eec71dcfe0e276d72b38dd86bd3e49da1f8c48bf54
Status: Downloaded newer image for amazonlinux:latest
 ---> 8ae6f52035b5
Step 2/6 : RUN yum -y install unzip aws-cli
 ---> Running in e49cba995ea6
Loaded plugins: ovl, priorities
Resolving Dependencies
--> Running transaction check
---> Package aws-cli.noarch 0:1.11.29-1.45.amzn1 will be installed

  << removed for brevity >>

Complete!
 ---> b30dfc9b1b0e
Removing intermediate container e49cba995ea6
Step 3/6 : ADD fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
 ---> 256343139922
Removing intermediate container 326092094ede
Step 4/6 : WORKDIR /tmp
 ---> 5a8660e40d85
Removing intermediate container b48a7b9c7b74
Step 5/6 : USER nobody
 ---> Running in 72c2be3af547
 ---> fb17633a64fe
Removing intermediate container 72c2be3af547
Step 6/6 : ENTRYPOINT /usr/local/bin/fetch_and_run.sh
 ---> Running in aa454b301d37
 ---> fe753d94c372

Removing intermediate container aa454b301d37
Successfully built 9aa226c28efc

In addition, you should see a new local repository called fetchandrun, when you run the following command:

docker images
REPOSITORY               TAG              IMAGE ID            CREATED             SIZE
awsbatch/fetch_and_run   latest           9aa226c28efc        19 seconds ago      374 MB
amazonlinux              latest           8ae6f52035b5        5 weeks ago         292 MB

To add more packages to the image, you could update the RUN line or add a second one, right after it.

Creating an ECR repository

The next step is to create an ECR repository to store the Docker image, so that it can be retrieved by AWS Batch when running jobs.

  1. In the ECR console, choose Get Started or Create repository.
  2. Enter a name for the repository, for example: awsbatch/fetchandrun.
  3. Choose Next step and follow the instructions.

    fetchAndRunBatch_1.png

You can keep the console open, as the tips can be helpful.

Push the built image to ECR

Now that you have a Docker image and an ECR repository, it is time to push the image to the repository. Use the following AWS CLI commands, if you have used the previous example names. Replace the AWS account number in red with your own account.

aws ecr get-login --region us-east-1

docker tag awsbatch/fetch_and_run:latest 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest

docker push 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest

Create a simple job script and upload to S3

Next, create and upload a simple job script that is executed using the fetchandrun image that you just built and registered in ECR. Start by creating a file called myjob.sh with the example content below:

!/bin/bash

date
echo "Args: [email protected]"
env
echo "This is my simple test job!."
echo "jobId: $AWS_BATCH_JOB_ID"
echo "jobQueue: $AWS_BATCH_JQ_NAME"
echo "computeEnvironment: $AWS_BATCH_CE_NAME"
sleep $1
date
echo "bye bye!!"

Upload the script to an S3 bucket.

aws s3 cp myjob.sh s3://<bucket>/myjob.sh

Create an IAM role

When the fetchandrun image runs as an AWS Batch job, it fetches the job script from Amazon S3. You need an IAM role that the AWS Batch job can use to access S3.

  1. In the IAM console, choose Roles, Create New Role.
  2. Enter a name for your new role, for example: batchJobRole, and choose Next Step.
  3. For Role Type, under AWS Service Roles, choose Select next to “Amazon EC2 Container Service Task Role” and then choose Next Step.

    fetchAndRunBatch_2.png

  4. On the Attach Policy page, type “AmazonS3ReadOnlyAccess” into the Filter field and then select the check box for that policy.

    fetchAndRunBatch_3.png

  5. Choose Next Step, Create Role. You see the details of the new role.

    fetchAndRunBatch_4.png

Create a job definition

Now that you’ve have created all the resources needed, pull everything together and build a job definition that you can use to run one or many AWS Batch jobs.

  1. In the AWS Batch console, choose Job Definitions, Create.
  2. For the Job Definition, enter a name, for example, fetchandrun.
  3. For IAM Role, choose the role that you created earlier, batchJobRole.
  4. For ECR Repository URI, enter the URI where the fetchandrun image was pushed, for example: 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetchandrun.
  5. Leave the Command field blank.
  6. For vCPUs, enter 1. For Memory, enter 500.

    fetchAndRunBatch_5.png

  7. For User, enter “nobody”.

  8. Choose Create job definition.

Submit and run a job

Now, submit and run a job that uses the fetchandrun image to download the job script and execute it.

  1. In the AWS Batch console, choose Jobs, Submit Job.
  2. Enter a name for the job, for example: script_test.
  3. Choose the latest fetchandrun job definition.
  4. For Job Queue, choose a queue, for example: first-run-job-queue.
  5. For Command, enter myjob.sh,60.
  6. Choose Validate Command.

    fetchAndRunBatch_6.png

  7. Enter the following environment variables and then choose Submit job.

    • Key=BATCHFILETYPE, Value=script
    • Key=BATCHFILES3_URL, Value=s3:///myjob.sh. Don’t forget to use the correct URL for your file.

    fetchAndRunBatch_7.png

  8. After the job is completed, check the final status in the console.

    fetchAndRunBatch_8.png

  9. In the job details page, you can also choose View logs for this job in CloudWatch console to see your job log.

    fetchAndRunBatch_9.png

How the fetch and run image works

The fetchandrun image works as a combination of the Docker ENTRYPOINT and COMMAND feature, and a shell script that reads environment variables set as part of the AWS Batch job. When building the Docker image, it starts with a base image from Amazon Linux and installs a few packages from the yum repository. This becomes the execution environment for the job.

If the script you planned to run needed more packages, you would add them using the RUN parameter in the Dockerfile. You could even change it to a different base image such as Ubuntu, by updating the FROM parameter.

Next, the fetchandrun.sh script is added to the image and set as the container ENTRYPOINT. The script simply reads some environment variables and then downloads and runs the script/zip file from S3. It is looking for the following environment variables BATCHFILETYPE and BATCHFILES3URL. If you run fetchand_run.sh, with no environment variables, you get the following usage message:

  • BATCHFILETYPE not set, unable to determine type (zip/script) of

Usage:

export BATCH_FILE_TYPE="script"

export BATCH_FILE_S3_URL="s3://my-bucket/my-script"

fetch_and_run.sh script-from-s3 [ <script arguments> ]

– or –

export BATCH_FILE_TYPE="zip"

export BATCH_FILE_S3_URL="s3://my-bucket/my-zip"

fetch_and_run.sh script-from-zip [ <script arguments> ]

This shows that it supports two values for BATCHFILETYPE, either “script” or “zip”. When you set “script”, it causes fetchandrun.sh to download a single file and then execute it, in addition to passing in any further arguments to the script. If you set it to “zip”, this causes fetchandrun.sh to download a zip file, then unpack it and execute the script name passed and any further arguments. You can use the “zip” option to pass more complex jobs with all the applications dependencies in one file.

Finally, the ENTRYPOINT parameter tells Docker to execute the /usr/local/bin/fetchandrun.sh script when creating a container. In addition, it passes the contents of the COMMAND parameter as arguments to the script. This is what enables you to pass the script and arguments to be executed by the fetchandrun image with the Command field in the SubmitJob API action call.

Summary

In this post, I detailed the steps to create and run a simple “fetch & run” job in AWS Batch. You can now easily use the same job definition to run as many jobs as you need by uploading a job script to Amazon S3 and calling SubmitJob with the appropriate environment variables.

If you have questions or suggestions, please comment below.

Registering Spot Instances with AWS OpsWorks Stacks

Post Syndicated from Amir Golan original https://aws.amazon.com/blogs/devops/registering-spot-instances-with-aws-opsworks-stacks/

AWS OpsWorks Stacks is a configuration management service that helps you configure and operate applications of all shapes and sizes using Chef. You can define the application’s architecture and the specification of each component, including package installation, software configuration, and more.

Amazon EC2 Spot instances allow you to bid on spare Amazon EC2 computing capacity. Because Spot instances are often available at a discount compared to On-Demand instances, you can significantly reduce the cost of running your applications, grow your applications’ compute capacity and throughput for the same budget, and enable new types of cloud computing applications.

You can use Spot instances with AWS OpsWorks Stacks in the following ways:

  • As a part of an Auto Scaling group, as described in this blog post. You can follow the steps in the blog post and choose to use Spot instance option, in the launch configuration described in step 5.
  • To provision a Spot instance in the EC2 console and have it automatically register with an OpsWorks stack as described here.

The example used in this post will require you to create the following resources:

IAM instance profile: an IAM profile that grants your instances permission to register themselves with OpsWorks.

Lambda function: a function that deregisters your instances from an OpsWorks stack.

Spot instance: the Spot instance that will run your application.

CloudWatch Event role: an event that will trigger the Lambda function whenever your Spot instance is terminated.

Step 1: Create an IAM instance profile

When a Spot instance starts, it must be able to make an API call to register itself with an OpsWorks stack. By assigning an instance with an IAM instance profile, the instance will be able to make calls to OpsWorks.

Open the IAM console at https://console.aws.amazon.com/iam/, choose Roles, and then choose Create New Role. Type a name for the role, and then choose Next Step. Choose the Amazon EC2 role, and then select the check box next to the AWSOpsWorksInstanceRegistration policy. Finally, select Next Step, and then choose Create Role. As its name suggests, the AWSOpsWorksInstanceRegistration policy will allow the instance to make API calls only to register an instance. Copy the following policy to the new role you’ve just created.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "opsworks:AssignInstance",
                "opsworks:DescribeInstances",
 "ec2:CreateTags"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

1b

Step 2: Create a Lambda function

This Lambda function is responsible for deregistering an instance from your OpsWorks stack. It will be invoked whenever the Spot instance is terminated.

Open the AWS Lambda console at https://us-west-2.console.aws.amazon.com/lambda/home, and choose the option to create a Lambda function. If you are prompted to choose a blueprint, choose Skip. Type a name for the Lambda function, and from the Runtime drop-down list, select Python 2.7.

Next, paste the following code into the Lambda Function Code text box:

import boto3

def lambda_handler(event, context):
    ec2_instance_id = event['detail']['instance-id']
    ec2 = boto3.client('ec2')
    for tag in ec2.describe_instances(InstanceIds=[ec2_instance_id])['Reservations'][0]['Instances'][0]['Tags']:
        if (tag['Key'] == 'opsworks_stack_id'):
            opsworks_stack_id = tag['Value']
            opsworks = boto3.client('opsworks', 'us-east-1')
            for instance in opsworks.describe_instances(StackId=opsworks_stack_id)['Instances']:
                if ('Ec2InstanceId' in instance):
                    if (instance['Ec2InstanceId'] == ec2_instance_id):
                        print("Deregistering OpsWorks instance " + instance['InstanceId'])
                        opsworks.deregister_instance(InstanceId=instance['InstanceId'])
    return ec2_instance_id

The result should look like this:

2

Step 3: Create a CloudWatch event

Whenever the Spot instance is terminated, we need to trigger the Lambda function from step 2 to deregister the instance from its associated stack.

Open the AWS CloudWatch console at https://console.aws.amazon.com/cloudwatch/home, choose Events, and then choose the Create rule button. Choose Amazon EC2 from the event selector. Select Specific state(s), and choose Terminated. Select the Lambda function you created earlier as the target. Finally, choose the Configure details button.

3b

Step 4: Create a Spot instance

Open the EC2 console at https://console.aws.amazon.com/ec2sp/v1/spot/home, and choose the Request Spot Instances button. Use the latest release of Amazon Linux. On the details page, under IAM instance profile, choose the instance profile you created in step 1. Paste the following script into the User data field:

#!/bin/bash
sed -i'' -e 's/.*requiretty.*//' /etc/sudoers
pip install --upgrade awscli
STACK_ID=3464f35f-16b4-44dc-8073-a9cd19533ad5
LAYER_ID=ba04682c-6e32-481d-9d0e-e2fa72b55314
INSTANCE_ID=$(/usr/bin/aws opsworks register --use-instance-profile --infrastructure-class ec2 --region us-east-1 --stack-id $STACK_ID --override-hostname $(tr -cd 'a-z' < /dev/urandom |head -c8) --local 2>&1 |grep -o 'Instance ID: .*' |cut -d' ' -f3)
EC2_INSTANCE_ID=$(/usr/bin/aws opsworks describe-instances --region us-east-1 --instance-ids $INSTANCE_ID | grep -o '"Ec2InstanceId": "i-.*'| grep -o 'i-[a-z0-9]*')
/usr/bin/aws ec2 create-tags --region us-east-1 --resources $EC2_INSTANCE_ID --tags Key=opsworks_stack_id,Value=$STACK_ID
/usr/bin/aws opsworks wait instance-registered --region us-east-1 --instance-id $INSTANCE_ID
/usr/bin/aws opsworks assign-instance --region us-east-1 --instance-id $INSTANCE_ID --layer-ids $LAYER_ID

This script will automatically register your Spot instance on boot with a corresponding OpsWorks stack and layer. Be sure to fill in the following fields:

STACK_ID=YOUR_STACK_ID
LAYER_ID=YOUR_LAYER_ID

4sd

It will take a few minutes for the instance to be provisioned and come online. You’ll see fulfilled displayed in the Status column and active displayed in the State column. After the instance and request are both in an active state, the instance should be fully booted and registered with your OpsWorks stack/layer.

5

You can also view the instance and its online state in the OpsWorks console under Spot Instance.

6

You can manually terminate a Spot instance from the OpsWorks service console. Simply choose the stop button and the Spot instance will be terminated and removed from your stack. Unlike an On-Demand Instance in OpsWorks, when a Spot instance is stopped, it cannot be restarted.

In case your Spot instance is terminated through other means (for example, in the EC2 console), a CloudWatch event will trigger the Lambda function, which will automatically deregister the instance from your OpsWorks stack.

 

Conclusion

You can now use OpsWorks Stacks to define your application’s architecture and software configuration while leveraging the attractive pricing of Spot instances.

Dynamically Scale Applications on Amazon EMR with Auto Scaling

Post Syndicated from Jonathan Fritz original https://aws.amazon.com/blogs/big-data/dynamically-scale-applications-on-amazon-emr-with-auto-scaling/

Jonathan Fritz is a Senior Product Manager for Amazon EMR

Customers running Apache Spark, Presto, and the Apache Hadoop ecosystem take advantage of Amazon EMR’s elasticity to save costs by terminating clusters after workflows are complete and resizing clusters with low-cost Amazon EC2 Spot Instances. For instance, customers can create clusters for daily ETL or machine learning jobs and shut them down when they complete, or scale out a Presto cluster serving BI analysts during business hours for ad hoc, low-latency SQL on Amazon S3.

With new support for Auto Scaling in Amazon EMR releases 4.x and 5.x, customers can now add (scale out) and remove (scale in) nodes on a cluster more easily. Scaling actions are triggered automatically by Amazon CloudWatch metrics provided by EMR at 5 minute intervals, including several YARN metrics related to memory utilization, applications pending, and HDFS utilization.

In EMR release 5.1.0, we introduced two new metrics, YARNMemoryAvailablePercentage and ContainerPendingRatio, which serve as useful cluster utilization metrics for scalable, YARN-based frameworks like Apache Spark, Apache Tez, and Apache Hadoop MapReduce. Additionally, customers can use custom CloudWatch metrics in their Auto Scaling policies.

The following is an example Auto Scaling policy on an instance group that scales 1 instance at a time to a maximum of 40 instances or a minimum of 10 instances. The instance group scales out when the memory available in YARN is less than 15%, and scales in when this metric is greater than 75%: Also, the instance group scales out when the ratio of pending YARN containers over allocated YARN containers is 0.75.

AutoScaling_SocialMedia

Additionally, customers can now configure the scale-down behavior when nodes are terminated from their cluster on EMR 5.1.0. By default, EMR now terminates nodes only during a scale-in event at the instance hour boundary, regardless of when the request was submitted. Because EC2 charges per full hour regardless of when the instance is terminated, this behavior enables applications running on your cluster to use instances in a dynamically scaling environment more cost effectively.

Conversely, customers can set the previous default for EMR releases earlier than 5.1.0, which blacklists and drains tasks from nodes before terminating them, regardless of proximity to the instance hour boundary. With either behavior, EMR removes the least active nodes first and blocks termination if it could lead to HDFS corruption.

You can create or modify Auto Scaling polices using the EMR console, AWS CLI, or the AWS SDKs with the EMR API. To enable Auto Scaling, EMR also requires an additional IAM role to grant permission for Auto Scaling to add and terminate capacity. For more information, see Auto Scaling with Amazon EMR. If you have any questions or would like to share an interesting use case about encryption on EMR, please leave a comment below.


Related

Use Apache Flink on Amazon EMR

o_Flink_2


Real World AWS Scalability

Post Syndicated from Stefano Buliani original https://aws.amazon.com/blogs/compute/real-world-aws-scalability/

This is a guest post from Linda Hedges, Principal SA, High Performance Computing.

—–

One question we often hear is: “How well will my application scale on AWS?” For HPC workloads that cross multiple nodes, the cluster network is at the heart of scalability concerns. AWS uses advanced Ethernet networking technology which, like all things AWS, is designed for scale, security, high availability, and low cost. This network is exceptional and continues to benefit from Amazon’s rapid pace of development. For real world applications, all but the most demanding customers find that their applications run very well on AWS! Many have speculated that highly-coupled workloads require a name-brand network fabric to achieve good performance. For most applications, this is simply not the case. As with all clusters, the devil is in the details and some applications benefit from cluster tuning. This blog discusses the scalability of a representative, real-world application and provides a few performance tips for achieving excellent application performance using STARCCM+ as an example. For more HPC specific information, please see our website.

TLG Aerospace, a Seattle based aerospace engineering services company, runs most of their STARCCM+ Computational Fluid Dynamics (CFD) cases on AWS. A detailed case study describing TLG Aerospace’s experience and the results they achieved can be found here. This blog uses one of their CFD cases as an example to understand AWS scalability. By leveraging Amazon EC2 Spot instances which allow customers to purchase unused capacity at significantly reduced rates, TLG Aerospace consistently achieves an 80% cost savings compared to their previous cloud and on-premise HPC cluster options. TLG Aerospace experiences solid value, terrific scale-up, and effectively limitless case throughput – all with no queue wait!

HPC applications such as Computational Fluid Dynamics (CFD) depend heavily on the application’s ability to efficiently scale compute tasks in parallel across multiple compute resources. Parallel performance is often evaluated by determining an application’s scale-up. Scale-up is a function of the number of processors used and is defined as the time it takes to complete a run on one processor, divided by the time it takes to complete the same run on the number of processors used for the parallel run.

As an example, consider an application with a time to completion, or turn-around time of 32 hours when run on one processor. If the same application runs in one hour when run on 32 processors, then the scale-up is 32 hours of time on one processor / 1 hour time on 32 processors, or equal to 32, for 32 processes. Scaling is considered to be excellent when the scale-up is close to or equal to the number of processors on which the application is run.

If the same application took eight hours to complete on 32 processors, it would have a scale-up of only four: 32 (time on one processor) / 8 (time to complete on 32 processors). A scale-up of four on 32 processors is considered to be poor.

In addition to characterizing the scale-up of an application, scalability can be further characterized as “strong” or “weak” scaling. Note that the term “weak”, as used here, does not mean inadequate or bad but is a technical term facilitating the description of the type of scaling which is sought. Strong scaling offers a traditional view of application scaling, where a problem size is fixed and spread over an increasing number of processors. As more processors are added to the calculation, good strong scaling means that the time to complete the calculation decreases proportionally with increasing processor count.

In comparison, weak scaling does not fix the problem size used in the evaluation, but purposely increases the problem size as the number of processors also increases. The ratio of the problem size to the number of processors on which the case is run is held constant. For a CFD calculation, problem size most often refers to the size of the grid for a similar configuration.

An application demonstrates good weak scaling when the time to complete the calculation remains constant as the ratio of compute effort to the number of processors is held constant. Weak scaling offers insight into how an application behaves with varying case size.

Figure 1: Strong Scaling Demonstrated for a 16M Cell STARCCM+ CFD Calculation

Scale-up as a function of increasing processor count is shown in Figure 1 for the STARCCM+ case data provided by TLG Aerospace. This is a demonstration of “strong” scalability. The blue line shows what ideal or perfect scalability looks like. The purple triangles show the actual scale-up for the case as a function of increasing processor count. Excellent scaling is seen to well over 400 processors for this modest-sized 16M cell case as evidenced by the closeness of these two curves. This example was run on Amazon EC2 c3.8xlarge instances, each an Intel E5-2680, providing either 16 cores or 32 hyper-threaded processors.

AWS customers can choose to run their applications on either threads or cores. AWS provides access to the underlying hardware of our servers. For an application like STARCCM+, excellent linear scaling can be seen when using either threads or cores though testing of a specific case and application is always recommended. For this example, threads were chosen as the processing basis. Running on threads offered a few percent performance improvement when compared to running the same case on cores. Note that the number of available cores is equal to half of the number of available threads.

The scalability of real-world problems is directly related to the ratio of the compute-effort per-core to the time required to exchange data across the network. The grid size of a CFD case provides a strong indication of how much computational effort is required for a solution. Thus, larger cases will scale to even greater processor counts than for the modest size case discussed here.

Figure 2: Scale-up and Efficiency as a Function of Cells per Processor

STARCCM+ has been shown to demonstrate exceptional “weak” scaling on AWS. That’s not shown here though weak scaling is reflected in Figure 2 by plotting the cells per processor on the horizontal axis. The purple line in Figure 2 shows scale-up as a function of grid cells per processor. The vertical axis for scale-up is on the left-hand side of the graph as indicated by the purple arrow. The green line in Figure 2 shows efficiency as a function of grid cells per processor. The vertical axis for efficiency is shown on the right hand side of the graph and is indicated with a green arrow. Efficiency is defined as the scale-up divided by the number of processors used in the calculation.

Weak scaling is evidenced by considering the number of grid cells per processor as a measure of compute effort. Holding the grid cells per processor constant while increasing total case size demonstrates weak scaling. Weak scaling is not shown here because only one CFD case is used. Fewer grid cells per processor means reduced computational effort per processor. Maintaining efficiency while reducing cells per processor demonstrates the excellent strong scalability of STARCCM+ on AWS.

Efficiency remains at about 100% between approximately 250,000 grid cells per thread (or processor) and 100,000 grid cells per thread. Efficiency starts to fall off at about 100,000 grid cells per thread. An efficiency of at least 80% is maintained until 25,000 grid cells per thread. Decreasing grid cells per processor leads to decreased efficiency because the total computational effort per processor is reduced. Note that the perceived ability to achieve more than 100% efficiency (here, at about 150,000 cells per thread) is common in scaling studies, is case specific, and often related to smaller effects such as timing variation and memory caching.

Figure 3: Cost for per Run Based on Spot Pricing ($0.35 per hour for c3.8xlarge) as a function of Turn-around Time

Plots of scale-up and efficiency offer understanding of how a case or application scales. The bottom line though is that what really matters to most HPC users is case turn-around time and cost. A plot of turn-around time versus CPU cost for this case is shown in Figure 3. As the number of threads are increased, the total turn-around time decreases. But also, as the number of threads increase, the inefficiency increases. Increasing inefficiency leads to increased cost. The cost shown is based on typical Amazon Spot price for the c3.8xlarge and only includes the computational costs. Small costs will also be incurred for data storage.

Minimum cost and turn-around time were achieved with approximately 100,000 cells per thread. Many users will choose a cell count per thread to achieve the lowest possible cost. Others, may choose a cell count per thread to achieve the fastest turn-around time. If a run is desired in 1/3rd the time of the lowest price point, it can be achieved with approximately 25,000 cells per thread. (Note that many users run STARCCM+ with significantly fewer cells per thread than this.) While this increases the compute cost, other concerns, such as license costs or schedules can be over-riding factors. For this 16M cell case, the added inefficiency results in an increase in run price from $3 to $4 for computing. Many find the reduced turn-around time well worth the price of the additional instances.

As with any cluster, good performance requires attention to the details of the cluster set up. While AWS allows for the quick set up and take down of clusters, performance is affected by many of the specifics in that set up. This blog provides some examples.

On AWS, a placement group is a grouping of instances within a single Availability Zone that allow for low latency between the instances. Placement groups are recommended for all applications where low latency is a requirement. A placement group was used to achieve the best performance from STARCCM+. More on placement groups can be found in our docs.

Amazon Linux is a version of Linux maintained by Amazon. The distribution evolved from Red Hat Linux (RHEL) and is designed to provide a stable, secure, and highly performant environment. Amazon Linux is optimized to run on AWS and offers excellent performance for running HPC applications. For the case presented here, the operating system used was Amazon Linux. Other Linux distributions are also performant. However, it is strongly recommended that for Linux HPC applications, a minimum of the version 3.10 Linux kernel be used to be sure of using the latest Xen libraries. See our Amazon Linux page to learn more.

Amazon Elastic Block Store (EBS) is a persistent block level storage device often used for cluster storage on AWS. EBS provides reliable block level storage volumes that can be attached (and removed) from an Amazon EC2 instance. A standard EBS general purpose SSD (gp2) volume is all that is required to meet the needs of STARCCM+ and was used here. Other HPC applications may require faster I/O to prevent data writes from being a bottle neck to turn-around speed. For these applications, other storage options exist. A guide to amazon storage is found here.

As mentioned previously, STARCCM+, like many other CFD solvers, runs well on both threads and cores. Hyper-Threading can improve the performance of some MPI applications depending on the application, the case, and the size of the workload allocated to each thread; it may also slow performance. The one-size-fits-all nature of the static cluster compute environments means that most HPC clusters disable hyper-threading. To first order, it is believed that computationally intensive workloads run best on cores while those that are I/O bound may run best on threads. Again, a few percent increase in performance was discovered for this case, by running with threads. If there is no time to evaluate the effect of hyper-threading on case performance than it is recommended that hyper-threading be disabled. When hyper-threading is disabled, it is important to bind the core to designated CPU. This is called processor or CPU affinity. It almost universally improves performance over unpinned cores for computationally intensive workloads.

Occasionally, an application will include frequent time measurement in their code; perhaps this is done for performance tuning. Under these circumstances, performance can be improved by setting the clock source to the TSC (Time Stamp Counter). This tuning was not required for this application but is mentioned here for completeness.

When evaluating an application, it is always recommended that a meaningful real-world case be used. A case that is too big or too small won’t reflect the performance and scalability achievable in every day operation. The only way to know positively how an application will perform on AWS is to try it!

AWS offers solid strong scaling and exceptional weak scaling. Good performance can be achieved on AWS, for most applications. In addition to low cost and quick turn-around time, important considerations for HPC also include throughput and availability. AWS offers effectively limitless throughput, security, cost-savings, and high-availability making queues a “thing of the past”. A long queue wait makes for a very long case turn-around time, regardless of the scale.